Tokenize and noun-phrase extraction #119
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
- Coverage    42.2%   41.19%    -1.01%
==========================================
  Files           9       11        +2
  Lines         699      852      +153
==========================================
+ Hits          295      351       +56
- Misses        404      501       +97
```

Continue to review the full report at Codecov.
- Rename `type` to `output`
- Some linting
Many of these are not working, but that's part of test-driven development.
We have some interesting functionality issues with the following:

```r
> txt <- "One space two spaces one\ttab\t\ttwo one\nnewline\n\ntwo."
> spacy_tokenize(txt, remove_separators = FALSE)
$t
[1] "One" " " "space" " " " " "two" " " "spaces" " "
[10] "one" "\t" "tab" "\t\t" "two" " " "one" "\n" "newline"
[19] "\n\n" "two" "."
> spacy_tokenize(txt, remove_separators = TRUE)
$t
[1] "One" "space" " " "two" "spaces" "one" "\t" "tab" "\t\t"
[10] "two" "one" "\n" "newline" "\n\n" "two" "."
> quanteda::tokens(txt, remove_separators = FALSE)
tokens from 1 document.
text1 :
[1] "One" " " "space" " " " " "two" " " "spaces" " "
[10] "one" "\t" "tab" "\t" "\t" "two" " " "one" "\n"
[19] "newline" "\n" "\n" "two" "."
> quanteda::tokens(txt, remove_separators = TRUE)
tokens from 1 document.
text1 :
[1] "One" "space" "two" "spaces" "one" "tab" "two" "one" "newline"
[10] "two" "." |
See the tests I added. They are breaking but we should work on the code until they pass. If we decide the tests are inappropriate, we should discuss that before changing them.
Other changes: I fixed a bug in the padding condition in the Python code, and renamed the separator argument.
What are the merits of adding arguments to match the quanteda behaviour of `remove_hyphens`, `remove_twitter`, and `remove_symbols`? (The last we could easily do on the final R side.)

`remove_twitter` in particular behaves very differently for the spacyr version:

```r
> spacy_tokenize("I am @kenbenoit on Twitter #quanteda.")
$t
[1] "I" "am" "@kenbenoit" "on" "Twitter" "#"
[7] "quanteda" "."
> spacy_tokenize("I am @kenbenoit on Twitter #quanteda.", remove_punct = TRUE)
$t
[1] "I" "am" "@kenbenoit" "on" "Twitter" "quanteda"
`remove_hyphens` comparison:

```r
> txt <- "Jacob Rees-Mogg is a floccinaucinihilipilificator"
> spacy_tokenize(txt)
$t
[1] "Jacob" "Rees" "-" "Mogg" "is" "a" "floccinaucinihilipilificator"
> tokens(txt, remove_hyphens = TRUE)
tokens from 1 document.
text1 :
[1] "Jacob" "Rees" "-" "Mogg" "is" "a" "floccinaucinihilipilificator"
> tokens(txt, remove_hyphens = FALSE)
tokens from 1 document.
text1 :
[1] "Jacob" "Rees-Mogg" "is" "a" "floccinaucinihilipilificator"
@kbenoit At the moment, the following two lines:

Returns:

That's different from the test expectation. Do you think we should remove characters that may or may not be counted as punctuation, such as "£"?
Merge remote-tracking branch 'origin/master' into tokenize-function

# Conflicts:
#	inst/python/initialize_spacyPython.py
There is a discrepancy between what spaCy considers to be a SYM and the Unicode category classification:

```r
> spacy_tokenize("Contains symbols £ ±", remove_symbols = TRUE)
$text1
[1] "Contains" "symbols" "±"
> spacy_parse("Contains symbols £ ±")
doc_id sentence_id token_id token lemma pos entity
1 text1 1 1 Contains contain NOUN
2 text1 1 2 symbols symbol VERB
3 text1 1 3 £ £ SYM
4 text1 1 4 ± ± NOUN MONEY_B
> stringi::stri_detect_charclass(c("£", "±"), "\\p{S}")
[1] TRUE TRUE
```
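If we want `remove_symbols` to follow the Unicode classification instead of spaCy's SYM tag, we could filter on the final R side, as suggested above. A minimal sketch, assuming the tokens come back as a character vector:

```r
# Hypothetical R-side alternative: drop tokens made up entirely of
# Unicode symbol characters (category \p{S}), which catches both £ and ±.
library(stringi)

toks <- c("Contains", "symbols", "£", "±")
toks[!stri_detect_regex(toks, "^\\p{S}+$")]
## [1] "Contains" "symbols"
```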
`R/spacy_parse.R` (outdated):

```diff
@@ -46,6 +47,7 @@ spacy_parse <- function(x,
                          lemma = TRUE,
                          entity = TRUE,
                          dependency = FALSE,
+                         noun_phrase = FALSE,
```
Let's remove the underscore, since none of the other arguments have it. So just `nounphrase`.
I implemented the first version. @kbenoit, what do you think?

Code:

```r
library(spacyr)
txt <- c(doc1 = "Natural Language Processing is a branch of computer science that employs various Artificial Intelligence (AI) techniques to process content written in natural language. NLP-enhanced wikis can support users in finding, developing and organizing knowledge contained inside the wiki repository. ",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_parse(txt, noun_phrase = TRUE)
```

Output:

```
doc_id sentence_id token_id token lemma pos entity
1 doc1 1 1 Natural natural PROPN ORG_B
2 doc1 1 2 Language language PROPN ORG_I
3 doc1 1 3 Processing processing PROPN ORG_I
4 doc1 1 4 is be VERB
5 doc1 1 5 a a DET
6 doc1 1 6 branch branch NOUN
7 doc1 1 7 of of ADP
8 doc1 1 8 computer computer NOUN
9 doc1 1 9 science science NOUN
10 doc1 1 10 that that ADJ
11 doc1 1 11 employs employ VERB
12 doc1 1 12 various various ADJ
13 doc1 1 13 Artificial artificial PROPN ORG_B
14 doc1 1 14 Intelligence intelligence PROPN ORG_I
15 doc1 1 15 ( ( PUNCT ORG_I
16 doc1 1 16 AI ai PROPN ORG_I
17 doc1 1 17 ) ) PUNCT
18 doc1 1 18 techniques technique NOUN
19 doc1 1 19 to to PART
20 doc1 1 20 process process VERB
21 doc1 1 21 content content NOUN
22 doc1 1 22 written write VERB
23 doc1 1 23 in in ADP
24 doc1 1 24 natural natural ADJ
25 doc1 1 25 language language NOUN
26 doc1 1 26 . . PUNCT
27 doc1 2 1 NLP nlp PROPN ORG_B
28 doc1 2 2 - - PUNCT
29 doc1 2 3 enhanced enhance VERB
30 doc1 2 4 wikis wiki NOUN
31 doc1 2 5 can can VERB
32 doc1 2 6 support support VERB
33 doc1 2 7 users user NOUN
34 doc1 2 8 in in ADP
35 doc1 2 9 finding find VERB
36 doc1 2 10 , , PUNCT
37 doc1 2 11 developing develop VERB
38 doc1 2 12 and and CCONJ
39 doc1 2 13 organizing organize VERB
40 doc1 2 14 knowledge knowledge NOUN
41 doc1 2 15 contained contain VERB
42 doc1 2 16 inside inside ADP
43 doc1 2 17 the the DET
44 doc1 2 18 wiki wiki NOUN
45 doc1 2 19 repository repository NOUN
46 doc1 2 20 . . PUNCT
47 doc2 1 1 Paul paul PROPN ORG_B
48 doc2 1 2 earned earn VERB
49 doc2 1 3 a a DET
50 doc2 1 4 postgraduate postgraduate NOUN
51 doc2 1 5 degree degree NOUN
52 doc2 1 6 from from ADP
53 doc2 1 7 MIT mit PROPN ORG_B
54 doc2 1 8 . . PUNCT
noun_phrase noun_phrase_root_text
1 Natural Language Processing Processing
2 Natural Language Processing Processing
3 Natural Language Processing Processing
4 <NA> <NA>
5 a branch branch
6 a branch branch
7 <NA> <NA>
8 computer science science
9 computer science science
10 <NA> <NA>
11 <NA> <NA>
12 various Artificial Intelligence (AI) techniques techniques
13 various Artificial Intelligence (AI) techniques techniques
14 various Artificial Intelligence (AI) techniques techniques
15 various Artificial Intelligence (AI) techniques techniques
16 various Artificial Intelligence (AI) techniques techniques
17 various Artificial Intelligence (AI) techniques techniques
18 various Artificial Intelligence (AI) techniques techniques
19 <NA> <NA>
20 <NA> <NA>
21 content content
22 <NA> <NA>
23 <NA> <NA>
24 natural language language
25 natural language language
26 <NA> <NA>
27 NLP-enhanced wikis wikis
28 NLP-enhanced wikis wikis
29 NLP-enhanced wikis wikis
30 NLP-enhanced wikis wikis
31 <NA> <NA>
32 <NA> <NA>
33 users users
34 <NA> <NA>
35 <NA> <NA>
36 <NA> <NA>
37 <NA> <NA>
38 <NA> <NA>
39 <NA> <NA>
40 knowledge knowledge
41 <NA> <NA>
42 <NA> <NA>
43 the wiki repository repository
44 the wiki repository repository
45 the wiki repository repository
46 <NA> <NA>
47 Paul Paul
48 <NA> <NA>
49 a postgraduate degree degree
50 a postgraduate degree degree
51 a postgraduate degree degree
52 <NA> <NA>
53 MIT MIT
54 <NA> <NA>
noun_phrase_length start_token_id root_token_id
1 3 1 3
2 3 1 3
3 3 1 3
4 NA NA NA
5 2 5 6
6 2 5 6
7 NA NA NA
8 2 8 9
9 2 8 9
10 NA NA NA
11 NA NA NA
12 7 12 18
13 7 12 18
14 7 12 18
15 7 12 18
16 7 12 18
17 7 12 18
18 7 12 18
19 NA NA NA
20 NA NA NA
21 1 21 21
22 NA NA NA
23 NA NA NA
24 2 24 25
25 2 24 25
26 NA NA NA
27 4 1 4
28 4 1 4
29 4 1 4
30 4 1 4
31 NA NA NA
32 NA NA NA
33 1 7 7
34 NA NA NA
35 NA NA NA
36 NA NA NA
37 NA NA NA
38 NA NA NA
39 NA NA NA
40 1 14 14
41 NA NA NA
42 NA NA NA
43 3 17 19
44 3 17 19
45 3 17 19
46 NA NA NA
47 1 1 1
48 NA NA NA
49 3 3 5
50 3 3 5
51 3 3 5
52 NA NA NA
53 1 7 7
54 NA NA NA
```
I think it should operate just as entity does, by marking the start and end of the noun phrase, and then using an extract or consolidate function to extract or combine them. The problem with the format above is that it repeats the noun phrases across their components. So with entity:

```r
txt3 <- "We analyzed the Supreme Court using natural language processing."
spacy_parse(txt3, entity = TRUE, nounphrase = FALSE)
# doc_id sentence_id token_id token lemma pos entity
# 1 text1 1 1 We -PRON- PRON
# 2 text1 1 2 analyzed analyze VERB
# 3 text1 1 3 the the DET ORG_B
# 4 text1 1 4 Supreme supreme PROPN ORG_I
# 5 text1 1 5 Court court PROPN ORG_I
# 6 text1 1 6 using use VERB
# 7 text1 1 7 natural natural ADJ
# 8 text1 1 8 language language NOUN
# 9 text1 1 9 processing processing NOUN
# 10 text1 1 10 . . PUNCT
```

I think `spacy_parse(txt3, entity = FALSE, nounphrase = TRUE)` should produce:

```r
# doc_id sentence_id token_id token lemma pos nounphrase
# 1 text1 1 1 We -PRON- PRON
# 2 text1 1 2 analyzed analyze VERB
# 3 text1 1 3 the the DET np_beg
# 4 text1 1 4 Supreme supreme PROPN np_mid
# 5 text1 1 5 Court court PROPN np_end
# 6 text1 1 6 using use VERB
# 7 text1 1 7 natural natural ADJ
# 8 text1 1 8 language language NOUN
# 9 text1 1 9 processing processing NOUN
# 10 text1 1 10 . . PUNCT
```

Then we use code similar to the entity `extract` and `consolidate` functions, so that

```r
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE) %>%
    entity_consolidate()
```

would remove the nounphrase column altogether (and vice-versa).
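For reference, a rough sketch of what the extraction step over those markers could look like (a hypothetical helper, not the PR's implementation; it naively joins tokens with single spaces, which is relevant to the spacing issue raised below):

```r
# Hypothetical extraction over the proposed marker scheme: each contiguous
# run of non-NA nounphrase markers is one phrase; tokens within a run are
# joined with single spaces (losing the original spacing).
extract_nounphrases <- function(parsed) {
    in_np <- !is.na(parsed$nounphrase)
    # increment a group id at the start of each run of noun-phrase tokens
    grp <- cumsum(in_np & !c(FALSE, head(in_np, -1)))
    unname(tapply(parsed$token[in_np], grp[in_np], paste, collapse = " "))
}

parsed <- data.frame(
    token      = c("We", "analyzed", "the", "Supreme", "Court", "."),
    nounphrase = c(NA, NA, "beg", "mid", "end_root", NA)
)
extract_nounphrases(parsed)
## [1] "the Supreme Court"
```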
Need to add some tests as well...
This makes it more consistent with our (non) use of underscores in other parse arguments.
I understand that we want to have consistency. However, in that case, we have to sacrifice the reconstructability of the original noun phrases. For instance, in the example above, "various Artificial Intelligence (AI) techniques" is one of the noun phrases, and concatenating its tokens with spaces will insert spaces after/before the parentheses.

One possibility is to include a flag for trailing spaces as a field. This will mean a loop over tokens has to be run. Maybe worth it, though.
Also, the information about the root token is important for some purposes, so the desirable output might be:

```r
spacy_parse(txt3, entity = FALSE, nounphrase = TRUE)
# doc_id sentence_id token_id token lemma pos nounphrase
# 1 text1 1 1 We -PRON- PRON
# 2 text1 1 2 analyzed analyze VERB
# 3 text1 1 3 the the DET beg
# 4 text1 1 4 Supreme supreme PROPN mid
# 5 text1 1 5 Court court PROPN end_root
# 6 text1 1 6 using use VERB
# 7 text1 1 7 natural natural ADJ
# 8 text1 1 8 language language NOUN
# 9 text1 1 9 processing processing NOUN
# 10 text1 1 10 . . PUNCT
```
I implemented a new version of this option. Output:

```r
txt <- c(doc1 = "Natural Language Processing is a branch of computer science that employs various Artificial Intelligence (AI) techniques to process content written in natural language. NLP-enhanced wikis can support users in finding, developing and organizing knowledge contained inside the wiki repository. ",
         doc2 = "Paul earned a postgraduate degree from MIT.")
(spacy_parse(txt, nounphrase = TRUE))
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.0.16, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
## doc_id sentence_id token_id token lemma pos entity
## 1 doc1 1 1 Natural natural PROPN ORG_B
## 2 doc1 1 2 Language language PROPN ORG_I
## 3 doc1 1 3 Processing processing PROPN ORG_I
## 4 doc1 1 4 is be VERB
## 5 doc1 1 5 a a DET
## 6 doc1 1 6 branch branch NOUN
## 7 doc1 1 7 of of ADP
## 8 doc1 1 8 computer computer NOUN
## 9 doc1 1 9 science science NOUN
## 10 doc1 1 10 that that ADJ
## 11 doc1 1 11 employs employ VERB
## 12 doc1 1 12 various various ADJ
## 13 doc1 1 13 Artificial artificial PROPN ORG_B
## 14 doc1 1 14 Intelligence intelligence PROPN ORG_I
## 15 doc1 1 15 ( ( PUNCT ORG_I
## 16 doc1 1 16 AI ai PROPN ORG_I
## 17 doc1 1 17 ) ) PUNCT
## 18 doc1 1 18 techniques technique NOUN
## 19 doc1 1 19 to to PART
## 20 doc1 1 20 process process VERB
## 21 doc1 1 21 content content NOUN
## 22 doc1 1 22 written write VERB
## 23 doc1 1 23 in in ADP
## 24 doc1 1 24 natural natural ADJ
## 25 doc1 1 25 language language NOUN
## 26 doc1 1 26 . . PUNCT
## 27 doc1 2 1 NLP nlp PROPN ORG_B
## 28 doc1 2 2 - - PUNCT
## 29 doc1 2 3 enhanced enhance VERB
## 30 doc1 2 4 wikis wiki NOUN
## 31 doc1 2 5 can can VERB
## 32 doc1 2 6 support support VERB
## 33 doc1 2 7 users user NOUN
## 34 doc1 2 8 in in ADP
## 35 doc1 2 9 finding find VERB
## 36 doc1 2 10 , , PUNCT
## 37 doc1 2 11 developing develop VERB
## 38 doc1 2 12 and and CCONJ
## 39 doc1 2 13 organizing organize VERB
## 40 doc1 2 14 knowledge knowledge NOUN
## 41 doc1 2 15 contained contain VERB
## 42 doc1 2 16 inside inside ADP
## 43 doc1 2 17 the the DET
## 44 doc1 2 18 wiki wiki NOUN
## 45 doc1 2 19 repository repository NOUN
## 46 doc1 2 20 . . PUNCT
## 47 doc2 1 1 Paul paul PROPN ORG_B
## 48 doc2 1 2 earned earn VERB
## 49 doc2 1 3 a a DET
## 50 doc2 1 4 postgraduate postgraduate NOUN
## 51 doc2 1 5 degree degree NOUN
## 52 doc2 1 6 from from ADP
## 53 doc2 1 7 MIT mit PROPN ORG_B
## 54 doc2 1 8 . . PUNCT
## nounphrase whitespace
## 1 beg TRUE
## 2 mid TRUE
## 3 end_root TRUE
## 4 <NA> TRUE
## 5 beg TRUE
## 6 end_root TRUE
## 7 <NA> TRUE
## 8 beg TRUE
## 9 end_root TRUE
## 10 <NA> TRUE
## 11 <NA> TRUE
## 12 beg TRUE
## 13 mid TRUE
## 14 mid TRUE
## 15 mid FALSE
## 16 mid FALSE
## 17 mid TRUE
## 18 end_root TRUE
## 19 <NA> TRUE
## 20 <NA> TRUE
## 21 beg_root TRUE
## 22 <NA> TRUE
## 23 <NA> TRUE
## 24 beg TRUE
## 25 end_root FALSE
## 26 <NA> TRUE
## 27 beg FALSE
## 28 mid FALSE
## 29 mid TRUE
## 30 end_root TRUE
## 31 <NA> TRUE
## 32 <NA> TRUE
## 33 beg_root TRUE
## 34 <NA> TRUE
## 35 <NA> FALSE
## 36 <NA> TRUE
## 37 <NA> TRUE
## 38 <NA> TRUE
## 39 <NA> TRUE
## 40 beg_root TRUE
## 41 <NA> TRUE
## 42 <NA> TRUE
## 43 beg TRUE
## 44 mid TRUE
## 45 end_root FALSE
## 46 <NA> TRUE
## 47 beg_root TRUE
## 48 <NA> TRUE
## 49 beg TRUE
## 50 mid TRUE
## 51 end_root TRUE
## 52 <NA> TRUE
## 53 beg_root FALSE
## 54 <NA> FALSE
```
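With the `whitespace` column, the original surface form of a phrase can be reconstructed exactly. A minimal sketch (a hypothetical helper, assuming a data.frame with `token` and `whitespace` columns for a single phrase):

```r
# Hypothetical reconstruction: append a space only after tokens whose
# whitespace flag is TRUE, then trim the trailing separator.
reconstruct_phrase <- function(tok) {
    trimws(paste0(tok$token, ifelse(tok$whitespace, " ", ""), collapse = ""))
}

# tokens 12-18 of doc1 above
np <- data.frame(
    token      = c("various", "Artificial", "Intelligence", "(", "AI", ")", "techniques"),
    whitespace = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)
reconstruct_phrase(np)
## [1] "various Artificial Intelligence (AI) techniques"
```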
Moved to #134 (comment).
This PR includes the implementation of two new functions discussed in #109 and #117:

- `spacy_tokenize` for tokenizing documents either to a list or to a data.frame
- `spacy_extract_nounphrase` for noun-phrase extraction

@kbenoit We need the following before merging: