Rauh dictionary wrong negated forms #24

Astelix · 2019-03-25T07:35:59Z

In the (German) "data_dictionary_Rauh", the negated forms should be "nicht ..." instead of "not ...". For nominalized forms ending in "...ung", it should be "keine".

kbenoit · 2019-03-25T07:45:03Z

Thanks @Astelix! @stefan-mueller want to verify and fix?

stefan-mueller · 2019-03-25T08:58:31Z

Thanks! I am aware of this, but the original dictionary indicates negations through "not" in the categories neg_negative and neg_positive. Thus, changing the forms to "nicht" or "keine" would also imply changing the entries in the original dictionary. Otherwise, negations will not be detected. I am not sure whether we should touch the dictionary entries. What do you think?

library(quanteda.dictionaries)

head(data_dictionary_Rauh$neg_positive, 15)
#>  [1] "not aalen"             "not abbauwürdig"      
#>  [3] "not abfangschirm"      "not abgefahren"       
#>  [5] "not abgeheilt"         "not abgehend"         
#>  [7] "not abgeklärtheit"     "not abgelagert"       
#>  [9] "not abgemacht"         "not abgeschlossenheit"
#> [11] "not abgesichert"       "not abgestimmt"       
#> [13] "not abgeworben"        "not abgleich"         
#> [15] "not abgleichen"

Astelix · 2019-03-25T09:21:07Z

From the original dictionary:

                                       pattern      replacement          feature

kbenoit · 2019-03-27T06:51:45Z

It would work as in the original dictionary if it were structured as a regular expression dictionary. Unlike glob patterns, a regex would permit us to prefix each positive word with the possible negation words.

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
dicttest <- dictionary(list(neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$")))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

tokens(txt) %>%
  tokens_lookup(dictionary = dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "abarbeiten"   "und"          "NEG_POSITIVE"

We don't currently have a valuetype set in the dictionary object class, but we do have an open issue for it (#1264). This would be a good argument for adding that attribute, so that the lookup functions would use it as the default rather than "glob". That would let us make sure that every dictionary is associated with the correct pattern matching type (valuetype).
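To make the proposal concrete, here is a minimal base-R sketch of a dictionary that carries its own valuetype, with a lookup that reads it as the default. The names `make_dict` and `count_hits` are hypothetical and are not part of quanteda; this only illustrates the idea and says nothing about how #1264 would actually be implemented.

```r
# Hypothetical toy dictionary: a named list with a "valuetype" attribute.
# This is NOT quanteda's dictionary class, only an illustration.
make_dict <- function(entries, valuetype = c("glob", "regex")) {
  structure(entries, valuetype = match.arg(valuetype))
}

# Hypothetical lookup: counts matching tokens per dictionary key.
# The default valuetype comes from the dictionary itself, not from "glob".
count_hits <- function(tokens, dict, valuetype = attr(dict, "valuetype")) {
  sapply(dict, function(patterns) {
    # Translate glob patterns to regular expressions before matching.
    if (valuetype == "glob") patterns <- utils::glob2rx(patterns)
    hit <- Reduce(`|`, lapply(patterns, grepl, x = tokens))
    sum(hit)
  })
}

toks <- c("etwas", "abarbeiten", "und", "abarbeitet")

regex_dict <- make_dict(list(positive = "^abarbeit"), valuetype = "regex")
glob_dict  <- make_dict(list(positive = "abarbeit*"), valuetype = "glob")

regex_hits <- count_hits(toks, regex_dict)
glob_hits  <- count_hits(toks, glob_dict)
```

Because the default is read from the object, a caller never has to remember which matching type a given dictionary expects, while an explicit `valuetype` argument can still override it.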

stefan-mueller · 2019-03-27T08:56:57Z

That would be a very elegant solution. I just asked Christian Rauh what he thinks about this idea.

ChRauh · 2019-03-27T10:23:24Z

Great to see interest in the dictionary and thanks again for including it into your fantastic package!

On the issue: The dictionary is structured such that it matches valuetype = "regex". Thus (and also more generally), I'd consider adding a valuetype attribute to the dictionary object class very convenient from the user perspective.

Note, however, that I would still suggest first replacing the negation patterns in the original text with a compound marker such as "NOT_[token]" (maybe via tokens_replace()) before retrieving the dictionary counts via tokens_lookup() or dfm(). This makes a difference when aggregating the counts to some sentiment score.

For example, directly counting dictionary terms in the string 'nicht abarbeiten' would retrieve one negative and one negated negative hit. Yet having this replaced with 'NOT_abarbeiten' beforehand would retrieve only the negated negative hit.
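The effect of replacing negations first can be sketched in base R without any quanteda calls. The helper `mark_negations` is hypothetical, the negation list is copied from the regex used earlier in this thread, and the sketch assumes lowercase, whitespace-separated text.

```r
# Negation words taken from the regex pattern used earlier in the thread.
negations <- c("nicht", "nichts", "kein", "keine", "keinen")

# Hypothetical helper: fuse "<negation> <word>" into one "NOT_<word>" token.
mark_negations <- function(x) {
  pattern <- paste0("\\b(", paste(negations, collapse = "|"), ")\\s+(\\S+)")
  gsub(pattern, "NOT_\\2", x, perl = TRUE)
}

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

marked <- mark_negations(txt)
toks <- strsplit(marked, "\\s+")

# After marking, a negated term can no longer also be counted as a plain term.
plain_hits   <- sapply(toks, function(t) sum(t == "abarbeiten"))
negated_hits <- sapply(toks, function(t) sum(t == "NOT_abarbeiten"))
```

Counting on `txt` directly would find "abarbeiten" twice in each text, whereas counting on `marked` keeps the plain and negated occurrences separate.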

Hope this helps...

kbenoit · 2019-03-27T10:41:02Z

Thanks @ChRauh that's a good point. Could be done in two stages:

dicttest <-
  dictionary(list(
    neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
    positive = "^abarbeiten$"
  ))

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

dfm(txt, dictionary = dicttest, valuetype = "regex")
## Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##        features
## docs    neg_positive positive
##   text1            2        2
##   text2            1        2

tokens(txt) %>%
  tokens_lookup(dicttest["neg_positive"], valuetype = "regex", exclusive = FALSE) %>%
  tokens_lookup(dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas"        "NEG_POSITIVE" "und"          "etwas"       
## [5] "NEG_POSITIVE"
## 
## text2 :
## [1] "etwas"        "POSITIVE"     "und"          "NEG_POSITIVE"

ChRauh · 2019-03-27T11:19:36Z

@kbenoit Yes, 'piping' it in that order does the trick. Learned something, thanks!
This might also be a useful example for the help file, in which @stefan-mueller has already flagged the replacement issue.
