Rauh dictionary wrong negated forms #24
Thanks @Astelix! @stefan-mueller, want to verify and fix?
Thanks! I am aware of this, but the original dictionary indicates negations through "not" in the categories:
library(quanteda.dictionaries)
head(data_dictionary_Rauh$neg_positive, 15)
#> [1] "not aalen" "not abbauwürdig"
#> [3] "not abfangschirm" "not abgefahren"
#> [5] "not abgeheilt" "not abgehend"
#> [7] "not abgeklärtheit" "not abgelagert"
#> [9] "not abgemacht" "not abgeschlossenheit"
#> [11] "not abgesichert" "not abgestimmt"
#> [13] "not abgeworben" "not abgleich"
#> [15] "not abgleichen"
From the original dictionary: (table excerpt not preserved; only the column header "sentiment" survives)
It would work as in the original dictionary if it were structured as a regular expression dictionary. Unlike glob patterns, regex would let us prefix each positive word with the possible negations.
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
dicttest <- dictionary(list(neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$")))
txt <- c(
"etwas nicht abarbeiten und etwas keine abarbeiten",
"etwas abarbeiten und keinen abarbeiten"
)
tokens(txt) %>%
tokens_lookup(dictionary = dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas" "NEG_POSITIVE" "und" "etwas"
## [5] "NEG_POSITIVE"
##
## text2 :
## [1] "etwas" "abarbeiten" "und" "NEG_POSITIVE"
We don't currently have a
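The idea of prefixing each positive word with the negation alternatives could be applied to the whole category programmatically. Below is a minimal base R sketch, assuming the entries look like the "not ..." strings shown earlier in this thread. The names `neg_entries`, `negations`, and `neg_patterns` are made up for illustration, and the list of negation words is just the one from the regex above, not a vetted list.

```r
# Illustrative subset of the "not ..." entries shown earlier in the thread.
neg_entries <- c("not aalen", "not abgefahren", "not abgleich")

# Assumption: the same negation words as in the regex example above.
negations <- c("nicht", "nichts", "kein", "keine", "keinen")

# One anchored alternation that matches exactly one negation token.
neg_prefix <- paste0("^(", paste(negations, collapse = "|"), ")$")

# Strip the English "not " marker, then anchor each word as its own token pattern.
# Note: words containing regex metacharacters would need escaping first.
words <- sub("^not ", "", neg_entries)
neg_patterns <- paste0(neg_prefix, " ^", words, "$")

print(neg_patterns)
```

The resulting character vector has the same two-token shape as the hand-written pattern above, so it could presumably be passed to `dictionary(list(neg_positive = neg_patterns))` and used with `valuetype = "regex"`.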
That would be a very elegant solution. I just asked Christian Rauh what he thinks about this idea.
Great to see interest in the dictionary and thanks again for including it in your fantastic package!

On the issue: The dictionary is structured such that it matches [...]

Note, however, that I would still suggest first replacing the negation patterns in the original text with a compound marker such as "NOT_[token]" (maybe via [...]).

For example, directly counting dictionary terms in the string 'nicht abarbeiten' would retrieve one negative and one negated negative hit. Yet having this replaced with 'NOT_abarbeiten' beforehand would retrieve only the negated negative hit.

Hope this helps...
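To make the double-counting point concrete, here is a rough base R sketch of the "replace first, then count" idea. It is only an illustration under assumptions: `mark_negations()` and `count_hits()` are hypothetical helpers written for this example, `gsub()` stands in for whatever tool is actually used for the replacement, and the negation list is the one from the regex earlier in the thread.

```r
# Assumption: same negation words as in the regex example in this thread.
negations <- c("nicht", "nichts", "kein", "keine", "keinen")

txt <- c(
  "etwas nicht abarbeiten und etwas keine abarbeiten",
  "etwas abarbeiten und keinen abarbeiten"
)

# Hypothetical helper: fuse a negation word and the following token into "NOT_<token>".
mark_negations <- function(x, negations) {
  pattern <- paste0("\\b(", paste(negations, collapse = "|"), ")\\s+(\\S+)")
  gsub(pattern, "NOT_\\2", x, perl = TRUE, ignore.case = TRUE)
}

# Hypothetical helper: number of regex matches per document.
count_hits <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE)
  vapply(m, function(hit) sum(hit > 0), integer(1))
}

marked <- mark_negations(txt, negations)

# Counting on the raw text also picks up the word inside negated phrases.
raw_plain <- count_hits("\\babarbeiten\\b", txt)

# After marking, "_" is a word character, so \\b no longer matches inside "NOT_abarbeiten".
marked_plain   <- count_hits("\\babarbeiten\\b", marked)
marked_negated <- count_hits("\\bNOT_abarbeiten\\b", marked)

print(marked)
print(rbind(raw_plain, marked_plain, marked_negated))
```

On the two example texts, the raw count of the plain word includes the negated occurrences, while the counts on the marked text keep plain and negated hits apart.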
Thanks @ChRauh, that's a good point. It could be done in two stages:
dicttest <-
dictionary(list(
neg_positive = c("^(nicht|nichts|kein|keine|keinen)$ ^abarbeiten$"),
positive = "^abarbeiten$"
))
txt <- c(
"etwas nicht abarbeiten und etwas keine abarbeiten",
"etwas abarbeiten und keinen abarbeiten"
)
dfm(txt, dictionary = dicttest, valuetype = "regex")
## Document-feature matrix of: 2 documents, 2 features (0.0% sparse).
## 2 x 2 sparse Matrix of class "dfm"
##        features
## docs    neg_positive positive
##   text1            2        2
##   text2            1        2
tokens(txt) %>%
tokens_lookup(dicttest["neg_positive"], valuetype = "regex", exclusive = FALSE) %>%
tokens_lookup(dicttest, valuetype = "regex", exclusive = FALSE)
## tokens from 2 documents.
## text1 :
## [1] "etwas" "NEG_POSITIVE" "und" "etwas"
## [5] "NEG_POSITIVE"
##
## text2 :
## [1] "etwas" "POSITIVE" "und" "NEG_POSITIVE"
@kbenoit Yes, 'piping' it in that order does the trick. Learned something, thanks!
In the (German) "data_dictionary_Rauh", the negated forms should be "nicht ..." instead of "not ...". For nominalized forms ending in "...ung", it should be "keine".