textstat_lexdiv: different results for tokens and dfm objects #55

Open
ElisaWirsching opened this issue Mar 2, 2023 · 0 comments

Describe the bug

I noticed that textstat_lexdiv produces different results depending on whether it is given a tokens object or a dfm object. When I calculate the TTR by hand, for example, the figures match the output of textstat_lexdiv on a dfm exactly, but differ from the output of the function on a tokens object. Why is this? Is this behavior expected? It is not clear to me from the source code.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library(quanteda)            # tokens(), dfm(), corpus_subset(), data_corpus_inaugural
library(quanteda.textstats)  # textstat_lexdiv()
library(magrittr)            # %>%

data(data_corpus_inaugural)
reagan_corpus <- corpus_subset(data_corpus_inaugural, Year == 1981 | Year == 1985)
reagan_tokens <- tokens(reagan_corpus, remove_punct = TRUE, remove_numbers = FALSE,
                        remove_symbols = FALSE)
# named reagan_dfm to avoid confusion with the dfm() function
reagan_dfm <- dfm(reagan_tokens, tolower = FALSE)
reagan_dfm %>% textstat_lexdiv(measure = c("TTR", "R"),
                               remove_numbers = FALSE, remove_punct = TRUE,
                               remove_symbols = FALSE, remove_hyphens = FALSE)

# --- Versus:

reagan_tokens %>% textstat_lexdiv(measure = c("TTR", "R"),
                                  remove_numbers = FALSE, remove_punct = TRUE,
                                  remove_symbols = FALSE, remove_hyphens = FALSE)

# --- By hand:

ntype(reagan_dfm) / ntoken(reagan_dfm)       # this is the same as textstat_lexdiv with a dfm
ntype(reagan_tokens) / ntoken(reagan_tokens) # this is also the same as textstat_lexdiv with a dfm
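
For reference, the by-hand figures are simple ratios: TTR is the number of distinct types divided by the number of tokens, and R (Guiraud's root TTR) divides the types by the square root of the tokens. The base-R sketch below uses a made-up token vector (`toks`, `ttr` and `root_ttr` are illustrative names, not quanteda functions) to show how case folding alone changes both figures. Whether case folding is what separates the tokens and dfm results here is an assumption, not something confirmed in this issue.

```r
# Toy token vector (hypothetical data, not taken from the Reagan speeches)
toks <- c("The", "people", "and", "the", "government", "of", "The", "people")

# TTR: distinct types divided by total tokens
ttr <- function(x) length(unique(x)) / length(x)

# Guiraud's R: distinct types divided by the square root of total tokens
root_ttr <- function(x) length(unique(x)) / sqrt(length(x))

# Case-sensitive: "The" and "the" count as two different types
ttr_cased  <- ttr(toks)
root_cased <- root_ttr(toks)

# Case-folded: "The" and "the" collapse into one type
ttr_lowered  <- ttr(tolower(toks))
root_lowered <- root_ttr(tolower(toks))

print(c(cased = ttr_cased, lowered = ttr_lowered))
print(c(cased = root_cased, lowered = root_lowered))
```

On the real data, one way to test the same idea would be to build the dfm from `reagan_tokens` once with `tolower = TRUE` and once with `tolower = FALSE`, and see which of the two reproduces the tokens-based output.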

Expected behavior

I would expect both methods to return the same estimates for the TTR.

System information

Please run sessionInfo() and paste the output.

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.textstats_0.95 quanteda.corpora_0.9.2  quanteda_3.2.1         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         pillar_1.6.4       compiler_4.0.2     stopwords_2.2     
 [5] forcats_0.5.0      tools_4.0.2        digest_0.6.27      evaluate_0.14     
 [9] lifecycle_1.0.3    tibble_3.1.0       lattice_0.20-41    tidylog_1.0.2     
[13] pkgconfig_2.0.3    rlang_1.0.6        fastmatch_1.1-0    Matrix_1.4-1      
[17] cli_3.6.0          rstudioapi_0.13    yaml_2.2.1         parallel_4.0.2    
[21] xfun_0.30          fastmap_1.1.0      dplyr_1.1.0        knitr_1.37        
[25] generics_0.1.2     vctrs_0.5.2        grid_4.0.2         tidyselect_1.2.0  
[29] nsyllable_1.0.1    glue_1.4.2         R6_2.5.0           pbapply_1.4-2     
[33] fansi_0.4.2        rmarkdown_2.14     pacman_0.5.1       tidyr_1.2.0       
[37] purrr_0.3.4        magrittr_2.0.1     clisymbols_1.2.0   ellipsis_0.3.2    
[41] htmltools_0.5.2    corpus_0.10.1      stringdist_0.9.8   utf8_1.1.4        
[45] stringi_1.5.3      RcppParallel_5.0.3 crayon_1.4.1   

@koheiw transferred this issue from quanteda/quanteda on Mar 22, 2023