textstat_lexdiv: different results for tokens and dfm objects #55

ElisaWirsching · 2023-03-02T03:51:49Z

Describe the bug

I noticed that textstat_lexdiv produces different results depending on whether a tokens or a dfm object is passed to it. When I calculate the TTR by hand, for example, the figures match the output of textstat_lexdiv on the dfm exactly, but differ from its output on the tokens object. Why is this? Is this behavior expected? It is not clear to me from the source code.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library(quanteda)
library(quanteda.textstats)

data(data_corpus_inaugural)
reagan_corpus <- corpus_subset(data_corpus_inaugural, Year == 1981 | Year == 1985)
reagan_tokens <- tokens(reagan_corpus, remove_punct = TRUE, remove_numbers = FALSE,
                        remove_symbols = FALSE)
# Named reagan_dfm so the object does not mask the dfm() function
reagan_dfm <- dfm(reagan_tokens, tolower = FALSE)

# --- textstat_lexdiv on the dfm:

textstat_lexdiv(reagan_dfm, measure = c("TTR", "R"),
                remove_numbers = FALSE, remove_punct = TRUE,
                remove_symbols = FALSE, remove_hyphens = FALSE)

# --- versus textstat_lexdiv on the tokens:

textstat_lexdiv(reagan_tokens, measure = c("TTR", "R"),
                remove_numbers = FALSE, remove_punct = TRUE,
                remove_symbols = FALSE, remove_hyphens = FALSE)

# --- by hand:

ntype(reagan_dfm) / ntoken(reagan_dfm)       # matches textstat_lexdiv on the dfm
ntype(reagan_tokens) / ntoken(reagan_tokens) # also matches textstat_lexdiv on the dfm

Expected behavior

I would expect both methods to return the same estimates for the TTR.
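
For reference, the two measures can be checked by hand in base R, without quanteda. The sketch below uses a made-up token vector (not the inaugural corpus) and a hypothetical helper, lexdiv_by_hand, which computes TTR as types / tokens and R as types / sqrt(tokens). It also shows how much the figures move when the tokens are lowercased first. Whether textstat_lexdiv applies case folding internally for tokens input is an assumption that this report does not confirm; it is only one thing that may be worth ruling out, given that the dfm above was built with tolower = FALSE.

```r
# Hypothetical toy tokens, not taken from the inaugural corpus.
toks <- c("The", "people", "and", "the", "government", "of", "the", "people")

# Hypothetical helper: TTR = V / N and R = V / sqrt(N),
# where V is the number of types and N the number of tokens.
lexdiv_by_hand <- function(x) {
  n_tokens <- length(x)
  n_types <- length(unique(x))
  c(TTR = n_types / n_tokens, R = n_types / sqrt(n_tokens))
}

# Case-sensitive counting treats "The" and "the" as two types.
case_sensitive <- lexdiv_by_hand(toks)

# Case-folded counting merges them into one type.
case_folded <- lexdiv_by_hand(tolower(toks))

print(case_sensitive)
print(case_folded)
```

With these eight tokens the case-sensitive count has six types and the case-folded count has five, so the two TTR values differ even though the token count is the same.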

System information

Please run sessionInfo() and paste the output.

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.textstats_0.95 quanteda.corpora_0.9.2  quanteda_3.2.1         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         pillar_1.6.4       compiler_4.0.2     stopwords_2.2     
 [5] forcats_0.5.0      tools_4.0.2        digest_0.6.27      evaluate_0.14     
 [9] lifecycle_1.0.3    tibble_3.1.0       lattice_0.20-41    tidylog_1.0.2     
[13] pkgconfig_2.0.3    rlang_1.0.6        fastmatch_1.1-0    Matrix_1.4-1      
[17] cli_3.6.0          rstudioapi_0.13    yaml_2.2.1         parallel_4.0.2    
[21] xfun_0.30          fastmap_1.1.0      dplyr_1.1.0        knitr_1.37        
[25] generics_0.1.2     vctrs_0.5.2        grid_4.0.2         tidyselect_1.2.0  
[29] nsyllable_1.0.1    glue_1.4.2         R6_2.5.0           pbapply_1.4-2     
[33] fansi_0.4.2        rmarkdown_2.14     pacman_0.5.1       tidyr_1.2.0       
[37] purrr_0.3.4        magrittr_2.0.1     clisymbols_1.2.0   ellipsis_0.3.2    
[41] htmltools_0.5.2    corpus_0.10.1      stringdist_0.9.8   utf8_1.1.4        
[45] stringi_1.5.3      RcppParallel_5.0.3 crayon_1.4.1

koheiw transferred this issue from quanteda/quanteda Mar 22, 2023
