Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cumulative percentage > 100% #21

Open
zhizhongpu opened this issue Apr 16, 2024 · 2 comments
Open

cumulative percentage > 100% #21

zhizhongpu opened this issue Apr 16, 2024 · 2 comments
Labels
bug an unexpected problem or unintended behavior good first issue ❤️ good issue for first-time contributors help wanted ❤️ we'd love your help!

Comments

@zhizhongpu
Copy link

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

    city │      Freq.     Percent        Cum. 

──────────────────┼───────────────────────────────────
ancud │ 3 0.2 0.2
antofagasta │ 32 1.8 2.0
arauco │ 10 0.6 2.6
arica │ 14 0.8 3.4
aysen │ 2 0.1 3.5
buin │ 1 0.1 3.6
calbuco │ 2 0.1 3.7
carahue │ 3 0.2 3.9
casablanca │ 1 0.1 4.0
castro │ 1 0.1 4.1
cerro navia │ 2 0.1 4.2
chimbarongo │ 3 0.2 4.4
cobquecura │ 1 0.1 4.5
colbun │ 1 0.1 4.6
colchane │ 1 0.1 4.7
collipulli │ 9 0.5 5.2
concon │ 1 0.1 5.3
coyhaique │ 9 0.5 5.8
cunco │ 5 0.3 6.1
curacautin │ 1 0.1 6.2
curanilahue │ 4 0.2 6.4
curarrehue │ 1 0.1 6.5
dalcahue │ 2 0.1 6.6
el bosque │ 9 0.5 7.1
el quisco │ 1 0.1 7.2
ercilla │ 4 0.2 7.4
florida │ 15 0.8 8.2
futrono │ 4 0.2 8.4
galvarino │ 9 0.5 8.9
higgins │ 23 1.3 10.2
huasco │ 1 0.1 10.3
huechuraba │ 4 0.2 10.5
la ligua │ 1 0.1 10.6
la union │ 2 0.1 10.7
las condes │ 2 0.1 10.8
lautaro │ 10 0.6 11.4
lebu │ 6 0.3 11.7
limache │ 3 0.2 11.9
llanquihue │ 1 0.1 12.0
lo espejo │ 1 0.1 12.1
loncoche │ 1 0.1 12.2
los andes │ 1 0.1 12.3
los lagos │ 5 0.3 12.6
los vilos │ 1 0.1 12.7
lota │ 2 0.1 12.8
macul │ 2 0.1 12.9
marchihue │ 1 0.1 13.0
maule │ 3 0.2 13.2
mejillones │ 1 0.1 13.3
melipilla │ 1 0.1 13.4
mostazal │ 1 0.1 13.5
nancagua │ 1 0.1 13.6
nueva imperial │ 1 0.1 13.7
nunoa │ 5 0.3 14.0
paillaco │ 1 0.1 14.1
panguipulli │ 1 0.1 14.2
penalolen │ 1 0.1 14.3
petorca │ 3 0.2 14.5
pichidegua │ 1 0.1 14.6
pichilemu │ 3 0.2 14.8
pozo almonte │ 2 0.1 14.9
providencia │ 7 0.4 15.3
pudahuel │ 5 0.3 15.6
puente alto │ 7 0.4 16.0
puerto octay │ 1 0.1 16.1
puerto varas │ 1 0.1 16.2
purranque │ 2 0.1 16.3
quellon │ 1 0.1 16.4
quilicura │ 5 0.3 16.7
renca │ 1 0.1 16.8
rio bueno │ 3 0.2 17.0
rio negro │ 6 0.3 17.3
san antonio │ 3 0.2 17.5
san carlos │ 2 0.1 17.6
san gregorio │ 1 0.1 17.7
san javier │ 1 0.1 17.8
san joaquin │ 2 0.1 17.9
santa cruz │ 1 0.1 18.0
santiago │ 1488 82.0 100.0
sierra gorda │ 1 0.1 100.1
taltal │ 1 0.1 100.2
teodoro schmidt │ 1 0.1 100.3
tiltil │ 2 0.1 100.4
tirua │ 1 0.1 100.5
tocopilla │ 2 0.1 100.6
tome │ 1 0.1 100.7
tucapel │ 3 0.2 100.9
valdivia │ 6 0.3 101.2
vallenar │ 1 0.1 101.3
victoria │ 2 0.1 101.4
vilcun │ 1 0.1 101.5
villarrica │ 3 0.2 101.7
vitacura │ 4 0.2 101.9
yumbel │ 1 0.1 102.0
yungay │ 4 0.2 100.0
──────────────────┼───────────────────────────────────
Total │ 1814 100.0

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
i'm using a proprietary dataset, but i think the general problem holds whenever there're a large number of categories with relatively small percentage of the total
For more advice on how to write a great issue, see https://code-review.tidyverse.org/issues/.

Brief description of the problem

# insert reprex here
@gvelasq
Copy link
Owner

gvelasq commented Apr 17, 2024

@zhizhongpu, thank you for reporting this! I have confirmed it below. Would you like to propose a solution in a pull request?

library(tidytab)
library(tidyverse)
enframe(1:1010) %>% tab(value)
#>       value │      Freq.     Percent        Cum. 
#> ────────────┼───────────────────────────────────
#>           1 │          1         0.1         0.1 
#>           2 │          1         0.1         0.2 
#>           3 │          1         0.1         0.3 
#> (redacted)
#>         997 │          1         0.1        99.7 
#>         998 │          1         0.1        99.8 
#>         999 │          1         0.1        99.9 
#>        1000 │          1         0.1       100.0 
#>        1001 │          1         0.1       100.1 
#>        1002 │          1         0.1       100.2 
#>        1003 │          1         0.1       100.3 
#>        1004 │          1         0.1       100.4 
#>        1005 │          1         0.1       100.5 
#>        1006 │          1         0.1       100.6 
#>        1007 │          1         0.1       100.7 
#>        1008 │          1         0.1       100.8 
#>        1009 │          1         0.1       100.9 
#>        1010 │          1         0.1       100.0 
#> ────────────┼───────────────────────────────────
#>       Total │       1010       100.0           

Created on 2024-04-16 with reprex v2.1.0

For reference, this is how Stata outputs the same:

. tab value

      value |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        0.10        0.10
          2 |          1        0.10        0.20
          3 |          1        0.10        0.30
* (redacted)
        997 |          1        0.10       98.71
        998 |          1        0.10       98.81
        999 |          1        0.10       98.91
       1000 |          1        0.10       99.01
       1001 |          1        0.10       99.11
       1002 |          1        0.10       99.21
       1003 |          1        0.10       99.31
       1004 |          1        0.10       99.41
       1005 |          1        0.10       99.50
       1006 |          1        0.10       99.60
       1007 |          1        0.10       99.70
       1008 |          1        0.10       99.80
       1009 |          1        0.10       99.90
       1010 |          1        0.10      100.00
------------+-----------------------------------
      Total |      1,010      100.00

My preference would be to keep single decimals for the Percent and Cum. columns.

@gvelasq gvelasq added bug an unexpected problem or unintended behavior good first issue ❤️ good issue for first-time contributors help wanted ❤️ we'd love your help! labels Apr 17, 2024
@askawron
Copy link
Contributor

I think the problem comes from the function ftab in tab.R

# tab.R line 139
x <- dplyr::mutate(x, Percent = formatC(.data[["Freq."]] / sum(.data[["Freq."]]) * 100, digits = 1L,, format = "f"), Cum. = formatC(cumsum(.data[["Percent"]]), digits = 1L, format = "f"))

Created on 2024-04-17 with reprex v2.1.0

The problem comes from the digits = 1L because it rounds the percentages computed. This becomes an issue when the table in input is large as the percentages computed become either imprecise or close to zero. I would suggest to remove all the digits = 1L at this line and to round the final numbers at the end of the computation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior good first issue ❤️ good issue for first-time contributors help wanted ❤️ we'd love your help!
Projects
None yet
Development

No branches or pull requests

3 participants