cumulative percentage > 100% #21

zhizhongpu · 2024-04-16T00:59:46Z

Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.

    city │      Freq.     Percent        Cum.

──────────────────┼───────────────────────────────────
ancud │ 3 0.2 0.2
antofagasta │ 32 1.8 2.0
arauco │ 10 0.6 2.6
arica │ 14 0.8 3.4
aysen │ 2 0.1 3.5
buin │ 1 0.1 3.6
calbuco │ 2 0.1 3.7
carahue │ 3 0.2 3.9
casablanca │ 1 0.1 4.0
castro │ 1 0.1 4.1
cerro navia │ 2 0.1 4.2
chimbarongo │ 3 0.2 4.4
cobquecura │ 1 0.1 4.5
colbun │ 1 0.1 4.6
colchane │ 1 0.1 4.7
collipulli │ 9 0.5 5.2
concon │ 1 0.1 5.3
coyhaique │ 9 0.5 5.8
cunco │ 5 0.3 6.1
curacautin │ 1 0.1 6.2
curanilahue │ 4 0.2 6.4
curarrehue │ 1 0.1 6.5
dalcahue │ 2 0.1 6.6
el bosque │ 9 0.5 7.1
el quisco │ 1 0.1 7.2
ercilla │ 4 0.2 7.4
florida │ 15 0.8 8.2
futrono │ 4 0.2 8.4
galvarino │ 9 0.5 8.9
higgins │ 23 1.3 10.2
huasco │ 1 0.1 10.3
huechuraba │ 4 0.2 10.5
la ligua │ 1 0.1 10.6
la union │ 2 0.1 10.7
las condes │ 2 0.1 10.8
lautaro │ 10 0.6 11.4
lebu │ 6 0.3 11.7
limache │ 3 0.2 11.9
llanquihue │ 1 0.1 12.0
lo espejo │ 1 0.1 12.1
loncoche │ 1 0.1 12.2
los andes │ 1 0.1 12.3
los lagos │ 5 0.3 12.6
los vilos │ 1 0.1 12.7
lota │ 2 0.1 12.8
macul │ 2 0.1 12.9
marchihue │ 1 0.1 13.0
maule │ 3 0.2 13.2
mejillones │ 1 0.1 13.3
melipilla │ 1 0.1 13.4
mostazal │ 1 0.1 13.5
nancagua │ 1 0.1 13.6
nueva imperial │ 1 0.1 13.7
nunoa │ 5 0.3 14.0
paillaco │ 1 0.1 14.1
panguipulli │ 1 0.1 14.2
penalolen │ 1 0.1 14.3
petorca │ 3 0.2 14.5
pichidegua │ 1 0.1 14.6
pichilemu │ 3 0.2 14.8
pozo almonte │ 2 0.1 14.9
providencia │ 7 0.4 15.3
pudahuel │ 5 0.3 15.6
puente alto │ 7 0.4 16.0
puerto octay │ 1 0.1 16.1
puerto varas │ 1 0.1 16.2
purranque │ 2 0.1 16.3
quellon │ 1 0.1 16.4
quilicura │ 5 0.3 16.7
renca │ 1 0.1 16.8
rio bueno │ 3 0.2 17.0
rio negro │ 6 0.3 17.3
san antonio │ 3 0.2 17.5
san carlos │ 2 0.1 17.6
san gregorio │ 1 0.1 17.7
san javier │ 1 0.1 17.8
san joaquin │ 2 0.1 17.9
santa cruz │ 1 0.1 18.0
santiago │ 1488 82.0 100.0
sierra gorda │ 1 0.1 100.1
taltal │ 1 0.1 100.2
teodoro schmidt │ 1 0.1 100.3
tiltil │ 2 0.1 100.4
tirua │ 1 0.1 100.5
tocopilla │ 2 0.1 100.6
tome │ 1 0.1 100.7
tucapel │ 3 0.2 100.9
valdivia │ 6 0.3 101.2
vallenar │ 1 0.1 101.3
victoria │ 2 0.1 101.4
vilcun │ 1 0.1 101.5
villarrica │ 3 0.2 101.7
vitacura │ 4 0.2 101.9
yumbel │ 1 0.1 102.0
yungay │ 4 0.2 100.0
──────────────────┼───────────────────────────────────
Total │ 1814 100.0

Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
i'm using a proprietary dataset, but i think the general problem holds whenever there're a large number of categories with relatively small percentage of the total
For more advice on how to write a great issue, see https://code-review.tidyverse.org/issues/.

Brief description of the problem

# insert reprex here

The text was updated successfully, but these errors were encountered:

gvelasq · 2024-04-17T02:26:57Z

@zhizhongpu, thank you for reporting this! I have confirmed it below. Would you like to propose a solution in a pull request?

library(tidytab)
library(tidyverse)
enframe(1:1010) %>% tab(value)
#>       value │      Freq.     Percent        Cum. 
#> ────────────┼───────────────────────────────────
#>           1 │          1         0.1         0.1 
#>           2 │          1         0.1         0.2 
#>           3 │          1         0.1         0.3 
#> (redacted)
#>         997 │          1         0.1        99.7 
#>         998 │          1         0.1        99.8 
#>         999 │          1         0.1        99.9 
#>        1000 │          1         0.1       100.0 
#>        1001 │          1         0.1       100.1 
#>        1002 │          1         0.1       100.2 
#>        1003 │          1         0.1       100.3 
#>        1004 │          1         0.1       100.4 
#>        1005 │          1         0.1       100.5 
#>        1006 │          1         0.1       100.6 
#>        1007 │          1         0.1       100.7 
#>        1008 │          1         0.1       100.8 
#>        1009 │          1         0.1       100.9 
#>        1010 │          1         0.1       100.0 
#> ────────────┼───────────────────────────────────
#>       Total │       1010       100.0

^{Created on 2024-04-16 with reprex v2.1.0}

For reference, this is how Stata outputs the same:

. tab value

      value |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        0.10        0.10
          2 |          1        0.10        0.20
          3 |          1        0.10        0.30
* (redacted)
        997 |          1        0.10       98.71
        998 |          1        0.10       98.81
        999 |          1        0.10       98.91
       1000 |          1        0.10       99.01
       1001 |          1        0.10       99.11
       1002 |          1        0.10       99.21
       1003 |          1        0.10       99.31
       1004 |          1        0.10       99.41
       1005 |          1        0.10       99.50
       1006 |          1        0.10       99.60
       1007 |          1        0.10       99.70
       1008 |          1        0.10       99.80
       1009 |          1        0.10       99.90
       1010 |          1        0.10      100.00
------------+-----------------------------------
      Total |      1,010      100.00

My preference would be to keep single decimals for the Percent and Cum. columns.

askawron · 2024-04-17T16:02:15Z

I think the problem comes from the function ftab in tab.R

# tab.R line 139
x <- dplyr::mutate(x, Percent = formatC(.data[["Freq."]] / sum(.data[["Freq."]]) * 100, digits = 1L,, format = "f"), Cum. = formatC(cumsum(.data[["Percent"]]), digits = 1L, format = "f"))

^{Created on 2024-04-17 with reprex v2.1.0}

The problem comes from the digits = 1L because it rounds the percentages computed. This becomes an issue when the table in input is large as the percentages computed become either imprecise or close to zero. I would suggest to remove all the digits = 1L at this line and to round the final numbers at the end of the computation.

gvelasq added bug an unexpected problem or unintended behavior good first issue ❤️ good issue for first-time contributors help wanted ❤️ we'd love your help! labels Apr 17, 2024

askawron mentioned this issue Apr 17, 2024

Correct the error in the computation of cumulative percentage #22

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cumulative percentage > 100% #21

cumulative percentage > 100% #21

zhizhongpu commented Apr 16, 2024

gvelasq commented Apr 17, 2024

askawron commented Apr 17, 2024

cumulative percentage > 100% #21

cumulative percentage > 100% #21

Comments

zhizhongpu commented Apr 16, 2024

gvelasq commented Apr 17, 2024

askawron commented Apr 17, 2024