cumulative percentage > 100% #21
@zhizhongpu, thank you for reporting this! I have confirmed it below. Would you like to propose a solution in a pull request?
library(tidytab)
library(tidyverse)
enframe(1:1010) %>% tab(value)
#> value │ Freq. Percent Cum.
#> ────────────┼───────────────────────────────────
#> 1 │ 1 0.1 0.1
#> 2 │ 1 0.1 0.2
#> 3 │ 1 0.1 0.3
#> (redacted)
#> 997 │ 1 0.1 99.7
#> 998 │ 1 0.1 99.8
#> 999 │ 1 0.1 99.9
#> 1000 │ 1 0.1 100.0
#> 1001 │ 1 0.1 100.1
#> 1002 │ 1 0.1 100.2
#> 1003 │ 1 0.1 100.3
#> 1004 │ 1 0.1 100.4
#> 1005 │ 1 0.1 100.5
#> 1006 │ 1 0.1 100.6
#> 1007 │ 1 0.1 100.7
#> 1008 │ 1 0.1 100.8
#> 1009 │ 1 0.1 100.9
#> 1010 │ 1 0.1 100.0
#> ────────────┼───────────────────────────────────
#> Total │ 1010 100.0
Created on 2024-04-16 with reprex v2.1.0
For reference, this is how Stata outputs the same:
. tab value
value | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 0.10 0.10
2 | 1 0.10 0.20
3 | 1 0.10 0.30
* (redacted)
997 | 1 0.10 98.71
998 | 1 0.10 98.81
999 | 1 0.10 98.91
1000 | 1 0.10 99.01
1001 | 1 0.10 99.11
1002 | 1 0.10 99.21
1003 | 1 0.10 99.31
1004 | 1 0.10 99.41
1005 | 1 0.10 99.50
1006 | 1 0.10 99.60
1007 | 1 0.10 99.70
1008 | 1 0.10 99.80
1009 | 1 0.10 99.90
1010 | 1 0.10 100.00
------------+-----------------------------------
Total | 1,010 100.00
My preference would be to keep single decimals for the
I think the problem comes from the function ftab in tab.R:
# tab.R line 139
x <- dplyr::mutate(
  x,
  Percent = formatC(.data[["Freq."]] / sum(.data[["Freq."]]) * 100, digits = 1L, format = "f"),
  Cum. = formatC(cumsum(.data[["Percent"]]), digits = 1L, format = "f")
)
Created on 2024-04-17 with reprex v2.1.0
The problem comes from digits = 1L, because it rounds the computed percentages before they are accumulated: Cum. is a running sum of the already-rounded Percent values, so the rounding errors add up. This becomes an issue when the input table has many rows, because each percentage is then small and its rounded value is imprecise or close to zero. I would suggest removing every digits = 1L on this line and rounding the final numbers only at the end of the computation.
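To make the suggestion concrete, here is a minimal base R sketch of "round only at the end". It is not the tidytab implementation: pct_table is a hypothetical helper written for this comment, and it assumes the counts arrive as a plain numeric vector. The cumulative column is derived from cumulative counts, so it never passes through a rounded intermediate value.

```r
# Hypothetical helper, not part of tidytab: compute percentages at full
# precision and round only when formatting for display.
pct_table <- function(freq, digits = 1L) {
  total <- sum(freq)
  pct <- freq / total * 100          # unrounded percentages
  cum <- cumsum(freq) / total * 100  # cumulative share from cumulative counts
  data.frame(
    Freq.   = freq,
    Percent = formatC(pct, digits = digits, format = "f"),
    Cum.    = formatC(cum, digits = digits, format = "f"),
    stringsAsFactors = FALSE
  )
}

# Same input as the reprex above: 1010 categories with one observation each
res <- pct_table(rep(1L, 1010))
tail(res, 3)
```

With this approach the last cumulative value is 100 by construction, so no special handling of the final row is needed.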
Please briefly describe your problem and what output you expect. If you have a question, please don't use this form. Instead, ask on https://stackoverflow.com/ or https://community.rstudio.com/.
──────────────────┼───────────────────────────────────
ancud │ 3 0.2 0.2
antofagasta │ 32 1.8 2.0
arauco │ 10 0.6 2.6
arica │ 14 0.8 3.4
aysen │ 2 0.1 3.5
buin │ 1 0.1 3.6
calbuco │ 2 0.1 3.7
carahue │ 3 0.2 3.9
casablanca │ 1 0.1 4.0
castro │ 1 0.1 4.1
cerro navia │ 2 0.1 4.2
chimbarongo │ 3 0.2 4.4
cobquecura │ 1 0.1 4.5
colbun │ 1 0.1 4.6
colchane │ 1 0.1 4.7
collipulli │ 9 0.5 5.2
concon │ 1 0.1 5.3
coyhaique │ 9 0.5 5.8
cunco │ 5 0.3 6.1
curacautin │ 1 0.1 6.2
curanilahue │ 4 0.2 6.4
curarrehue │ 1 0.1 6.5
dalcahue │ 2 0.1 6.6
el bosque │ 9 0.5 7.1
el quisco │ 1 0.1 7.2
ercilla │ 4 0.2 7.4
florida │ 15 0.8 8.2
futrono │ 4 0.2 8.4
galvarino │ 9 0.5 8.9
higgins │ 23 1.3 10.2
huasco │ 1 0.1 10.3
huechuraba │ 4 0.2 10.5
la ligua │ 1 0.1 10.6
la union │ 2 0.1 10.7
las condes │ 2 0.1 10.8
lautaro │ 10 0.6 11.4
lebu │ 6 0.3 11.7
limache │ 3 0.2 11.9
llanquihue │ 1 0.1 12.0
lo espejo │ 1 0.1 12.1
loncoche │ 1 0.1 12.2
los andes │ 1 0.1 12.3
los lagos │ 5 0.3 12.6
los vilos │ 1 0.1 12.7
lota │ 2 0.1 12.8
macul │ 2 0.1 12.9
marchihue │ 1 0.1 13.0
maule │ 3 0.2 13.2
mejillones │ 1 0.1 13.3
melipilla │ 1 0.1 13.4
mostazal │ 1 0.1 13.5
nancagua │ 1 0.1 13.6
nueva imperial │ 1 0.1 13.7
nunoa │ 5 0.3 14.0
paillaco │ 1 0.1 14.1
panguipulli │ 1 0.1 14.2
penalolen │ 1 0.1 14.3
petorca │ 3 0.2 14.5
pichidegua │ 1 0.1 14.6
pichilemu │ 3 0.2 14.8
pozo almonte │ 2 0.1 14.9
providencia │ 7 0.4 15.3
pudahuel │ 5 0.3 15.6
puente alto │ 7 0.4 16.0
puerto octay │ 1 0.1 16.1
puerto varas │ 1 0.1 16.2
purranque │ 2 0.1 16.3
quellon │ 1 0.1 16.4
quilicura │ 5 0.3 16.7
renca │ 1 0.1 16.8
rio bueno │ 3 0.2 17.0
rio negro │ 6 0.3 17.3
san antonio │ 3 0.2 17.5
san carlos │ 2 0.1 17.6
san gregorio │ 1 0.1 17.7
san javier │ 1 0.1 17.8
san joaquin │ 2 0.1 17.9
santa cruz │ 1 0.1 18.0
santiago │ 1488 82.0 100.0
sierra gorda │ 1 0.1 100.1
taltal │ 1 0.1 100.2
teodoro schmidt │ 1 0.1 100.3
tiltil │ 2 0.1 100.4
tirua │ 1 0.1 100.5
tocopilla │ 2 0.1 100.6
tome │ 1 0.1 100.7
tucapel │ 3 0.2 100.9
valdivia │ 6 0.3 101.2
vallenar │ 1 0.1 101.3
victoria │ 2 0.1 101.4
vilcun │ 1 0.1 101.5
villarrica │ 3 0.2 101.7
vitacura │ 4 0.2 101.9
yumbel │ 1 0.1 102.0
yungay │ 4 0.2 100.0
──────────────────┼───────────────────────────────────
Total │ 1814 100.0
Please include a minimal reproducible example (AKA a reprex). If you've never heard of a reprex before, start by reading https://www.tidyverse.org/help/#reprex.
I'm using a proprietary dataset, but I think the general problem occurs whenever there are a large number of categories that each account for a relatively small percentage of the total.
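Since the real data cannot be shared, the following base R sketch reproduces the pattern with made-up counts (300 categories with one observation each plus one dominant category; these numbers are an assumption chosen for illustration, not taken from the dataset). It compares a running sum of rounded percentages with rounding applied after accumulation.

```r
# Synthetic counts for illustration only: many tiny categories, one large one
freq <- c(rep(1L, 300), 1500L)
pct <- freq / sum(freq) * 100

# Running sum of percentages that were rounded first (the pattern in this issue)
cum_of_rounded <- cumsum(round(pct, 1))

# Accumulate at full precision, then round once for display
rounded_cum <- round(cumsum(freq) / sum(freq) * 100, 1)

max(cum_of_rounded)  # exceeds 100 because each small share is rounded up
max(rounded_cum)
```

Each small category contributes about 0.06% but is rounded up to 0.1%, so the running total drifts further above 100 as the number of small categories grows.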