-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-descriptive-statistics.qmd
3108 lines (2207 loc) · 98.8 KB
/
02-descriptive-statistics.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Descriptive statistics {#sec-chap02}
```{r}
#| label: setup
#| include: false
base::source(file = "R/helper.R")
```
## Achievements to unlock
::: {#obj-chap02}
::: {.my-objectives}
::: {.my-objectives-header}
Objectives for chapter 02
:::
::: {.my-objectives-container}
**SwR Achievements**
- **Achievement 1**: (~~Understanding variable types and data types~~)
(@sec-variable-data-types). As I already know this information I
will skip this achievement. This empty section is only mentioned to
get the same section numbering as in the book.
- **Achievement 2**: Choosing and conducting descriptive analyses for
categorical (factor)
variables (@sec-chap02-describe-categorical-variables)
- **Achievement 3**: Choosing and conducting descriptive analyses for
continuous (numeric) variables (@sec-chap02-describe-numeric-variables)
- **Achievement 4**: Developing clear tables for reporting descriptive
statistics (@sec-chap02-create-tables-for-publishing)
:::
:::
Achievements for chapter 02
:::
## The transgender health care problem
Subject of this chapter is the transgender health care problem.
- `r glossary("Transgender")` people are people whose biological sex
is not consistent with their gender.
- `r glossary("Cisgender")` people are people whose gender identity
matches their biological sex.
In this chapter we are going to use data from the `r glossary("BRFSS")` website. The Behavioral Risk Factor Surveillance System (BRFSS) is a system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world. (<a href="https://www.cdc.gov/brfss/index.html">BRFSS</a>)
The rationale of this chapter is the fact, that prior to examining
statistical relationships among variables, published studies in academic
journals usually present descriptive statistics. What kind of measures exist and how to describe the data set is an important task before one can start to analyse statistical relations. The following
screenshot from [@narayan2017a, p.876] demonstrates how such descriptive statistics can be summarized to create a so-called
`r glossary("Table 1")` as an introduction to the subject:
[![Screenshot of Table 1, "Transgender Survey Participants Demographics"
from Narayan et. al.
(2017)](img/chap02/table1-example-min.png){#fig-transgender-table1
fig-alt="Screenshot of Table 1, \"Transgender Survey Participants Demographics\" from Narayan et. al. (2017)"}](https://link.springer.com/article/10.1007/s10549-017-4461-8)
In this chapter we are going to reproduce @fig-transgender-table1 and add some additional information.
## Data, codebook, and R packages
::: {.my-resource}
::: {.my-resource-header}
:::::: {#lem-chap02-resources}
: Data, codebook, and R packages for learning about descriptive statistics
::::::
:::
::: {.my-resource-container}
**Data**
There are two options:
1. Download the clean data file `transgender_hc_ch2.csv` from
<https://edge.sagepub.com/harris1e>.
2. Get the data file from the `r glossary("BRFSS")` website. (See
@lst-chap02-get-brfss-2014-data).
- The page with the 2014 data is at
<https://www.cdc.gov/brfss/annual_data/annual_2014.html>.
- The URL for the SAS transport file is at
<http://www.cdc.gov/brfss/annual_data/2014/files/LLCP2014XPT.zip>.
The ZIP-file is with 69 MB pretty big. Unzipped it will have almost 1
GB! --- I will only work with this second option[^chap02-1].
[^chap02-1]: This was the original plan, but after several chapters working with the original data (downloading, recoding etc.) wasn't instructive anymore. I didn't learn new things because it was more a less a repetitive task with redundant code lines. So starting with @sec-chap07 I have used the data files provided by the book and to strengthen my focus on the statistical issues.
**Codebook**
Again there are two options:
1. Download `brfss_2014_codebook.pdf` from
<https://edge.sagepub.com/harris1e>.
2. Download the codebook as PDF from the BRFSS webpage.
- The location of the codebook is the same as above:
<https://www.cdc.gov/brfss/annual_data/annual_2014.html>.
- The PDF codebook (2.74 MB) can be found at
[https://www.cdc.gov/brfss/annual_data/2014/pdf/CODEBOOK14_LLCP.pdf](https://www.cdc.gov/brfss/annual_data/annual_2014.html)
**Packages**
1. Packages used with the book (sorted alphabetically):
- descr (@sec-descr)
- haven (@sec-haven)
- kableExtra {@sec-kableExtra}
- knitr
- tidyverse (@sec-tidyverse)
- dplyr (@sec-dplyr)
- ggplot2 (@sec-ggplot2)
- forcats (@sec-forcats)
- tidyr (@sec-tidyr)
- tableone
- semTools (@sec-semTools)
- qualvar (@sec-qualvar)
- labelled (@sec-labelled)
- Hmisc (@sec-Hmisc)
2. My additional packages (sorted alphabetically)
- gtsummary (@sec-gtsummary)
- skimr (@sec-skimr)
:::
:::
## Understanding variable types and data types {#sec-variable-data-types}
### Introduction
To know about variable and data types is essential as different types
require different approaches for the analysis.
The following outline of the next sections is slightly adapted from the book. Harris has some measures of categorical variables (mode and index of qualitative variation) explained in the section on numeric variables.
- **Categorical variables**:
- **Central tendency**: The two most commonly used and reported descriptive statistics for categorical (or factor) data types are `r glossary("frequencies")` and `r glossary("percentages")`. Sometimes also the `r glossary("mode xy", "mode")` is used to identify the most common (or typical) category of a factor variable.
- **Spread / Variety**: is for for categorical variables not often reported. Harris mentions the `r glossary("index of qualitative variation")`, which quantifies how much the observations are spread across the categories of a categorical variable.
- **Numerical variables**:
- **Central tendency**: The most important measures are the `r glossary("arithmetic mean", "mean")`, `r glossary("median")` and `r glossary("mode xy", "mode")`.
- **Spread**:
- in relation to the mean: the `r glossary("variance var", "variance")` and `r glossary("standard deviation")`
- in relation to the median: `r glossary("range")`, `r glossary("IQR", "interquartile range")` (IQR) and `r glossary("quantile")`.
For the decision if one should report mean or median in numerical variables the measures for `r glossary("skewness")` and `r glossary("kurtosis")` are helpful. Another issue with numerical variables is usage of `r glossary("scientific notation")`.
I will explain explicitly only to those subjects where I am not firm. This is
::: {#bul-subjects-explained}
:::::{.my-bullet-list}
:::{.my-bullet-list-header}
Bullet List
:::
::::{.my-bullet-list-container}
- categorical variables:
- index of qualitative variation (@sec-index-variation)
- numeric variables:
- skewness (@sec-chap02-skewness)
- mode (@sec-chap02-mode)
- kurtosis (@sec-chap02-kurtosis)
- scientific notation
::::
:::::
Subjects reviewed in this chapter
:::
***
But I will address all the other measures in the examples resp. exercises.
### Get the data
But before we are going to work with the data we have do download it form the `r glossary("BRFSS")` website.
::: {.callout-warning}
Downloading the huge file (69 MB as ZIP and 1 GB unzipped) will take
some minutes. So be patient!
:::
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-get-brfss-2014-data}
: Download the SAS transport file data from the BRFSS website and save
dataframe with selected variables as R object
::::::
:::
::::{.my-r-code-container}
:::{#lst-chap02-get-brfss-2014-data}
```{r}
#| label: get-brfss-2014-data
#| eval: false
#| cache: true
## run this code chunk only once (manually)
url <- "http://www.cdc.gov/brfss/annual_data/2014/files/LLCP2014XPT.zip"
utils::download.file(
url = url,
destfile = tf <- base::tempfile(),
mode = "wb"
)
brfss_2014 <- haven::read_xpt(tf)
brfss_tg_2014 <- brfss_2014 |>
dplyr::select(
TRNSGNDR,
`_AGEG5YR`,
`_RACE`,
`_INCOMG`,
`_EDUCAG`,
HLTHPLN1,
HADMAM,
`_AGE80`,
PHYSHLTH)
data_folder <- base::paste0(here::here(), "/data/")
if (!base::file.exists(data_folder))
{base::dir.create(data_folder)}
chap02_folder <- base::paste0(here::here(), "/data/chap02/")
if (!base::file.exists(chap02_folder))
{base::dir.create(chap02_folder)}
base::saveRDS(object = brfss_tg_2014,
file = paste0(chap02_folder, "/brfss_tg_2014_raw.rds"))
```
Download and save the transgender data 2014 from the BRFSS website
:::
***
The original file has 279 variables and 464664 observations. After selecting
only 9 variable the file is with 31.9 MB (memory usage) and stored
compressed (2.2 MB) much smaller.
::: my-note
::: my-note-header
Four observations about the code
:::
::: my-note-container
1. It is not possible to download the file with
`haven::read_xpt(<URL>)` directly. At first one has to create a
temporary file to store the zipped file.
2. Whenever you meet a variable / row name with a forbidden R syntax
surround the name with `r glossary("backtick", "backticks")` ([grave
accents](https://en.wikipedia.org/wiki/Grave_accent)).
3. Instead of exporting the R object into a `r glossary("CSV", ".csv")`
file I save the data as a compressed R object that can loaded easily
again into the computer memory with the `base::readRDS()` function.
4. With regard to the `base::saveRDS()` function I have to remind
myself that the first argument is the R object *without* quotes. I
committed this error several times.
:::
:::
::::
:::::
## Descriptive analyses for categorical (factor) variables {#sec-chap02-describe-categorical-variables}
The goal of this section is to summarize and interpret the categorical variable `TRNSGNDR`. In contrast to the book I will work with a dataframe consisting only of the variable which I will recode to `transgender`.
### Summarize categorical variables without recoding
::: my-example
::: my-example-header
::: {#exm-chap02-interpret-categorical-variables}
: Summarize categorical variables without recoding
:::
:::
::: my-example-container
::: panel-tabset
###### select()
```{r}
#| label: select-transgender
chap02_folder <- base::paste0(here::here(), "/data/chap02/")
## create transgender_pb ########
## as intermediate data
transgender_pb <-
base::readRDS(base::paste0(chap02_folder,
"brfss_tg_2014_raw.rds")) |>
dplyr::select(TRNSGNDR)
base::saveRDS(object = transgender_pb,
file = base::paste0(chap02_folder, "transgender_pb.rds"))
```
(*This code chunk has no visible output.*)
###### str()
::: my-r-code
::: my-r-code-header
::: {#cnj-chap02-str-transgender}
: Show structure of the `transgender` data with `utils::str()`
:::
:::
::: my-r-code-container
```{r}
#| label: str-transgender
#| lst-label: lst-chap02-str-transgender
#| lst-cap: "Show structure of the `transgender` data with `utils::str()`"
#| cache: true
utils::str(transgender_pb)
```
***
{**haven**} has imported `r glossary("labelled data")` for the variables. We can use these labels to find the appropriate passages in the 126 pages of the
codebook. Just copy the variable label and search this string in
the PDF of the codebook.
------------------------------------------------------------------------
![Behavior Risk Factor Surveillance Systems (BRFSS) Codebook Report,
2014 Land-Line and Cell-Phone
data](img/chap02/transgender-codebook-min.png){#fig-chap02-codebook-transgender
fig-alt="The table displays the sexual orientation and gender identity as an answer of the question 'Do you consider yourself to be transgender?' - The answers were: Yes, male-to-female: 363; Yes, female-to-male: 212; Yes, gender nonconforming: 116; No: 150,765; Don't know / Not sure: 1,138; Refused: 1,468; Not asked or missing: 310,602."}
:::
:::
Note that variable labels are restricted to 40 characters. I think
this is an import limitation of {**haven**} because there is no such
restriction using `haven::labelled()`. But we need to recode the data
anyway, especially as there are no value labels available.
###### summary()
::: my-r-code
::: my-r-code-header
::: {#cnj-chap02-summary-transgender}
: Summarize `transgender` data with `base::summary()`
:::
:::
::: my-r-code-container
```{r}
#| label: summary-transgender
#| cache: true
base::summary(transgender_pb)
```
***
If categories have no labels for their levels, then the `summary()` function is useless.
:::
:::
###### skim()
::: my-r-code
::: my-r-code-header
::: {#cnj-chap02-skim-transgender}
: Summarize transgender data with `skimr::skim()`
:::
:::
::: my-r-code-container
```{r}
#| label: skim-transgender_pb
#| cache: true
skimr::skim(transgender_pb)
```
Again: As long as the categorical variable is of numeric class all the statistics about the distribution of the level values are pointless.
:::
:::
###### gtsummary()
::: my-r-code
::: my-r-code-header
::: {#cnj-chap02-gtsummary-transgender}
: Summarize and display transgender data with `gtsummary::tbl_summary()`
:::
:::
::: my-r-code-container
```{r}
#| label: tbl-gtsummary-transgender
#| tbl-cap: "A firt approach to produce a 'Table 1' statistics with {gtsummary} using labelled data"
#| cache: true
gtsummary::tbl_summary(transgender_pb)
```
------------------------------------------------------------------------
{**gtsummary**} uses here the labelled data imported with {**haven**}.
Not bad, isn't it? Just a one-liner produces this first trial for a pbulic ready
`r glossary("Table 1")` statistics. This first attempt of table would be
even better if the data had also value labels.
For more
information how to work with with labelled data see the full
@sec-chap01-labelled-data with all its sub-sections. You will find
some functions how to adapt @tbl-gtsummary-transgender to get a better
descriptive in @sec-chap01-experiments-gtsummary. But follow also this chapter
along!
:::
:::
:::
:::
:::
To summarize a categorical variable without recoding is generally not purposeful. An exception is `utils::str()` as it display the internal structure including attributes. With `utils::str()` one can detect for example if the data are labelled. If this is the case a quick glance at the data with `gtsummary::tbl_summary()` could be sensible.
### Convert numerical variable to factor and recode its levels
:::::{.my-procedure}
:::{.my-procedure-header}
Categorical variable: Four steps to get sensible values for reporting data
:::
::::{.my-procedure-container}
1. To print a basic `table()` is always a useful first try.
2. Check the `class()` of the variable. For a categorical variable is has to be `factor`.
3. If this not the case then you have to recode the variable `forcats::as_factor`.
4. Then you can `forcats::fct_count()` the different levels of the factor.
5. Finally you can now recode the levels of the factor variable with `forcats::fct_recode()` to match the description in the codebook.
::::
:::::
::: my-example
::: my-example-header
::: {#exm-chap02-data-wrnagling-categorical}
: Convert numerical variable to factor and recode its level
:::
:::
::: my-example-container
::: panel-tabset
###### table()
::: my-r-code
::: my-r-code-header
::: {#cnj-code-chap02-table-transgender}
: Using `base::table()` to count factor levels
:::
:::
::: my-r-code-container
```{r}
#| label: table-transgender
## load transgender_pb ##########
chap02_folder <- base::paste0(here::here(), "/data/chap02/")
transgender_pb <-
base::readRDS(base::paste0(chap02_folder, "transgender_pb.rds"))
base::table(transgender_pb$TRNSGNDR)
```
------------------------------------------------------------------------
Compare the values with the screenshot from the codebook in
@fig-chap02-codebook-transgender.
:::
:::
Normally `table()` is used for cross-classifying factors to build a contingency table of the counts at each combination of factor levels. But here I have only one variable.
###### class()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-class-transgender}
: Check class of categorical variable `TRNSGNDR```
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: class-transgender
class(transgender_pb$TRNSGNDR)
```
::::
:::::
###### as_factor()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-as-factor-transgender}
: Convert numerical variable TRNSGNDR to categorical variable
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: as-factor-transgender
transgender_pb$TRNSGNDR <-
forcats::as_factor(transgender_pb$TRNSGNDR)
class(transgender_pb$TRNSGNDR)
```
Instead of `base::as.factor()` I am using `as_factor()` from the {**forcats**} package which is part of the {**tidyverse**} collection. See @sec-forcats.
::::
:::::
###### fct_count()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-get-levels}
: Get information about the levels of a factor variable
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: get-levels
transgender_pb |>
dplyr::pull(TRNSGNDR) |>
forcats::fct_count()
```
***
`forcats::fct-count()` gives you all the information you need to recode a categorical variable:
1. You will see the number of `NA`s.
2. And even more important: If there are (currently) unused levels then `forcats::fct_count()` will also list them as $0$ observations.
3. You will see if some levels are not used much so you can decide if you should probably merge them to a category of "other". {**forcats**} has several function to support you in this task.
::::
:::::
You can now recode the levels according to the information in the codebook (see @fig-chap02-codebook-transgender).
###### fct_recode()
::: my-r-code
::: my-r-code-header
::: {#cnj-chap02-recode-transgender}
: Recode the `TRNSGNDR` variable to match the codebook levels
:::
:::
::: my-r-code-container
::: {#lst-chap02-recode-transgender}
```{r}
#| label: recode-transgender
#| results: hold
## create transgender_clean #########
transgender_clean <- transgender_pb |>
dplyr::mutate(TRNSGNDR = forcats::fct_recode(TRNSGNDR,
"Male to female" = '1',
"Female to male" = '2',
"Gender nonconforming" = '3',
"Not transgender" = '4',
"Don’t know/Not sure" = '7',
"Refused" = '9'
))
base::summary(transgender_clean)
## saving the variable is useful in the developing stage
## it helps to work with individual code chunks separately
chap02_folder <- base::paste0(here::here(), "/data/chap02/")
base::saveRDS(object = transgender_clean,
file = base::paste0(chap02_folder, "transgender_clean.rds"))
```
Recoded `TRNSGNDR` variable to match the codebook levels
:::
:::
:::
Harris explains in the book the superseded function `dplyr::recode_factor()`. It's new equivalent is now `forcats::fct_recode()`.
:::
:::
:::
### Data management
#### Display categorical variable for reports
After recoding `TRNSGNDR` we have in @lst-chap02-recode-transgender printed the resulted tibble. The output summarizes already understandable the essence of the variable. But we are still missing some information:
1. Besides the frequencies we need also the percentages to better understand the data.
2. As there is a huge amount of missing data, we need to calculate these percentages with and without NA's.
In the following example we discuss functions of different packages to get the desired result.
::: my-example
::: my-example-header
::: {#exm-chap02-data-wrangling-categorical}
: Data management to display categorical variable for reports
:::
:::
::: my-example-container
::: panel-tabset
###### skim()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-skim-transgender-clean}
: Display summary of the `TRNSGNDR` variable with `skimr::skim()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: skim-transgender-clean
skimr::skim(transgender_clean)
```
***
The result of `skimr::skim()` for categorical variable is somewhat disappointing. It does not report the levels of the variable. The abbreviations of the top counts (4 levels of 6) are not understandable.
::::
:::::
###### describe()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-describe-transgender}
: Display summary of the `TRNSGNDR` variable with `Hmisc::describe()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: describe-transgender
## load transgender_clean ##########
chap02_folder <- base::paste0(here::here(), "/data/chap02/")
transgender_clean <-
base::readRDS(base::paste0(chap02_folder, "transgender_clean.rds"))
Hmisc::describe(transgender_clean)
```
***
`Hmisc::describe()` is a completely new function for me. Sometimes I met in my reading the {**Hmisc**} package, but I have never applied functions independently. (Fore more information on {**Hmisc**} see @sec-Hmisc.)
In contrast to other methods it does not list the levels vertically but horizontally. This is unfortunately not super readable and one needs --- at least in this case --- to use the horizontal scroll bar.
But is has the advantage to display not only the frequencies but also the percentages.
::::
:::::
###### {report}
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-report-transgender}
: Display summary of the `TRNSGNDR` variable with `report::report_table()`
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: report-transgender
#| comment: ''
#| results: markup
report::report_table(transgender_clean)
```
I just learned about the existence of {**report**}. It is specialized to facilitate reporting of results and statistical models (See: @sec-report). The one-liner shows the levels of the variable, frequencies (`n_obs`) and percentages. Not bad!
In this example I have used the `report::report_table()` function because the verbal result of the standard `report::report()` function is rather underwhelming as you can see:
***
`r report::report(transgender_clean)`
***
Compare this to the example of the book:
::: {#rep-chap02-transgender1}
::: {.callout-tip style="color: darkgreen;"}
The 2014 BRFSS had a total of 464,664 participants. Of these, 310,602 (66.8%) were not asked or were otherwise missing a response to the `r glossary("transgender")` status question. A few participants refused to answer (n = 1,468; 0.32%), and a small number were unsure of their status (n = 1,138; 0.24%). Most reported being not transgender (n = 150,765; 32.4%), 116 were `r glossary("gender nonconforming", "gender non-conforming")` (0.03%), 212 were female to male (0.05%), and 363 were male to female (0.08%).
:::
Do you consider yourself to be transgender? (Figures with missing values)
:::
(There is another report of the `TRNSGNDR` without the many `NA`s: See @rep-chap02-transgender2 for a comparison.)
But it seems to me that with more complex results (e.g., reports from models) {**report**} is quite useful. In the course of this book I will try it out and compare with the reports of other functions.
***
Another thought: I have filed an issue because I think that it would be a great idea to provide a markdown compatible table so that Quarto resp. pandoc could interpret as a table and visualizing it accordingly. The table above would then appear as the following (copied and slightly edited) example:
Variable | Level | n_Obs | percentage_Obs
---------|----------------------|--------|----------------
TRNSGNDR | Male to female | 363 | 0.08
TRNSGNDR | Female to male | 212 | 0.05
TRNSGNDR | Gender nonconforming | 116 | 0.02
TRNSGNDR | Not transgender | 150765 | 32.45
TRNSGNDR | Don’t know/Not sure | 1138 | 0.24
TRNSGNDR | Refused | 1468 | 0.32
TRNSGNDR | missing | 310602 | 66.84
: Slightly modified {**report**} table to get a Markdown compatible table
::::
:::::
###### na.omit()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-na-omit-transgender}
: Display summary of the `TRNSGNDR` variable with the recoding method from the book
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: na-omit-transgender
transgender_clean |>
dplyr::group_by(TRNSGNDR) |>
dplyr::summarize(n = dplyr::n()) |>
dplyr::mutate(perc_all = 100 * (n / sum(n))) |>
dplyr::mutate(perc_valid = 100 * (n / sum(n[na.omit(TRNSGNDR)])))
```
::::
:::::
:::::{.my-watch-out}
:::{.my-watch-out-header}
WATCH OUT! Wrong `perc_valid` cell value in `NA` row
:::
::::{.my-watch-out-container}
I have the same logic used as the author in the book and got the same result too. But the result of one cell is wrong! The `perc_valid` cell value in the `NA` row should be empty, but it shows the value 202 (rounded).
:::::
::::
###### freq()
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-freq-transgender}
: Display summary of the `TRNSGNDR` variable with `descr::freq()`
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap02-freq-transgender}
```{r}
#| label: freq-transgender
descr::freq(transgender_clean$TRNSGNDR, plot = FALSE)
```
Using `descr::freq()` to calculate values with and without NA's
:::
This is the first time I used the {**descr**} package and it shows so far the best result! (See @sec-descr.) The one-liner shows levels, frequencies, percentage with and without missing values. It even plots a bar graph, but this is not useful here, so I omitted it with `plot = FALSE`.
The only disadvantage: One has to learn a new package. And --- searching about packages about descriptive statistics I learned that there are at least 10 other packages.
::::
:::::
###### {tidyverse}
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap02-full-join-transgender}
: Display summary of the `TRNSGNDR` variable with my recoding method using {**dplyr**} and {**tidyr**} from the {**tidyverse**}
::::::
:::
::::{.my-r-code-container}
```{r}
#| label: full-join-transgender
tg1 <- transgender_clean |>
dplyr::group_by(TRNSGNDR) |>
dplyr::summarize(n = dplyr::n()) |>
dplyr::mutate(perc_all = 100 * (n / sum(n)))
tg2 <- transgender_clean |>
dplyr::group_by(TRNSGNDR) |>
tidyr::drop_na() |>
dplyr::summarize(n = dplyr::n()) |>
dplyr::mutate(perc_valid = 100 * (n / sum(n)))
dplyr::full_join(tg1, tg2, by = dplyr::join_by(TRNSGNDR, n))
```
Here I created a tibble with frequencies and percentages with and another one without missing data. Then I join these two tibbles together.
Although this code with packages from the {**tidyverse**} collection is more complex and verbose than the one-liner from @lst-chap02-freq-transgender using {**descr**} it has the advantage that one does not need to install and learn a new package. I believe that sometimes it is more efficient to be proficient with some packages than to know many packages superficially.
::::
:::::
::: {#rep-chap02-transgender2}
::: {.callout-tip style="color: darkgreen;"}
The 2014 BRFSS had a total of 464,664 participants. Of these, 310,602 (66.8%) were not asked or were otherwise missing a response to the `r glossary("transgender")` status question. Of the 33.2% who responded, some refused to answer (n = 1,468; 0.95%), and a small number were unsure of their status (n = 1,138, 0.74%). Most reported being not transgender (n = 150,765; 97.9%), 116 were `r glossary("gender nonconforming", "gender non-conforming")` (0.08%), 212 were female to male (0.14%), and 363 were male to female (0.24%).
:::
Do you consider yourself to be transgender? (Figures without missing values)
:::
Compare with @rep-chap02-transgender1.
:::
:::
:::
#### Index of qualitative variation {#sec-index-variation}
##### Introduction
The `r glossary("index of qualitative variation")` (IQV) quantifies how much the observations are spread across categories of a categorical variable. While these indexes are computed in different ways, they all range from 0 to 1. The resulting values are high when observations are spread out among categories and low when they are not.
If, for instance, a variable has the same amount of data in each of its levels, e.g. the data are perfectly spread across groups, then the index value would be 1. If all variable levels are empty but one, then there is no spread at all and the index value would be 0.
:::::{.my-assessment}
:::{.my-assessment-header}
:::::: {#cor-chap02-index-variation}
: Assessment of spread in categorical variables with the index of qualitative variation
::::::
:::
::::{.my-assessment-container}
The index ranges from 0 to 1:
- **0**: No spread at all: All observations are in only one group.
- **1**: Data perfectly spread over all levels of the categorical variable: all levels have the same amount of observations.
The qualitative assessment would be "low" or "high" depending if the index is (far) below or (far) above 0.5. But the most important use is the comparison of the proportions of levels in different observation. For a practical application see @tbl-compare2-iqv.
::::
:::::
Harris recommends the `qualvar::B()` function. But the `B` index relies on the geometric mean and is therefore not usable if one of the proportions equals to $0$, e.g., if one of the category levels has no observation the result is wrong. It returns always $0$ independently of the frequency of categories in other levels.
The {**qualvar**} package has six indices of qualitative variations: The value of these standardized indices does not depend on the number of categories or number of observations:
- **ADA**: Average Deviation Analog
- **B**: modified geometric mean
- **DM**: Deviation from the mode (DM)
- **HREL**: Shannon Index for computing the "surprise" or uncertainty.
- **MDA**: Mean Difference Analog
- **VA**: Variance Analog
With the exception of two (`B` and `HREL`) these indices do not have problems with proportions that equals $0$ (HREL returns `NaN`).
:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap02-iqv}
Resources about indices of variation (IQVs)
::::::
:::
::::{.my-resource-container}
- A [vignette by Joël Gombin](https://cran.r-project.org/web/packages/qualvar/vignettes/wilcox1973.html) explains the origins of the several indices in the {**qualvar**} package.
- Gombin’s vignette references as source the article [Indices of Qualitative Variation and Political Measurement](https://www.jstor.org/stable/446831) by Allen R. Wilcox [-@wilcox1973].
- The explication of the three indices calculated with the `iqv()` function of the {**statpsych**} package can be found in [Introduction to Categorical Analysis](https://bpb-us-e1.wpmucdn.com/sites.ucsc.edu/dist/2/1389/files/2023/12/part3.pdf) by Douglas G. Bonett [-@bonettb, p.82f.].
- [Wikipedia](https://en.wikipedia.org/wiki/Qualitative_variation) has a more complete list of IQVs.
::::
:::::
##### Experimenting with different IQV's
In @def-hadmam-iqv I am going to apply several indices for qualitative variations using frequencies of the `HADMAM`variable. Tab "IQV 1" the calculation uses the original frequencies of the categorical variable but without missing values. In tab "IQV 2" I added a new level with no observations (frequency = 0).
:::::{.my-experiment}
:::{.my-experiment-header}
:::::: {#def-hadmam-iqv}
: Computing the index of qualitative variation (IQV) for the `HADMAM` variable.
::::::
:::
::::{.my-experiment-container}
::: {.panel-tabset}
###### IQV 1
:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-compare-iqv}
: Using different IQVs with the `HADMAM` variable
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap02-compare-iqv}
```{r}
#| label: compare-iqv
#| results: hold
hadmam_clean <- readRDS("data/chap02/hadmam_clean.rds")
x <- hadmam_clean$n[1:4]
glue::glue("The frequencies of the `HADMAM` variable (without missing values) are:")
x
glue::glue("Applying `statpsych::iqv(x)` to a vector of these frequenecies:")
statpsych::iqv(x)
```
Computing the index of qualitative variation (IQV) with `statpsych::iqv(x)`
:::
With {**statpsych**} I have found another package that computes three indices of qualitative variations (see: @sec-statpsych). They have different names (Simpson, Berger and Shannon) but correspond to one of the Wilcox indices:
- Simpson = VA
- Berger = DM
- Shannon = HREL
Compare these three values with @tbl-table-comp-iqv. Again I have used the frequencies of the `HADMAM` variable: