Suggestion of new function: describe_missing() #561

Draft. Wants to merge 13 commits into base: main
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
-Version: 0.13.0.12
+Version: 0.13.0.13
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
1 change: 1 addition & 0 deletions NAMESPACE
@@ -267,6 +267,7 @@ export(data_write)
export(degroup)
export(demean)
export(describe_distribution)
export(describe_missing)
export(detrend)
export(distribution_coef_var)
export(distribution_mode)
4 changes: 4 additions & 0 deletions NEWS.md
@@ -5,6 +5,10 @@ BREAKING CHANGES
* Argument `drop_na` in `data_match()` is deprecated now. Please use `remove_na`
instead.

NEW FUNCTIONS

* `describe_missing()`, to comprehensively report on missing values in a data frame.

CHANGES

* The `select` argument, which is available in different functions to select
115 changes: 115 additions & 0 deletions R/describe_missing.R
@@ -0,0 +1,115 @@
#' @title Describe Missing Values in Data According to Guidelines
#'
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
#' column lists or scales, following recommended guidelines. Some authors recommend
#' reporting item-level missingness per scale, as well as a participant's maximum
#' number of missing items by scale. For example, Parent (2013) writes:
#'
#' *I recommend that authors (a) state their tolerance level for missing data by scale
#' or subscale (e.g., "We calculated means for all subscales on which participants gave
#' at least 75% complete data") and then (b) report the individual missingness rates
#' by scale per data point (i.e., the number of missing values out of all data points
#' on that scale for all participants) and the maximum by participant (e.g., "For Attachment
#' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant
#' missing more than a single data point").*
Member commented:

This seems too focused on survey data, whereas this function could be useful for all kinds of data. I'd rather keep the first one or two sentences here and move the rest to a dedicated section in 'Details' (but even there, this seems very field-specific).

Member Author commented:

I moved everything after "Some authors recommend" to @details.

Also, the way I see it, many packages and functions can already report basic missing-data features, like skimr::skim() (that's the "easy" part). What is missing is a way to handle, as you highlight, survey data in that field-specific way. I thought it still fits with datawizard even if it offers additional field-specific features, although we can probably try to make it more general for other users. In the details section, I added a paragraph giving more context about scales as used in psychology:

#' In psychology, it is common to ask participants to answer questionnaires in
#' which people answer several questions about a specific topic. For example,
#' people could answer 10 different questions about how extroverted they are.
#' In turn, researchers calculate the average for those 10 questions (called
#' items). These questionnaires are called (e.g., Likert) "scales" (such as the
#' Rosenberg Self-Esteem Scale, also known as the RSES).
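
As a rough illustration of the scale scoring described in that paragraph, the following base R sketch averages four hypothetical extroversion items per participant and applies the 75% completeness tolerance mentioned in the Parent (2013) quote. The column names, the `scale_mean()` helper, and its `min_complete` argument are assumptions made up for this example; none of them are part of the PR.

```r
# Hypothetical data: four extroversion items for three participants.
df <- data.frame(
  extro_1 = c(4, NA, 2),
  extro_2 = c(5, NA, NA),
  extro_3 = c(3, 1, 3),
  extro_4 = c(NA, 2, 4)
)

# Hypothetical helper: row-wise mean of the items, kept only when the
# participant answered at least `min_complete` (a proportion) of them.
scale_mean <- function(items, min_complete = 0.75) {
  answered <- rowMeans(!is.na(items))       # proportion of items answered
  means <- rowMeans(items, na.rm = TRUE)    # mean of the answered items
  means[answered < min_complete] <- NA      # below tolerance: no score
  unname(means)
}

df$extro_mean <- scale_mean(df[c("extro_1", "extro_2", "extro_3", "extro_4")])
df$extro_mean
# 4 NA 3
```

The second participant answered only two of four items (50%), so no scale score is computed for them. This is the kind of per-participant missingness that `na_max` and `all_na` are meant to surface.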

Member Author @rempsyc commented on Dec 17, 2024:

I suppose one question we have to answer is: do we want describe_missing to report only basic, field-general missing-data info, a bit more like skim(), or do we also want it to include the features specific to the survey format? (Said another way: should we remove or keep the survey feature?)

#'
#' @param data The data frame to be analyzed.
#' @param vars Variable (or lists of variables) to check for missing values (NAs).
Member commented:

We use select, exclude, etc. in all other data frame functions; I think we should here as well.

Member Author @rempsyc commented on Dec 17, 2024:

Here it works a little differently from select elsewhere. vars takes a list of character vectors (such as list(c("openness_1", "openness_2", "openness_3"), c("extroversion_1", "extroversion_2", "extroversion_3"))) to take into account the nested structure of the items / columns. I can rename it to select, but do you think that would create confusion, or the expectation that it relies on and works with .select_nse? Or should we include select and exclude in addition to vars? I'm not sure how .select_nse could accommodate the nested structure like I'm doing right now 🤔
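
To make the nested structure concrete, here is a minimal base R sketch (not the PR's implementation) that counts missing cells per group of columns. The group names `first_half` and `second_half` are hypothetical labels chosen for this example; `airquality` ships with R.

```r
# Each list element is one group of columns to summarise together.
vars <- list(
  first_half = c("Ozone", "Solar.R", "Wind"),
  second_half = c("Temp", "Month", "Day")
)

# Count missing cells within each group; one result per list element.
na_by_group <- vapply(
  vars,
  function(cols) sum(is.na(airquality[, cols, drop = FALSE])),
  integer(1)
)
na_by_group
```

A flat character vector of the same six names would lose the boundary between the two groups, which seems to be the difficulty raised about fitting this into a `select`-style argument.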

#' @param scales The scale names to check for missing values (as a character vector).
#' @keywords missing values NA guidelines
#' @return A dataframe with the following columns:
#' - `var`: Variables selected.
#' - `items`: Number of items for selected variables.
Member commented:

I think unique_values instead of items would be clearer.

Member Author commented:

Hum, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should use that instead, as I'm afraid unique_values would suggest unique responses for a given column.

Member Author @rempsyc commented on Dec 17, 2024:

It is indeed specific, as in psychology we tend to think of variables as made of several "items". So items 1-10 create a variable such as the personality trait "extroversion". I'm not sure what to call it, because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to your suggestions if you have more. A more accurate name (for psychology) would be n_items, so perhaps we can do n_columns?

#' - `na`: Number of missing cell values for those variables (e.g., 2 missing
#' values for the first participant + 2 missing values for the second participant
#' = total of 4 missing values).
#' - `cells`: Total number of cells (i.e., number of participants multiplied by
#' the number of variables, `items`).
#' - `na_percent`: The percentage of missing values (`na` divided by `cells`).
#' - `na_max`: The number of missing values for the participant with the most
#' missing values for the selected variables.
#' - `na_max_percent`: The amount of missing values for the participant with
#' the most missing values for the selected variables, as a percentage
#' (i.e., `na_max` divided by the number of selected variables, `items`).
#' - `all_na`: The number of participants missing 100% of items for that scale
#' (the selected variables).
#'
#' @export
#' @references Parent, M. C. (2013). Handling item-level missing
#' data: Simpler is just as good. *The Counseling Psychologist*,
#' *41*(4), 568-600. https://doi.org/10.1177/0011000012445176
#' @examples
#' # Use the entire data frame
#' describe_missing(airquality)
#'
#' # Use selected columns explicitly
#' describe_missing(airquality,
#'   vars = list(
#'     c("Ozone", "Solar.R", "Wind"),
#'     c("Temp", "Month", "Day")
#'   )
#' )
#'
#' # If the questionnaire items start with the same name, e.g.,
#' set.seed(15)
#' fun <- function() {
#'   c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
#' }
#' df <- data.frame(
#'   ID = c("idz", NA),
#'   open_1 = fun(), open_2 = fun(), open_3 = fun(),
#'   extrovert_1 = fun(), extrovert_2 = fun(), extrovert_3 = fun(),
#'   agreeable_1 = fun(), agreeable_2 = fun(), agreeable_3 = fun()
#' )
#'
#' # One can list the scale names directly:
#' describe_missing(df, scales = c("ID", "open", "extrovert", "agreeable"))
describe_missing <- function(data, vars = NULL, scales = NULL) {
  # Resolve the column groups to summarise. `vars` takes precedence over
  # `scales`; with neither supplied, all columns form a single group.
  # is.null() is used rather than missing() so that an explicit NULL
  # (the default value) behaves the same as omitting the argument.
  if (!is.null(vars)) {
    vars.internal <- vars
  } else if (!is.null(scales)) {
    # Each scale name is treated as a column-name prefix.
    vars.internal <- lapply(scales, function(x) {
      grep(paste0("^", x), names(data), value = TRUE)
    })
  } else {
    vars.internal <- names(data)
  }
  if (!is.list(vars.internal)) {
    vars.internal <- list(vars.internal)
  }
  # Summary over the whole data frame.
  na_df <- .describe_missing(data)
  if (!is.null(vars) || !is.null(scales)) {
    # One row per column group, followed by the overall "Total" row.
    na_list <- lapply(vars.internal, function(x) {
      data_subset <- data[, x, drop = FALSE]
      .describe_missing(data_subset)
    })
    na_df$var <- "Total"
    na_df <- do.call(rbind, c(na_list, list(na_df)))
  }
  na_df
}

.describe_missing <- function(data) {
  # Label the summary row with the first and last column names.
  my_var <- paste0(names(data)[1], ":", names(data)[ncol(data)])
  items <- ncol(data)
  na <- sum(is.na(data))
  cells <- nrow(data) * ncol(data)
  na_percent <- round(na / cells * 100, 2)
  # Largest number of missing values in any single row.
  na_max <- max(rowSums(is.na(data)))
  na_max_percent <- round(na_max / items * 100, 2)
  # Number of rows in which every selected column is missing.
  all_na <- sum(apply(data, 1, function(x) all(is.na(x))))

  data.frame(
    var = my_var,
    items = items,
    na = na,
    cells = cells,
    na_percent = na_percent,
    na_max = na_max,
    na_max_percent = na_max_percent,
    all_na = all_na
  )
}
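
A small worked example may help when checking the documented column definitions against the helper. This sketch recomputes the same quantities by hand on a hypothetical data frame with three participants and two items (`toy` and its column names are made up for illustration) rather than calling the package function, which is still under review.

```r
# Hypothetical data: 3 participants (rows) x 2 items (columns).
toy <- data.frame(
  item_1 = c(1, NA, NA),
  item_2 = c(2, 3, NA)
)

is_na <- is.na(toy)
items <- ncol(toy)                                # 2 columns
cells <- nrow(toy) * items                        # 3 * 2 = 6 cells
na <- sum(is_na)                                  # 3 missing cells
na_percent <- round(na / cells * 100, 2)          # 3 / 6 = 50
na_max <- max(rowSums(is_na))                     # 2 (third participant)
na_max_percent <- round(na_max / items * 100, 2)  # 2 / 2 = 100
all_na <- sum(rowSums(is_na) == items)            # 1 participant missing all items
```

Note that `na_percent` is computed over cells, while `na_max_percent` is computed over items for a single participant, so the two percentages are not directly comparable.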
14 changes: 5 additions & 9 deletions inst/WORDLIST
@@ -8,14 +8,13 @@ CMD
Carle
Catran
Crosstables
Dhaliwal
Disaggregating
DOI
De
Dom
Dhaliwal
Disaggregating
EFC
Enders
EUROFAMCARE
Enders
Fairbrother
GLMM
Gelman
@@ -54,7 +53,6 @@ Winsorizing
al
behaviour
behaviours
bmwiernik
codebook
codebooks
coercible
@@ -77,7 +75,6 @@ joss
labelled
labelling
leptokurtic
lifecycle
lm
lme
meaned
@@ -88,7 +85,6 @@ modelling
nd
panelr
partialization
patilindrajeets
platykurtic
poorman
pre
@@ -102,7 +98,6 @@ recodes
recoding
recodings
relevel
rempsyc
reproducibility
rescale
rescaled
@@ -111,7 +106,8 @@ rio
rowid
sd
stackexchange
strengejacke
subscale
subscales
tailedness
th
tibble
86 changes: 86 additions & 0 deletions man/describe_missing.Rd

1 change: 1 addition & 0 deletions pkgdown/_pkgdown.yaml
@@ -66,6 +66,7 @@ reference:
- data_tabulate
- data_peek
- data_seek
- describe_missing
- means_by_group
- contains("distribution")
- kurtosis