Suggestion of new function: `describe_missing()` #561

rempsyc · 2024-11-11T11:03:12Z

Fixes #454

etiennebacher

Thank you, I think it would be good to have describe_missing() but the way it is implemented and documented looks very field-specific to me. I find the output of skimr::skim() easier to understand with n_missing and complete_rate for instance. I'm also not familiar at all with aggregating stats on missing values across several variables (e.g. Ozone:Wind) and the default output looks unexpected to me (I'd rather expect one row per variable).
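
For reference, the one-row-per-variable layout mentioned above can be sketched in a few lines of base R. This is only an illustration, not the skimr or datawizard implementation: the helper name `missing_by_variable()` is hypothetical, and the column names `n_missing` and `complete_rate` are borrowed from the skim() output referred to in the comment.

```r
# Illustration only: base R, not the skimr or datawizard implementation.
# `missing_by_variable()` is a hypothetical helper name.
missing_by_variable <- function(data) {
  # Count NAs in each column (sum() of a logical vector is an integer)
  n_missing <- vapply(data, function(x) sum(is.na(x)), integer(1))
  data.frame(
    variable = names(data),
    n_missing = unname(n_missing),
    # Share of non-missing values in each column
    complete_rate = round(1 - unname(n_missing) / nrow(data), 4),
    row.names = NULL
  )
}

# One row per column of the built-in airquality data
missing_by_variable(airquality)
```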

etiennebacher · 2024-11-12T14:19:02Z

R/describe_missing.R

+#' @description Provides a detailed description of missing values in a data frame.
+#' This function reports both absolute and percentage missing values of specified
+#' column lists or scales, following recommended guidelines. Some authors recommend
+#' reporting item-level missingness per scale, as well as a participant's maximum
+#' number of missing items by scale. For example, Parent (2013) writes:
+#'
+#' *I recommend that authors (a) state their tolerance level for missing data by scale
+#' or subscale (e.g., "We calculated means for all subscales on which participants gave
+#' at least 75% complete data") and then (b) report the individual missingness rates
+#' by scale per data point (i.e., the number of missing values out of all data points
+#' on that scale for all participants) and the maximum by participant (e.g., "For Attachment
+#' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant
+#' missing more than a single data point").*


This sounds a bit too focused on survey data, whereas this function can be useful for all kinds of data. I'd rather keep the first one or two sentences here and move the rest to a specific section in 'Details' (but even there, this seems very field-specific).

I moved everything after "Some authors recommend" to @details.

Also, the way I see it, a lot of packages and functions can already report basic missing-data features, like skimr::skim() (that's the "easy" part). What is missing is a way to handle, as you highlight, survey data in that field-specific way. I thought it would still fit with datawizard even if it offers additional field-specific features, although we can probably try to make it more general for other users. In the details section, I added a paragraph giving more context about scales as used in psychology:

#' In psychology, it is common to ask participants to answer questionnaires in
#' which people answer several questions about a specific topic. For example,
#' people could answer 10 different questions about how extroverted they are.
#' In turn, researchers calculate the average for those 10 questions (called
#' items). These questionnaires are called (e.g., Likert) "scales" (such as the
#' Rosenberg Self-Esteem Scale, also known as the RSES).

I suppose one question we have to answer is: do we want describe_missing() to report only basic, field-general missing info, a bit more like skim(), or do we also want it to include the features specific to the survey format? (Said another way, should we remove or keep the survey feature?)

etiennebacher · 2024-11-12T14:23:25Z

R/describe_missing.R

+#' missing more than a single data point").*
+#'
+#' @param data The data frame to be analyzed.
+#' @param vars Variable (or lists of variables) to check for missing values (NAs).


We use select, exclude, etc. in all other dataframe functions, I think we should here as well.

Here it works a little differently from select elsewhere. vars takes a list of character vectors (such as list(c("openness_1", "openness_2", "openness_3"), c("extroversion_1", "extroversion_2", "extroversion_3"))) to take into account the nested structure of the items / columns. I can rename it to select, but do you think that would create confusion, or expectations that it should rely on and work with .select_nse? Or should we include select and exclude in addition to vars? I'm not sure how .select_nse could accommodate the nested structure I'm using right now 🤔
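
To make the nested structure concrete, here is a minimal base R sketch of how a list of character vectors could be turned into one summary row per group of columns, including the "maximum by participant" figure from the Parent (2013) quote. It is not the code from this PR: the helper name `scale_missingness()` and the toy data are assumptions made for illustration, and the output column names simply mirror the ones discussed in this thread.

```r
# Hypothetical helper, for illustration only; not the code from this PR.
scale_missingness <- function(data, vars) {
  rows <- lapply(vars, function(cols) {
    # Logical matrix: TRUE where a cell is missing
    na_mat <- is.na(data[, cols, drop = FALSE])
    data.frame(
      # Label multi-column groups as "first:last"
      variable = if (length(cols) > 1) {
        paste0(cols[1], ":", cols[length(cols)])
      } else {
        cols
      },
      n_columns = length(cols),
      n_missing = sum(na_mat),
      cells = length(na_mat),
      missing_percent = round(100 * mean(na_mat), 2),
      # Largest number of missing items for any single participant (row)
      missing_max = max(rowSums(na_mat))
    )
  })
  do.call(rbind, rows)
}

# Toy data with a known missingness pattern (assumed, not from the PR)
toy <- data.frame(
  openness_1 = c(1, NA, 3, 4),
  openness_2 = c(NA, NA, 2, 5),
  extroversion_1 = c(2, 3, NA, 1),
  extroversion_2 = c(4, 4, 4, 4)
)

scale_missingness(
  toy,
  vars = list(
    c("openness_1", "openness_2"),
    c("extroversion_1", "extroversion_2")
  )
)
```

Each element of the list becomes one output row, which is why a flat character vector (as accepted by select elsewhere) cannot express the same grouping on its own.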


etiennebacher · 2024-11-12T14:28:31Z

R/describe_missing.R

+#' @keywords missing values NA guidelines
+#' @return A dataframe with the following columns:
+#'  - `var`: Variables selected.
+#'  - `items`: Number of items for selected variables.


I think unique_values instead of items would be clearer.

Hmm, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should say that instead, as I'm afraid unique_values would suggest unique responses for a given column.

It is indeed field-specific, as in psychology we tend to think of variables as made up of several "items". So items 1-10 form a variable such as the personality trait "extroversion". I'm not sure what to call it, because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to other suggestions. A more accurate name (for psychology) would be n_items, so perhaps we could use n_columns?


rempsyc · 2024-12-17T02:36:33Z

Thanks for the feedback and comments! We can definitely rename the columns for more clarity, e.g., using missing_ instead of na_, along with the other suggestions (I initially chose na to make shorter column names so the whole output could fit on my rather narrow console). I can also add a new column complete_rate to mirror skim(). Otherwise, skim() and describe_missing() have the same overall structure (variables in the first column and aggregate stats in the other columns).

> the default output looks unexpected to me (I'd rather expect one row per variable).

There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the : indicator to show which variables each row includes).

But if I understand correctly, you would like the default to report one row per column, for all columns, instead of reporting all columns as a single aggregate (i.e., always exactly 1 row). Although this would create a long output for large datasets, that could work.

rempsyc · 2024-12-17T03:20:19Z

OK, so I changed the default so that when no scales or variables are specified, all columns are reported on separate rows. However, this behaviour is overridden if scales or variables are specified:

library(datawizard)

# Use the entire data frame
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#>           variable n_columns n_missing cells missing_percent complete_percent
#> 1               ID         1         7    14           50.00            50.00
#> 2       openness_1         1         4    14           28.57            71.43
#> 3       openness_2         1         4    14           28.57            71.43
#> 4       openness_3         1         3    14           21.43            78.57
#> 5   extroversion_1         1         6    14           42.86            57.14
#> 6   extroversion_2         1         6    14           42.86            57.14
#> 7   extroversion_3         1         5    14           35.71            64.29
#> 8  agreeableness_1         1         3    14           21.43            78.57
#> 9  agreeableness_2         1         4    14           28.57            71.43
#> 10 agreeableness_3         1         3    14           21.43            78.57
#> 11           Total        10        45   140           32.14            67.86
#>    missing_max missing_max_percent all_missing
#> 1            1                 100           7
#> 2            1                 100           4
#> 3            1                 100           4
#> 4            1                 100           3
#> 5            1                 100           6
#> 6            1                 100           6
#> 7            1                 100           5
#> 8            1                 100           3
#> 9            1                 100           4
#> 10           1                 100           3
#> 11          10                 100           2

# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

# Otherwise you can provide nested columns manually:
describe_missing(df,
                 select = list(
                   c("ID"),
                   c("openness_1", "openness_2", "openness_3"),
                   c("extroversion_1", "extroversion_2", "extroversion_3"),
                   c("agreeableness_1", "agreeableness_2", "agreeableness_3")
                 )
)
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

Created on 2024-12-16 with reprex v2.1.1
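
The reprex does not show how the scale names passed to scales are matched to columns. One plausible reading of "items start with the same name" is simple prefix matching, sketched below in base R. This is an assumption, not the implementation in this PR, and `columns_by_prefix()` is a hypothetical helper name.

```r
# Hypothetical helper: resolve scale-name prefixes to groups of columns.
# Assumes plain prefix matching, which may differ from the PR's actual logic.
columns_by_prefix <- function(data, scales) {
  groups <- lapply(scales, function(prefix) {
    names(data)[startsWith(names(data), prefix)]
  })
  stats::setNames(groups, scales)
}

# Toy data (assumed, not from the PR)
toy <- data.frame(
  ID = 1:2,
  openness_1 = c(1, NA), openness_2 = c(2, 3),
  extroversion_1 = c(NA, NA)
)

columns_by_prefix(toy, c("ID", "openness", "extroversion"))
```

Note that plain prefix matching can over-match: a prefix such as "open" would also capture any other column whose name begins with those letters, so distinct scales need distinct prefixes.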

etiennebacher · 2024-12-17T15:31:13Z

I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd prefer a "generalist" function à la skimr over something specialized for psychology, which I think could live in the rempsyc package.

@easystats/core-team what do you think? Are you interested in having some of those field-specific features in this function?

mattansb · 2024-12-18T06:49:21Z

I tend to agree. This function should be more general-purpose, and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)

codecov · 2024-12-18T15:01:41Z

Codecov Report

Attention: Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.25%. Comparing base (81dd0e0) to head (0e83588).
Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| R/describe_missing.R | 95.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%     
==========================================
  Files          76       77       +1     
  Lines        6045     6144      +99     
==========================================
+ Hits         5510     5607      +97     
- Misses        535      537       +2


DominiqueMakowski · 2024-12-18T15:06:50Z

If I understand correctly, the main outstanding issue is what to do with the "scales" argument. I would indeed remove it (soz Rémi ^^) and replace it with a by argument, as in our other functions. If users want to compute the amount of missing data per dimension, they should use a more traditional approach: first pivot to longer and then run describe_missing(select="item", by="dimension"). Otherwise, I'm afraid it gets messy if we have a bespoke scales argument only for this function.

rempsyc · 2024-12-18T15:42:10Z

Alright, in this case, I think I can introduce select, exclude, and by and make it more consistent with the rest of datawizard 🤓

rempsyc · 2024-12-19T19:39:29Z

Alright, this is a much simplified version, which now also supports "by". This is what I have so far:

library(datawizard)

describe_missing(airquality, select = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Ozone        37           24.18            75.82
#> 2  Solar.R         7            4.58            95.42
#> 3     Wind         0            0.00           100.00
#> 4     Temp         0            0.00           100.00
#> 5    Total        44            7.19            92.81

describe_missing(airquality, exclude = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Month         0               0              100
#> 2      Day         0               0              100
#> 3    Total         0               0              100

# Testing the 'by' argument for survey scales
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)

df_long <- reshape_longer(
  df,
  select = -1,
  names_sep = "_",
  names_to = c("dimension", "item"))

describe_missing(df_long, 
                 select = -c(1, 3), 
                 by = "dimension")
#>        variable n_missing missing_percent complete_percent
#> 1 agreeableness        10           23.81            76.19
#> 2  extroversion        17           40.48            59.52
#> 3      openness        11           26.19            73.81
#> 4         Total        38           15.08            84.92

Created on 2024-12-19 with reprex v2.1.1

Anything else you'd find desirable in the function?

Suggestion of new function: describe_missing()

f879900

Fixes #454

rempsyc marked this pull request as draft November 11, 2024 11:31

rempsyc added 3 commits November 11, 2024 21:25

Suggestion of new function: describe_missing()

ab9f006

Fixes #454

styler, update dic

218b7f4

Suggestion of new function: describe_missing()

ebaeb68

Fixes #454

rempsyc marked this pull request as ready for review November 11, 2024 21:19

rempsyc requested a review from etiennebacher November 11, 2024 21:19

news.md

c3c1302

etiennebacher requested changes Nov 12, 2024

View reviewed changes

Merge branch 'main' into rempsyc/issue454

357dbbc

rempsyc marked this pull request as draft December 2, 2024 15:55

rempsyc and others added 2 commits December 16, 2024 19:00

Merge branch 'main' into rempsyc/issue454

0c25fef

Update R/describe_missing.R

fbdd26d

Co-authored-by: Etienne Bacher <[email protected]>

rempsyc added 2 commits December 16, 2024 22:36

address comments and suggestions

72041f5

update snapshots, wordlist, lintrs, styler, note

835b3bb

etiennebacher mentioned this pull request Dec 17, 2024

Release 1.0.0 #574

Draft

Merge branch 'main' into rempsyc/issue454

0e83588

rework describe_missing

e8d393d

styler, lints

f26f247
