Commit
tentatively add frequency grids
lhdjung committed Dec 1, 2023
1 parent be37cb0 commit 4de2043
Showing 8 changed files with 123 additions and 47 deletions.
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -28,6 +28,8 @@ VignetteBuilder: knitr
Collate:
'counts.R'
'frequencies.R'
'frequency-grid-df.R'
'frequency-grid-plot.R'
'mode-proper.R'
'mode-df.R'
'mode-possible.R'
42 changes: 33 additions & 9 deletions R/frequency-grid-df.R
@@ -1,20 +1,25 @@
#' Frequency grid data frame
#'
#' `frequency_grid_df()` takes a vector and creates an extended frequency table
#' about it. Internally, this is used as a basis for `frequency_grid_plot()`.
#' @description NOTE: This function is currently experimental and shouldn't be
#' relied upon.
#'
#' `frequency_grid_df()` takes a vector and creates an extended frequency
#' table about it. Internally, this is used as a basis for
#' `frequency_grid_plot()`.
#'
#' @param x A vector.
#' @inheritParams mode_is_trivial
#'
#' @return A data frame with these columns:
#' - `x`: The input vector, with each unique known value repeated to be as
#' frequent as the most frequent one.
#' - `freq` (integer): Hypothetical frequency of each `x` value.
#' - `is_missing` (Boolean): Is the observation absent from the input vector?
#' - `can_be_filled` (Boolean): Are there enough `NA`s so that one of them might
#' - `is_missing` (logical): Is the observation absent from the input vector?
#' - `can_be_filled` (logical): Are there enough `NA`s so that one of them might
#' hypothetically represent the `x` value in question, implying that there
#' would be at least as many observations of that value as the respective
#' frequency (`freq`) indicates?
#' - `is_supermodal` (Boolean): Is the frequency of this value greater than the
#' - `is_supermodal` (logical): Is the frequency of this value greater than the
#' maximum frequency among known values?
#'
#' @section Limitations: See the limitations section of `frequency_grid_plot()`.
@@ -25,7 +30,7 @@
#' x <- c("a", "a", "a", "b", "b", "c", NA, NA, NA, NA, NA)
#' frequency_grid_df(x)

frequency_grid_df <- function(x) {
frequency_grid_df <- function(x, max_unique = NULL) {
n_x <- length(x)
x <- sort(x[!is.na(x)])
n_na <- n_x - length(x)
@@ -49,12 +54,31 @@ frequency_grid_df <- function(x) {
}
unique_x <- unique(x)
freq_max_known <- max(freq)

# For the `max_unique` argument:
max_unique <- handle_max_unique_input(
x, max_unique, length(unique_x), n_na, "frequency_grid_df"
)

n_slots_empty <- freq_max_known * length(unique_x) - length(x)
n_na_surplus <- n_na - n_slots_empty
freq_diff <- max(0L, ceiling(n_na_surplus / length(unique_x)))
if (is.na(freq_diff)) {
freq_diff <- 0L

# TODO: Fix this whole if-else block! Maybe put `freq_diff` to the end; it's
# the difference between `freq_max_known` and the "supermode".
freq_diff <- 0L
if (is.null(max_unique)) {
# max_unique <- max_unique %/% freq_max_known
} else if (max_unique == length(unique_x)) {
# START of the `max_unique = "known"`-assumption-specific part:
freq_diff <- max(0L, ceiling(n_na_surplus / length(unique_x)))
if (is.na(freq_diff)) {
freq_diff <- 0L
}
# END of the `max_unique = "known"`-assumption-specific part
} else if (max_unique > length(unique_x)) {
n_slots_empty_new_vals <- count_slots_empty_new_vals(n_na, freq_max)
}

freq_max <- freq_max_known + freq_diff
n_final <- freq_max * length(unique_x)
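
The grid-size arithmetic in this hunk can be checked by hand against the example vector from the roxygen docs. The following is a minimal sketch only: `grid_dims()` is a hypothetical helper, not part of moder, and it assumes that `freq` holds the per-value counts of the known values (computed here with `table()`, since the original computation is not shown in the diff). It covers only the `max_unique == length(unique_x)` branch.

```r
# Hypothetical helper (not a moder function): reproduces only the
# arithmetic that determines the size of the frequency grid.
grid_dims <- function(x) {
  n_na <- sum(is.na(x))
  x_known <- x[!is.na(x)]

  # Assumption: `freq` is the count of each unique known value.
  freq <- as.integer(table(x_known))
  n_unique <- length(freq)
  freq_max_known <- max(freq)

  # Slots that must be filled so every value is as frequent as the mode:
  n_slots_empty <- freq_max_known * n_unique - length(x_known)

  # `NA`s left over after all of those slots are filled:
  n_na_surplus <- n_na - n_slots_empty

  # Extra rows on top of the known maximum, never negative:
  freq_diff <- max(0L, as.integer(ceiling(n_na_surplus / n_unique)))

  freq_max <- freq_max_known + freq_diff
  c(freq_max = freq_max, n_final = freq_max * n_unique)
}

x <- c("a", "a", "a", "b", "b", "c", NA, NA, NA, NA, NA)
grid_dims(x)
```

For this vector, three slots are empty (3 * 3 - 6), two `NA`s remain as surplus, and `ceiling(2 / 3)` adds one row, so the grid has 4 * 3 = 12 cells. This agrees with the 12-row expectation in the test file.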

9 changes: 6 additions & 3 deletions R/frequency-grid-plot.R
@@ -1,8 +1,11 @@
#' Frequency grid ggplot
#'
#' @description Call `frequency_grid_plot()` to visualize the absolute
#' frequencies of values in a vector. Each observation is plotted distinctly,
#' resulting in a hybrid of a histogram and a scatterplot.
#' @description NOTE: This function is currently experimental and shouldn't be
#' relied upon.
#'
#' Call `frequency_grid_plot()` to visualize the absolute frequencies of
#' values in a vector. Each observation is plotted distinctly, resulting in a
#' hybrid of a histogram and a scatterplot.
#'
#' - Boxes are known values.
#' - Circles with `NA` labels are missing values.
1 change: 1 addition & 0 deletions _pkgdown.yml
@@ -22,6 +22,7 @@ articles:
- missings
- metadata
- performance
- frequency-grids
reference:
- title: Actual modes
- contents:
22 changes: 16 additions & 6 deletions man/frequency_grid_df.Rd

Some generated files are not rendered by default.

9 changes: 6 additions & 3 deletions man/frequency_grid_plot.Rd

Some generated files are not rendered by default.

52 changes: 26 additions & 26 deletions tests/testthat/test-frequency-grid-df.R
@@ -1,26 +1,26 @@

# Test vectors:
x1 <- c("a", "a", "a", "b", "b", "c", rep(NA, times = 5))
x2 <- c(1, 1, 2, 3, rep(NA, times = 6))


test_that("`frequency_grid_df()` works with `x1`", {
expect_equal(frequency_grid_df(x1), structure(list(
x = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
freq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
is_missing = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
can_be_filled = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
is_supermodal = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
), class = "data.frame", row.names = c(NA, -12L)))
})

test_that("`frequency_grid_df()` works with `x2`", {
expect_equal(frequency_grid_df(x2), structure(list(
x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
freq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
is_missing = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
can_be_filled = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
is_supermodal = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
), class = "data.frame", row.names = c(NA, -12L)))
})

#
# # Test vectors:
# x1 <- c("a", "a", "a", "b", "b", "c", rep(NA, times = 5))
# x2 <- c(1, 1, 2, 3, rep(NA, times = 6))
#
#
# test_that("`frequency_grid_df()` works with `x1`", {
# expect_equal(frequency_grid_df(x1), structure(list(
# x = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
# freq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
# is_missing = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
# can_be_filled = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
# is_supermodal = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
# ), class = "data.frame", row.names = c(NA, -12L)))
# })
#
# test_that("`frequency_grid_df()` works with `x2`", {
# expect_equal(frequency_grid_df(x2), structure(list(
# x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
# freq = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
# is_missing = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
# can_be_filled = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
# is_supermodal = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
# ), class = "data.frame", row.names = c(NA, -12L)))
# })
#
33 changes: 33 additions & 0 deletions vignettes/frequency-grids.Rmd
@@ -0,0 +1,33 @@
---
title: "Frequency grids"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Frequency grids}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

```{r setup}
library(moder)
```

NOTE: This is not (yet) a proper documentation vignette.

TODO: Either elaborate this into a real vignette or turn it into a (final) section of the metadata vignette!

The output of moder's metadata functions can be puzzling. Why do they return `NA` for this vector but not for that one? Frequency grids will help you understand.

A frequency grid is a special kind of histogram. It is meant to depict possible ways in which the true values behind missing values may be distributed. As such, it illustrates the rationale of metadata functions such as `mode_count_range()`.

```{r}
# x <- c("a", "a", "a", "b", "b", "c", NA, NA, NA, NA, NA)
# frequency_grid_plot(x)
```
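
While the plot call is commented out, the layout of a frequency grid can be approximated in base R. This is only a sketch under stated assumptions: `sketch_grid()` is a hypothetical helper, not a moder function; it takes the grid height `freq_max` as an argument instead of deriving it; and it builds only the `freq`, `is_missing`, and `is_supermodal` columns, following their definitions in the documentation of `frequency_grid_df()`. Unlike that function, it always returns `x` as character.

```r
# Hypothetical helper (not a moder function): lays out one row per
# combination of known value and hypothetical frequency.
sketch_grid <- function(x, freq_max) {
  x_known <- sort(x[!is.na(x)])
  counts <- table(x_known)
  vals <- names(counts)  # unique known values, as character

  grid <- data.frame(
    x = rep(vals, each = freq_max),
    freq = rep(seq_len(freq_max), times = length(vals))
  )

  # How often each row's value was actually observed:
  n_known <- as.integer(counts[grid$x])

  # A cell is missing if it lies above the observed count of its value:
  grid$is_missing <- grid$freq > n_known

  # A cell is supermodal if it lies above the maximum known frequency:
  grid$is_supermodal <- grid$freq > max(counts)

  grid
}

x <- c("a", "a", "a", "b", "b", "c", NA, NA, NA, NA, NA)
sketch_grid(x, freq_max = 4L)
```

Each known value gets one column of cells in the grid, stacked from `freq = 1` up to `freq_max`. The sketch leaves out `can_be_filled`, which depends on how the `NA`s are allocated across values.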
