Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

data_arrange() gets a by argument #564

Closed
wants to merge 9 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.13.0.13
Version: 0.13.0.14
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
Expand Down
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,9 @@ CHANGES
* `data_modify()` now recognizes `n()`, for example to create an index for data groups
with `1:n()` (#535).

* `data_arrange()` gets a `by` argument, to arrange data grouped by values or
levels of certain variables.

BUG FIXES

* `describe_distribution()` no longer errors if the sample was too sparse to compute
Expand Down
92 changes: 65 additions & 27 deletions R/data_arrange.R
Original file line number Diff line number Diff line change
@@ -1,15 +1,22 @@
#' Arrange rows by column values
#' @title Arrange rows by column values
#' @name data_arrange
#'
#' @description
#' `data_arrange()` orders the rows of a data frame by the values of selected
#' columns.
#'
#' @param data A data frame, or an object that can be coerced to a data frame.
#' @param data A (grouped) data frame, or an object that can be coerced to a
#' data frame.
#' @param select Character vector of column names. Use a dash just before column
#' name to arrange in decreasing order, for example `"-x1"`.
#' name to arrange in decreasing order, for example `"-x1"`.
#' @param by Optional character string, indicating the names of one or more
#' variables in the data frame. If supplied, data rows will be sorted within
#' each group.
#' @param safe Do not throw an error if one of the variables specified doesn't
#' exist.
#' exist.
#' @param ... Currently not used.
#'
#' @return A data frame.
#' @return A data frame, where rows are sorted according to the selected columns.
#'
#' @examples
#'
Expand All @@ -19,17 +26,68 @@
#' # Arrange in decreasing order
#' data_arrange(head(mtcars), "-carb")
#'
#' # compare: arrange variables vs. arrange variable within groups
#' set.seed(123)
#' sample_rows <- sample(seq_len(nrow(iris)), 10, replace = TRUE)
#' x <- iris[sample_rows, c("Sepal.Width", "Species")]
#' data_arrange(x, c("Sepal.Width", "Species"))
#' data_arrange(x, "Sepal.Width", by = "Species")
#'
#' # Throw an error if one of the variables specified doesn't exist
#' try(data_arrange(head(mtcars), c("gear", "foo"), safe = FALSE))
#' @export
data_arrange <- function(data, select = NULL, safe = TRUE) {
data_arrange <- function(data, ...) {
UseMethod("data_arrange")
}


#' @rdname data_arrange
#' @export
data_arrange.default <- function(data, select = NULL, by = NULL, safe = TRUE, ...) {
if (!is.null(by)) {

Check warning on line 47 in R/data_arrange.R

View workflow job for this annotation

GitHub Actions / lint / lint

file=R/data_arrange.R,line=47,col=7,[if_not_else_linter] Prefer `if (A) x else y` to the less-readable `if (!A) y else x` in a simple if/else statement.

Check warning on line 47 in R/data_arrange.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=R/data_arrange.R,line=47,col=7,[if_not_else_linter] Prefer `if (A) x else y` to the less-readable `if (!A) y else x` in a simple if/else statement.
# check "by" argument for valid names
.sanitize_by_argument(data, by)
split_data <- split(data, data[by], drop = TRUE)
# we remove names, else rownames are not correct - these would be prefixed
# by the values for each list-element
names(split_data) <- NULL
out <- lapply(split_data, function(x) {

Check warning on line 54 in R/data_arrange.R

View workflow job for this annotation

GitHub Actions / lint / lint

file=R/data_arrange.R,line=54,col=31,[unnecessary_lambda_linter] Pass .data_arrange directly as a symbol to lapply() instead of wrapping it in an unnecessary anonymous function. For example, prefer lapply(DF, sum) to lapply(DF, function(x) sum(x)).

Check warning on line 54 in R/data_arrange.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=R/data_arrange.R,line=54,col=31,[unnecessary_lambda_linter] Pass .data_arrange directly as a symbol to lapply() instead of wrapping it in an unnecessary anonymous function. For example, prefer lapply(DF, sum) to lapply(DF, function(x) sum(x)).
.data_arrange(x, select = select, safe = safe)
})
out <- do.call(rbind, out)
# remove rownames if original data had none
if (!insight::object_has_rownames(data)) {
rownames(out) <- NULL
}
} else {
out <- .data_arrange(data, select = select, safe = safe)
}
out
}


#' @export
data_arrange.default <- function(data, select = NULL, safe = TRUE) {
data_arrange.grouped_df <- function(data, select = NULL, by = NULL, safe = TRUE, ...) {
# extract group variables
grps <- attr(data, "groups", exact = TRUE)
group_variables <- data_remove(grps, ".rows")
# if "by" is not supplied, use group variables
if (is.null(by)) {
by <- colnames(group_variables)
}
# remember attributes
info <- attributes(data)
out <- data_arrange.default(data = data, select = select, by = by, safe = safe, ...)

# set back class, so data frame still works with dplyr
attributes(out) <- utils::modifyList(info, attributes(out))
out
}


# utilities ----------------------

.data_arrange <- function(data, select = NULL, safe = TRUE) {
if (is.null(select) || length(select) == 0) {
return(data)
}
Expand Down Expand Up @@ -98,23 +156,3 @@

out
}



#' @export
data_arrange.grouped_df <- function(data, select = NULL, safe = TRUE) {
grps <- attr(data, "groups", exact = TRUE)
grps <- grps[[".rows"]]

out <- lapply(grps, function(x) {
data_arrange.default(data[x, ], select = select, safe = safe)
})

out <- do.call(rbind, out)

if (!insight::object_has_rownames(data)) {
rownames(out) <- NULL
}

out
}
46 changes: 26 additions & 20 deletions R/data_summary.R
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,9 @@
#' data frame or a matrix.
#'
#' @param x A (grouped) data frame.
#' @param by Optional character string, indicating the name of a variable in `x`.
#' If supplied, the data will be split by this variable and summary statistics
#' will be computed for each group.
#' @param by Optional character string, indicating the names of one or more
#' variables in the data frame. If supplied, the data will be split by these
#' variables and summary statistics will be computed for each group.
#' @param remove_na Logical. If `TRUE`, missing values are omitted from the
#' grouping variable. If `FALSE` (default), missing values are included as a
#' level in the grouping variable.
Expand Down Expand Up @@ -84,23 +84,8 @@ data_summary.data.frame <- function(x, ..., by = NULL, remove_na = FALSE) {
out <- data.frame(summarise)
colnames(out) <- vapply(summarise, names, character(1))
} else {
# sanity check - is "by" a character string?
if (!is.character(by)) {
insight::format_error("Argument `by` must be a character string indicating the name of variables in the data.")
}
# is "by" in the data?
if (!all(by %in% colnames(x))) {
by_not_found <- by[!by %in% colnames(x)]
insight::format_error(
paste0(
"Variable",
ifelse(length(by_not_found) > 1, "s ", " "),
text_concatenate(by_not_found, enclose = "\""),
" not found in the data."
),
.misspelled_string(colnames(x), by_not_found, "Possibly misspelled?")
)
}
# check "by" argument for valid names
.sanitize_by_argument(x, by)
# split data, add NA levels, if requested
l <- lapply(x[by], function(i) {
if (remove_na || !anyNA(i)) {
Expand Down Expand Up @@ -207,6 +192,27 @@ data_summary.grouped_df <- function(x, ..., by = NULL, remove_na = FALSE) {
}


.sanitize_by_argument <- function(x, by) {
# sanity check - is "by" a character string?
if (!is.character(by)) {
insight::format_error("Argument `by` must be a character string indicating the name of variables in the data.")
}
# is "by" in the data?
if (!all(by %in% colnames(x))) {
by_not_found <- by[!by %in% colnames(x)]
insight::format_error(
paste0(
"Variable",
ifelse(length(by_not_found) > 1, "s ", " "),
text_concatenate(by_not_found, enclose = "\""),
" not found in the data."
),
.misspelled_string(colnames(x), by_not_found, "Possibly misspelled?")
)
}
}


# methods ----------------------------------------------------------------------

#' @export
Expand Down
23 changes: 20 additions & 3 deletions man/data_arrange.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions man/data_summary.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

52 changes: 52 additions & 0 deletions tests/testthat/test-data_arrange.R
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,7 @@

test_that("data_arrange works with grouped df", {
set.seed(123)
x <- mtcars[sample(seq_len(nrow(mtcars)), 10, replace = TRUE), c("cyl", "mpg")]

Check warning on line 78 in tests/testthat/test-data_arrange.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=tests/testthat/test-data_arrange.R,line=78,col=15,[sample_int_linter] sample.int(n, m, ...) is preferable to sample(seq_len(n), m, ...).
g <- data_group(x, cyl)

expected <- data.frame(
Expand All @@ -97,6 +97,58 @@
)
})

test_that("data_arrange works with by", {
set.seed(123)
x <- mtcars[sample(seq_len(nrow(mtcars)), 10, replace = TRUE), c("cyl", "mpg")]

Check warning on line 102 in tests/testthat/test-data_arrange.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=tests/testthat/test-data_arrange.R,line=102,col=15,[sample_int_linter] sample.int(n, m, ...) is preferable to sample(seq_len(n), m, ...).

expected <- data.frame(
cyl = c(4, 4, 4, 6, 6, 8, 8, 8, 8, 8),
mpg = c(22.8, 30.4, 32.4, 17.8, 19.2, 10.4, 15, 15.2, 15.5, 18.7)
)
rownames(expected) <- c(
"Datsun 710", "Honda Civic", "Fiat 128", "Merc 280C", "Merc 280",
"Cadillac Fleetwood", "Maserati Bora", "Merc 450SLC", "Dodge Challenger",
"Hornet Sportabout"
)

expect_identical(
data_arrange(x, "mpg", by = "cyl"),
expected,
ignore_attr = TRUE
)

# works for df without rownames
set.seed(123)
x <- iris[sample(seq_len(nrow(iris)), 10, replace = TRUE), c("Sepal.Width", "Species")]

Check warning on line 122 in tests/testthat/test-data_arrange.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=tests/testthat/test-data_arrange.R,line=122,col=13,[sample_int_linter] sample.int(n, m, ...) is preferable to sample(seq_len(n), m, ...).
rownames(x) <- NULL

expected <- data.frame(
Sepal.Width = c(3, 3, 3.2, 3.3, 2.5, 2.6, 2.6, 3, 3.8, 3.8),
Species = structure(
c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
levels = c("setosa", "versicolor", "virginica"),
class = "factor"
)
)
rownames(expected) <- NULL

expect_identical(
data_arrange(x, "Sepal.Width", by = "Species"),
expected,
ignore_attr = TRUE
)

# errors
expect_error(
data_arrange(mtcars, "mpg", by = "cxl"),
regex = "Variable \"cxl\" not found"
)
expect_error(
data_arrange(mtcars, "mpg", by = 2),
regex = "must be a character string"
)
})

test_that("data_arrange works with NA", {
# without groups

Expand Down
Loading