Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement data_seek() #458

Merged
merged 19 commits into from
Sep 12, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.8.0.12
Version: 0.8.0.13
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
Expand Down
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,7 @@ S3method(normalize,matrix)
S3method(normalize,numeric)
S3method(plot,visualisation_recipe)
S3method(print,data_codebook)
S3method(print,data_seek)
S3method(print,dw_data_peek)
S3method(print,dw_data_tabulate)
S3method(print,dw_data_tabulates)
Expand Down Expand Up @@ -236,6 +237,7 @@ export(data_rename_rows)
export(data_reorder)
export(data_restoretype)
export(data_rotate)
export(data_seek)
export(data_select)
export(data_separate)
export(data_tabulate)
Expand Down
5 changes: 5 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@ NEW FUNCTIONS
* `means_by_group()`, to compute mean values of variables, grouped by levels
of specified factors.

* `seek_variables()`, to seek for variables in a data frame, based on their
column names, variables labels, value labels or factor levels. Searching for
labels only works for "labelled" data, i.e. when variables have a `label` or
`labels` attribute.

CHANGES

* `recode_into()` gains an `overwrite` argument to skip overwriting already
Expand Down
168 changes: 168 additions & 0 deletions R/data_seek.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
#' @title Find variables by their names, variable or value labels
#' @name data_seek
#'
#' @description This functions seeks variables in a data frame, based on patterns
#' that either match the variable name (column name), variable labels, value labels
#' or factor levels. Matching variable and value labels only works for "labelled"
#' data, i.e. when the variables either have a `label` attribute or `labels`
#' attribute.
#'
#' `data_seek()` is particular useful for larger data frames with labelled
#' data - finding the correct variable name can be a challenge. This function
#' helps to find the required variables, when only certain patterns of variable
#' names or labels are known.
#'
#' @param data A data frame.
#' @param pattern Character string (regular expression) to be matched in `data`.
#' May also be a character vector of length > 1. `pattern` is searched for in
#' column names, variable label and value labels attributes, or factor levels of
#' variables in `data`.
#' @param seek Character vector, indicating where `pattern` is sought. Use one
#' or more of the following options:
#'
#' - `"names"`: Searches in column names. `"column_names"` and `"columns"` are
#' aliases for `"names"`.
#' - `"labels"`: Searches in variable labels. Only applies when a `label` attribute
#' is set for a variable.
#' - `"values"`: Searches in value labels or factor levels. Only applies when a
#' `labels` attribute is set for a variable, or if a variable is a factor.
#' `"levels"` is an alias for `"values"`.
#' - `"all"`: Searches in all of the above.
#' @param fuzzy Logical. If `TRUE`, "fuzzy matching" (partial and close distance
#' matching) will be used to find `pattern`.
#'
#' @return A data frame with three columns: the column index, the column name
#' and - if available - the variable label of all matched variables in `data`.
#'
#' @examples
#' # seek variables with "Length" in variable name or labels
#' data_seek(iris, "Length")
#'
#' # seek variables with "dependency" in names or labels
#' # column "e42dep" has a label-attribute "elder's dependency"
#' data(efc)
#' data_seek(efc, "dependency")
#'
#' # "female" only appears as value label attribute - default search is in
#' # variable names and labels only, so no match
#' data_seek(efc, "female")
#' # when we seek in all sources, we find the variable "e16sex"
#' data_seek(efc, "female", seek = "all")
#'
#' # typo, no match
#' data_seek(iris, "Lenght")
#' # typo, fuzzy match
#' data_seek(iris, "Lenght", fuzzy = TRUE)
#' @export
data_seek <- function(data, pattern, seek = c("names", "labels"), fuzzy = FALSE) {
# check valid args
if (!is.data.frame(data)) {
insight::format_error("`data` must be a data frame.")
}

# check valid args
seek <- intersect(seek, c("names", "labels", "values", "levels", "column_names", "columns", "all"))
if (is.null(seek) || !length(seek)) {
insight::format_error("`seek` must be one of \"names\", \"labels\", \"values\", a combination of these options, or \"all\".")

Check warning on line 66 in R/data_seek.R

View workflow job for this annotation

GitHub Actions / lint-changed-files / lint-changed-files

file=R/data_seek.R,line=66,col=121,[line_length_linter] Lines should not be more than 120 characters. This line is 129 characters.
}

pos1 <- pos2 <- pos3 <- NULL

pos <- unlist(lapply(pattern, function(search_pattern) {
# search in variable names?
if (any(seek %in% c("names", "columns", "column_names", "all"))) {
pos1 <- which(grepl(search_pattern, colnames(data)))
# find in near distance?
if (fuzzy) {
pos1 <- c(pos1, .fuzzy_grep(x = colnames(data), pattern = search_pattern))
}
}

# search in variable labels?
if (any(seek %in% c("labels", "all"))) {
labels <- insight::compact_character(unlist(lapply(data, attr, which = "label", exact = TRUE)))
if (!is.null(labels) && length(labels)) {
found <- grepl(search_pattern, labels)
pos2 <- match(names(labels)[found], colnames(data))
# find in near distanc?
if (fuzzy) {
found <- .fuzzy_grep(x = labels, pattern = search_pattern)
if (length(found)) {
pos2 <- c(pos2, match(names(labels)[found], colnames(data)))
}
}
}
}

# search for pattern in value labels or levels?
if (any(seek %in% c("values", "levels", "all"))) {
values <- insight::compact_list(lapply(data, function(i) {
l <- attr(i, "labels", exact = TRUE)
if (is.null(l) && is.factor(i)) {
levels(i)
} else {
names(l)
}
}))
if (!is.null(values) && length(values)) {
found <- vapply(values, function(i) any(grepl(search_pattern, i)), logical(1))
pos3 <- match(names(found)[found], colnames(data))
# find in near distance
if (fuzzy) {
found <- vapply(
values,
function(i) {
length(.fuzzy_grep(x = i, pattern = search_pattern)) > 0
},
logical(1)
)
if (any(found)) {
pos3 <- c(pos3, match(names(found)[found], colnames(data)))
}
}
}
}
c(pos1, pos2, pos3)
}))

# clean up
pos <- unique(pos)

# variable labels of matching variables
labels <- vapply(
colnames(data[pos]),
function(i) {
l <- attr(data[[i]], "label", exact = TRUE)
if (is.null(l)) {
i
} else {
l
}
},
character(1)
)

out <- data.frame(
index = pos,
column = colnames(data)[pos],
labels = labels,
stringsAsFactors = FALSE
)
# no row names
rownames(out) <- NULL

class(out) <- c("data_seek", "data.frame")
out
}


# methods ---------------------------------------------------------------------

#' @export
print.data_seek <- function(x, ...) {
if (nrow(x) == 0) {
cat("No matches found.\n")
} else {
cat(insight::export_table(x, ...))
}
}
1 change: 1 addition & 0 deletions _pkgdown.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,7 @@ reference:
- data_codebook
- data_tabulate
- data_peek
- data_seek
- means_by_group
- contains("distribution")
- kurtosis
Expand Down
2 changes: 1 addition & 1 deletion man/data_peek.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/data_read.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

68 changes: 68 additions & 0 deletions man/data_seek.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 10 additions & 3 deletions man/describe_distribution.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/recode_into.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

17 changes: 17 additions & 0 deletions tests/testthat/_snaps/data_seek.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# data_seek - print

Code
data_seek(iris, "Length")
Output
index | column | labels
-----------------------------------
1 | Sepal.Length | Sepal.Length
3 | Petal.Length | Petal.Length

---

Code
data_seek(iris, "abc")
Output
No matches found.

Loading
Loading