-
-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Roll up summaries of nonscalar variables #152
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #152 +/- ##
==========================================
- Coverage 93.89% 90.58% -3.32%
==========================================
Files 41 43 +2
Lines 2654 2751 +97
==========================================
Hits 2492 2492
- Misses 162 259 +97
Continue to review full report at Codecov.
|
Having a think about how to handle
|
NULL | ||
|
||
parse_variable_indices <- function(x){ | ||
vars_indices <- strsplit(x, "(\\[|\\])") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You use a number of regexes in a row here to parse x
. Can you combine them into a single regex and use capture groups to pull out the relevant pieces? That should be faster, and will also be more flexible if at some point in the future we want to provide people with options to parse different formats of indices (like different separators or whatever).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand on this: here's a regular expression that matches well-formed indexed variable names. Starting/ending the regex with "^" and "$" guarantees it must match the whole name, so failures imply an ill-formed name:
x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^([^][]+)(\\[([^][]+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a" "[1]" "1"
[[2]]
character(0)
[[3]]
[1] "c[1,2]" "c" "[1,2]" "1,2"
[[4]]
character(0)
[[5]]
character(0)
[[6]]
[1] "f" "f" "" ""
Then you can check for ill-formed names with something like this:
if (any(lengths(vars_matches) == 0)) {
stop_no_call("insert informative error")
}
Variables without indices will have the fourth element of their corresponding entry in vars_matches
be empty, and variables with indices will have it be the indices which can then be split.
This regex is fairly strict, and I'm not sure how strict we want to be --- e.g., do we want names like "x]"
to fail, or do we want them just to be parsed as non-indexed variables? I can see an argument for a looser version of this parsing that rarely (or maybe never) fails but instead lets weird-looking names go by and treats them as non-indexed. E.g. something like this:
x <- c("a[1]", "b[", "c[1,2]", "d[]", "e]", "f")
m <- regexec("^(.+?)(\\[(.+)\\])?$", x)
vars_matches <- regmatches(x, m)
vars_matches
[[1]]
[1] "a[1]" "a" "[1]" "1"
[[2]]
[1] "b[" "b[" "" ""
[[3]]
[1] "c[1,2]" "c" "[1,2]" "1,2"
[[4]]
[1] "d[]" "d[]" "" ""
[[5]]
[1] "e]" "e]" "" ""
[[6]]
[1] "f" "f" "" ""
i.e. in this example "b["
, "d[]"
, and "e]"
all get treated as weird-looking variable names.
This is also a bit more flexible in that it would be pretty easy to allow people to pass in different choices for the opening/closing bracket and update the regex accordingly.
# Variables with brackets are given dimensionality 1, even if they contain just one element. | ||
dimensionality_elementwise <- sapply(vars_indices, function(x){ | ||
if (length(x) == 2) { | ||
out <- length(strsplit(x[2], ",")[[1]]) # number of commas plus one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not generate the dimensionality_elementwise
variable later, after you have parsed the variable indices? By generating it here you have to parse the indices twice; generating it later you could parse them just once.
} | ||
out | ||
}) | ||
variable_indices_info <- lapply(var_names, function (x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have three instances of *apply over the same structure (var_name
) in a row here. Is each element of the intermediate variables (dimensionality[[i]]
, dimensionality_elementwise[[i]]
) only used in corresponding element i
of this last apply? If so I would skip creating these intermediate variables and move the relevant code into the final apply. This would also resolve my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could well be missing a clever trick here, but I think this and the previous comment are not quite straightforward.
Of these three apply statements, the first applies a function over vars_indices
and the second and third apply over var_names
. It would be easy to bring the second inside the third. However, the second apply statement is very cheap and includes a check for inconsistent dimensionality. I thought it might be better to error quickly on inconsistent dimensionality rather than start the relatively expensive variable parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I misunderstood what this was doing.
I think this still could be simplified so that the parsing of variable indices only has to be done once (which should be faster and also have lower maintenance cost, since we wouldn't have to modify that parsing in two places for future features). If you move this line:
indices <- sapply(vars_indices[var_i], `[[`, 2)
Above its containing if/else block, then I think you could also factor out the vectorized strsplit from the line after the above one to get a list of indices, upon which you could then use lengths / etc to get elementwise ndims and check for consistency. Then index parsing only happens once / in one place.
all.x = TRUE, sort = FALSE) | ||
# need to do the sort manually after merge because when sort = TRUE, merge | ||
# sorts factors as if they were strings, and we need factors to be sorted as factors | ||
indices <- indices[do.call(order, as.list(indices[, -ncol(indices), drop = FALSE])),] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though I wrote this sort originally I'm not actually sure this sort is needed, because I think the all_indices
grid might already be sorted in the correct order? worth checking on this, because if so dropping this sort will be a big time saver on variables with lots of indices.
R/parse_variable_indices.R
Outdated
#' $dimensionality the number of dimensions of V. Returns 0 for scalars with no brackets | ||
#' but 1 for `y[1]` even if `y` has no other entries in `x`. | ||
#' | ||
#' $dimensions a vector of the actual dimensions of V, as determined by the number of unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is dimensions
currently used for anything?
Or put another way: is there a situation in which it is clear that someone would ever want both dimensions
and implied_dimensions
at the same time? I asked because I think it could simplify things to just calculate and return the implied dimensions as the canonical "dimensions". Possibly with an option to only calculate the non-implied dimensions if those are desired (in which case the function can skip all the expensive calculations involved in figuring out the implied dimensions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.
In rollup_summary
I thought it might be more useful to report the dimensions
rather than the implied_dimensions
since the former give a better (albeit imperfect) sense of big the rolled-up variable actually is.
TBH, for my own purposes I don't expect dimensions
and implied_dimensions
to ever differ, so I'll definitely follow your lead on which quantity is more useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main use case for both would be if somebody wants to check whether they match. Not sure that's a "real" use case though.
Yeah I'd vote for waiting to see if this ever comes up instead of designing for it in advance.
In rollup_summary I thought it might be more useful to report the dimensions rather than the implied_dimensions since the former give a better (albeit imperfect) sense of big the rolled-up variable actually is.
Good point, I agree. In that case, that function doesn't need any of the stuff that comes with implied_dimensions right? Then I would propose adding a flag to parse_variable_indices that either fills in the implied dimensions or not (with default TRUE
). I'm not sure what to call it -- fill_implied_dimensions
is too long. Maybe just fill
or fill_dim
or fill_indices
or fill_missing
? Then if fill_missing
is TRUE
, dimensions
would be reported with missing indices filled in and otherwise not.
Thanks @jsocolar! I added (hopefully not too many :) ) comments with some suggestions for simplification. It's very exciting to see this getting done, it's gonna be a very useful function (I am probably going to rewrite some tidybayes stuff on top of it :)). A minor point of discussion about the format: currently you have |
Co-authored-by: Matthew Kay <[email protected]>
Co-authored-by: Matthew Kay <[email protected]>
Co-authored-by: Matthew Kay <[email protected]>
Co-authored-by: Matthew Kay <[email protected]>
@mjskay just mentioning that I have some time to jump back on this now. Once I remember where we're at, I'll keep it moving along. |
Wonderful to hear! I think that the variable index parsing aspects of this PR may also end up being related to some of the conversation in #187 about exposing internal functions for flattening / unflattening arrays. I think that work can be done after this PR though, since it's already doing a couple of different things at once. |
See #43
This is not ready for primetime, but I want the authors to be able to commit here if they want, so here it is.
Now has:
parse_variable_indices
function that does the work of parsing the dimensions/indicesStill needs:
parse_variable_indices.R
and, if it looks good, rewritingas_draws_rvars.draws_matrix()
to rely onparse_variable_indices
(see also Coding matrix/vector variables in tidy form #61)