-
Notifications
You must be signed in to change notification settings - Fork 987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix type coercion in bmerge #6603
base: master
Are you sure you want to change the base?
Conversation
Generated via commit 8deaa6a Download link for the artifact containing the test results: ↓ atime-results.zip
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6603 +/- ##
=======================================
Coverage 98.60% 98.61%
=======================================
Files 79 79
Lines 14516 14542 +26
=======================================
+ Hits 14314 14340 +26
Misses 202 202 ☔ View full report in Codecov by Sentry. |
@MichaelChirico is it worth to make the (length(unique(icols))!=length(icols) || length(unique(xcols))!=length(xcols)) ensuring that either |
Sorry, I just saw that this is a CRAN requirement, checking now. |
R/bmerge.R
Outdated
@@ -71,7 +71,13 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos | |||
stopf("Incompatible join types: %s (%s) and %s (%s). Factor columns must join to factor or character columns.", xname, xclass, iname, iclass) | |||
} | |||
if (xclass == iclass) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: {x,i}class
seem like poor name choices here, given that they typically come from typeof()
(and class()
is never called, except maybe internally by inherits()
).
{x,i}type
would be more appropriate, but {x,i}_merge_type
might be even better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another aside: It looks like we do x[[xc]]
and i[[ic]]
extraction at least one and up to several times. Should we just store the extracted column locally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside:
{x,i}class
seem like poor name choices here, given that they typically come fromtypeof()
(andclass()
is never called, except maybe internally byinherits()
).
Yes but didn't want to change more than needed for the quick fix
Definitely thanks for triaging a fix here. I think we both agree it's pretty hack-y & ideally not needed. I'm not sure I fully understand the bug yet, but as stated in OP we're doing IMO |
Unfortunately this "contract" does not hold. I have a Windows dev version installed here and get the following > typeof(.Date(0L))
[1] "integer"
> typeof(as.Date(.Date(0L)))
[1] "integer" |
Thanks, also confirmed that on 4.4.1 |
This is strange as Kurt mentioned there wasn't an intention to do this but I figure it would have been fixed by now if that were the case. |
Update this fix now can convert into one direction from integer to double # this works
x = data.table(a=1L)
y = data.table(c=1L, d=2)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]
# this still needs to fixed
x = data.table(a=1)
y = data.table(c=1, d=2L)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)] |
Should I consider this PR in-progress for now? The diff has grown enough that it would help to add a brief overview of the changes to the PR description to orient reading |
R/bmerge.R
Outdated
newvalue = chmatch(i[[ic]], levels(x[[xc]]), nomatch=0L) | ||
if (anyNA(i[[ic]])) newvalue[is.na(i[[ic]])] = NA_integer_ # NA_character_ should match to NA in factor, #3809 | ||
set(i, j=ic, value=newvalue) | ||
if (nrow(i)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't read carefully yet, but given the description of the problem as I understand it now, I had expected something like this to be needed:
(1) A pass over the join columns to determine the "join type" of each column
(2) A second pass to actually do the coercions as needed
Does that make sense to you? If so, each step could be a helper function which would improve readability & modularity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory yes, but we actually only have to take a look the case of mixing double
and integer
columns, since this is the only case where we previously made bmerge
smart enough to cater for different types (besides character
and factors
)
Just received the "deadline" on this from Kurt:
|
@MichaelChirico the PR is ready for review. What I'm unsure about is if we always want to try to copy at |
R/bmerge.R
Outdated
xclass=="factor" || iclass=="factor") { | ||
cfl = c("character", "logical", "factor") | ||
if (x_merge_type %chin% cfl || i_merge_type %chin% cfl) { | ||
msg = "Coercing all-NA %s column %s to type %s to match type of %s.\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for translations we'll want to do
msg = gettext(...)
...
catf(fmt, domain=NULL)
The run-time difference is 0, but we rely on static analysis to pull out the messages eligible for translation, which my suggestion will facilitate.
@@ -34,86 +34,120 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos | |||
ans | |||
} | |||
|
|||
cast_with_atts = function(x, as.f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these helpers be declared in the {data.table} namespace instead (not exported)?
if bmerge()
gets called dozens/hundreds of times, these helpers would have to be re-created each time. my approach is usually to put such helpers outside function bodies unless it's strongly beneficial for them to inherit directly from the function environment (not the case here AFAICT).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if it's about matching how getClass
(now mergeType
) is handled, I would move that helper out too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see coerce_col
inherits verbose
.
Suggestion: verbose_msg = NULL
in the signature, then if (!is.null(verbose_msg))
to replace if (verbose)
I found myself wondering `ic`... "`i` character? `i` class?". Simpler to encode more info in the name
if (anyNA(i[[ic]]) && allNA(i[[ic]])) { | ||
if (verbose) catf("Coercing all-NA %s (%s) to type %s to match type of %s.\n", iname, iclass, xclass, xname) | ||
set(i, j=ic, value=match.fun(paste0("as.", xclass))(i[[ic]])) | ||
cfl = c("character", "logical", "factor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weren't factor
columns handled in the previous branch?
if (xclass == iclass) { | ||
if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, xclass, xname) | ||
if (x_merge_type == i_merge_type) { | ||
if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, x_merge_type, xname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside, probably for follow-up: it's a bit odd for this message to follow the x_merge_type == "factor" || i_merge_type == "factor"
branch, since we get "no coercion needed" after coercion was already possibly done...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, nevermind, it's not possible to encounter this if we enter the previous branch -- it's always next
'd. Still, it might make sense to move this step first as logically it's the simplest case & improves the flow of the code:
(1) if i_merge_type == x_merge_type
, done, just move to next step
(2) now start checking what coercion is needed or it's impossible
if (wclass=="integer" || (wclass=="double" && !isReallyReal(w[[wc]]))) { | ||
if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which contains no fractions)" else "", nm[2L]) | ||
set(w, j=wc, value=bit64::as.integer64(w[[wc]])) | ||
} else stopf("Incompatible join types: %s is type integer64 but %s is type double and contains fractions", nm[2L], nm[1L]) | ||
} else { | ||
# just integer and double left | ||
if (iclass=="double") { | ||
if (!isReallyReal(i[[ic]])) { | ||
ic_idx = which(icol == icols) # check if on is joined on multiple conditions, #6602 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On first read I think "this can be doing a lot of duplicate work since it's basically for (icol in icols) icol==icols
", but actually it only applies to the subset icols[icols != xcols & i_merge_type %in% c("integer", "double", "complex")]
.
length(icols)
will not really be large in general (almost always <10
), the subset there is even smaller, I doubt there's any real concern about the "lost efficiency" here unless you see something obvious (my first instinct is duplicated()
but that's not quite right, everything else that comes to mind harms readability).
but it might be worth calling out the intentional redundancy in a comment.
IINM this also closes #6554 (cc @rikivillalba) |
set(x, j=xc, value=as.double(x[[xc]])) | ||
if (length(ic_idx)>1L) { | ||
xc_idx = xcols[ic_idx] | ||
for (xb in xc_idx[which(vapply_1c(x[0L, xc_idx, with=FALSE], mergeType) == "double")]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: vapply_1c(.shallow(x, xc_idx), mergeType)
? I'm wary of [
overhead even if the resulting table is tiny... I suspect it's easier to digest the .shallow()
approach too at a glance
ditto below
@@ -15727,7 +15727,8 @@ DT = data.table(z = 1i) | |||
test(2069.33, DT[DT, on = 'z'], error = "Type 'complex' is not supported for joining/merging") | |||
|
|||
# forder verbose message when !isReallyReal Date, #1738 | |||
DT = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days"), 20, replace=TRUE)) | |||
date_dbl = as.Date(as.double(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days")), origin="1970-01-01") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need to edit this test? I'd've thought one effect of this PR is to make the old test keep working as intended...
# coerce Dates to double if join on multiple columns, #6602 | ||
x = data.table(a=1L) | ||
y = data.table(c=1L, d=1) | ||
test(2297.01, options=c(datatable.verbose=TRUE), y[x, on=.(c == a, d == a)], data.table(c=1L, d=1L), output="Coercing .*c to type double") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: WDYT about using verbose=TRUE
argument to [
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a
also gets coerced, right? should that be reflected in the output=
test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, I might expect d
to be the coerced column here...)
test(2297.02, options=c(datatable.verbose=TRUE), y[x, on=.(d == a, c == a)], data.table(c=1L, d=1L), output="Coercing .*c to type double") | ||
x = data.table(a=1) | ||
y = data.table(c=1, d=1L) | ||
test(2297.03, options=c(datatable.verbose=TRUE), y[x, on=.(c == a, d == a)], data.table(c=1L, d=1L), output="Coercing double column x.c (which contains no fractions) to type integer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto about a
's coercion appearing in the output
# real double | ||
x = data.table(a=1) | ||
y = data.table(c=1.5, d=1L) | ||
test(2297.21, y[x, on=.(c == a, d == a)], data.table(c=1, d=1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help to include a non-join column in these tests, here at first glance it's surprising, but this is the right-join-by-default effect at play (i.e. there'd be no rows if nomatch=NULL
).
Closes #6602
Base does not encounter this problem since one join column can not be in multiple join conditions.
What do we do:
In bmerge we check if the types to merge x and i in
x[i]
are compatible. This also does some type coercion for example when the responsible columns are integer and double.If a single column of i is in multiple conditions with columns of x, all of the to be joined columns need to coerce, this is why we introduce the extra check
length(ic_idx)>1L
and then iterate over the corresponding columns of x.