fix type coercion in bmerge #6603

ben-schwen · 2024-11-03T21:05:32Z

Base does not encounter this problem since one join column can not be in multiple join conditions.

What do we do:
In bmerge we check if the types to merge x and i in x[i] are compatible. This also does some type coercion for example when the responsible columns are integer and double.
If a single column of i is in multiple conditions with columns of x, all of the to be joined columns need to coerce, this is why we introduce the extra check length(ic_idx)>1L and then iterate over the corresponding columns of x.

github-actions · 2024-11-03T21:21:06Z

Generated via commit 8deaa6a

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	4 minutes and 38 seconds
Installing different package versions	7 minutes and 54 seconds
Running and plotting the test cases	2 minutes and 31 seconds

codecov · 2024-11-03T21:32:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.61%. Comparing base (6a15f86) to head (8deaa6a).
Report is 10 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6603   +/-   ##
=======================================
  Coverage   98.60%   98.61%           
=======================================
  Files          79       79           
  Lines       14516    14542   +26     
=======================================
+ Hits        14314    14340   +26     
  Misses        202      202

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ben-schwen · 2024-11-03T21:59:14Z

@MichaelChirico is it worth to make the if more specific to this corner case e.g.

(length(unique(icols))!=length(icols) || length(unique(xcols))!=length(xcols))

ensuring that either icols has a double condition or xcols has a double condition?

R/bmerge.R

MichaelChirico · 2024-11-22T21:08:32Z

Sorry, I just saw that this is a CRAN requirement, checking now.

MichaelChirico · 2024-11-22T21:19:39Z

R/bmerge.R

@@ -71,7 +71,13 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
      stopf("Incompatible join types: %s (%s) and %s (%s). Factor columns must join to factor or character columns.", xname, xclass, iname, iclass)
    }
    if (xclass == iclass) {


aside: {x,i}class seem like poor name choices here, given that they typically come from typeof() (and class() is never called, except maybe internally by inherits()).

{x,i}type would be more appropriate, but {x,i}_merge_type might be even better?

another aside: It looks like we do x[[xc]] and i[[ic]] extraction at least one and up to several times. Should we just store the extracted column locally?

aside: {x,i}class seem like poor name choices here, given that they typically come from typeof() (and class() is never called, except maybe internally by inherits()).

Yes but didn't want to change more than needed for the quick fix

MichaelChirico · 2024-11-22T21:32:59Z

Definitely thanks for triaging a fix here. I think we both agree it's pretty hack-y & ideally not needed.

I'm not sure I fully understand the bug yet, but as stated in OP we're doing as.Date().

IMO as.Date() should always coerce to double storage, even if it's possible to get Date objects with integer storage. Therefore I would hold off until (1) R-core confirms that yes, as.Date(x) might return integer sometimes; and/or (2) I get a clearer understanding of the underlying issue here.

ben-schwen · 2024-11-22T21:58:57Z

IMO as.Date() should always coerce to double storage, even if it's possible to get Date objects with integer storage. Therefore I would hold off until (1) R-core confirms that yes, as.Date(x) might return integer sometimes; and/or (2) I get a clearer understanding of the underlying issue here.

Unfortunately this "contract" does not hold. I have a Windows dev version installed here and get the following

> typeof(.Date(0L))
[1] "integer"
> typeof(as.Date(.Date(0L)))
[1] "integer"

MichaelChirico · 2024-11-22T22:19:35Z

I have a Windows dev version installed here and get the following

Thanks, also confirmed that on 4.4.1

TysonStanley · 2024-11-22T22:20:36Z

This is strange as Kurt mentioned there wasn't an intention to do this but I figure it would have been fixed by now if that were the case.

ben-schwen · 2024-11-24T16:36:10Z

Update this fix now can convert into one direction from integer to double

# this works
x = data.table(a=1L)
y = data.table(c=1L, d=2)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]
# this still needs to fixed
x = data.table(a=1)
y = data.table(c=1, d=2L)
y[x, on=.(c==a, d==a)]
y[x, on=.(d==a, c==a)]

MichaelChirico · 2024-11-24T23:52:44Z

this still needs to fixed

Should I consider this PR in-progress for now?

The diff has grown enough that it would help to add a brief overview of the changes to the PR description to orient reading

MichaelChirico · 2024-11-24T23:55:19Z

R/bmerge.R

-          newvalue = chmatch(i[[ic]], levels(x[[xc]]), nomatch=0L)
-          if (anyNA(i[[ic]])) newvalue[is.na(i[[ic]])] = NA_integer_  # NA_character_ should match to NA in factor, #3809
-          set(i, j=ic, value=newvalue)
+  if (nrow(i)) {


Haven't read carefully yet, but given the description of the problem as I understand it now, I had expected something like this to be needed:

(1) A pass over the join columns to determine the "join type" of each column
(2) A second pass to actually do the coercions as needed

Does that make sense to you? If so, each step could be a helper function which would improve readability & modularity.

In theory yes, but we actually only have to take a look the case of mixing double and integer columns, since this is the only case where we previously made bmerge smart enough to cater for different types (besides character and factors)

TysonStanley · 2024-11-25T17:06:49Z

Just received the "deadline" on this from Kurt:

Please correct before 2024-12-09 to safely retain your package on CRAN. Note that this will be the final reminder.

This reverts commit 562a9fd.

ben-schwen · 2024-11-27T19:57:51Z

@MichaelChirico the PR is ready for review. What I'm unsure about is if we always want to try to copy at cast_with_atts or maybe want to introduce an extra argument for that. Moreover, I tried to keep everything quite modular.

MichaelChirico · 2024-11-27T22:32:13Z

R/bmerge.R

-        xclass=="factor" || iclass=="factor") {
+    cfl = c("character", "logical", "factor")
+    if (x_merge_type %chin% cfl || i_merge_type %chin% cfl) {
+      msg = "Coercing all-NA %s column %s to type %s to match type of %s.\n"


nit: for translations we'll want to do

msg = gettext(...) ... catf(fmt, domain=NULL)

The run-time difference is 0, but we rely on static analysis to pull out the messages eligible for translation, which my suggestion will facilitate.

MichaelChirico · 2024-11-27T22:37:04Z

R/bmerge.R

@@ -34,86 +34,120 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
    ans
  }

+  cast_with_atts = function(x, as.f) {


should these helpers be declared in the {data.table} namespace instead (not exported)?

if bmerge() gets called dozens/hundreds of times, these helpers would have to be re-created each time. my approach is usually to put such helpers outside function bodies unless it's strongly beneficial for them to inherit directly from the function environment (not the case here AFAICT).

(if it's about matching how getClass (now mergeType) is handled, I would move that helper out too)

oh, I see coerce_col inherits verbose.

Suggestion: verbose_msg = NULL in the signature, then if (!is.null(verbose_msg)) to replace if (verbose)

I found myself wondering `ic`... "`i` character? `i` class?". Simpler to encode more info in the name

MichaelChirico · 2024-11-27T22:53:37Z

R/bmerge.R

-      if (anyNA(i[[ic]]) && allNA(i[[ic]])) {
-        if (verbose) catf("Coercing all-NA %s (%s) to type %s to match type of %s.\n", iname, iclass, xclass, xname)
-        set(i, j=ic, value=match.fun(paste0("as.", xclass))(i[[ic]]))
+    cfl = c("character", "logical", "factor")


weren't factor columns handled in the previous branch?

MichaelChirico · 2024-11-27T22:58:39Z

R/bmerge.R

-    if (xclass == iclass) {
-      if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, xclass, xname)
+    if (x_merge_type == i_merge_type) {
+      if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, x_merge_type, xname)


Aside, probably for follow-up: it's a bit odd for this message to follow the x_merge_type == "factor" || i_merge_type == "factor" branch, since we get "no coercion needed" after coercion was already possibly done...

Oh, nevermind, it's not possible to encounter this if we enter the previous branch -- it's always next'd. Still, it might make sense to move this step first as logically it's the simplest case & improves the flow of the code:

(1) if i_merge_type == x_merge_type, done, just move to next step
(2) now start checking what coercion is needed or it's impossible

R/bmerge.R

MichaelChirico · 2024-11-27T23:04:20Z

R/bmerge.R

      if (wclass=="integer" || (wclass=="double" && !isReallyReal(w[[wc]]))) {
        if (verbose) catf("Coercing %s column %s%s to type integer64 to match type of %s.\n", wclass, nm[1L], if (wclass=="double") " (which contains no fractions)" else "", nm[2L])
        set(w, j=wc, value=bit64::as.integer64(w[[wc]]))
      } else stopf("Incompatible join types: %s is type integer64 but %s is type double and contains fractions", nm[2L], nm[1L])
    } else {
      # just integer and double left
-      if (iclass=="double") {
-        if (!isReallyReal(i[[ic]])) {
+      ic_idx = which(icol == icols) # check if on is joined on multiple conditions, #6602


On first read I think "this can be doing a lot of duplicate work since it's basically for (icol in icols) icol==icols", but actually it only applies to the subset icols[icols != xcols & i_merge_type %in% c("integer", "double", "complex")].

length(icols) will not really be large in general (almost always <10), the subset there is even smaller, I doubt there's any real concern about the "lost efficiency" here unless you see something obvious (my first instinct is duplicated() but that's not quite right, everything else that comes to mind harms readability).

but it might be worth calling out the intentional redundancy in a comment.

MichaelChirico · 2024-11-27T23:36:56Z

IINM this also closes #6554 (cc @rikivillalba)

MichaelChirico · 2024-11-27T23:45:31Z

Aside for follow-up (#6628):

I think we should implement mergeType returning I?Date, POSIXct as separate cases. That would for example solve #3506 and could make the solution here clearer.

MichaelChirico · 2024-11-27T23:54:07Z

R/bmerge.R

-          set(x, j=xc, value=as.double(x[[xc]]))
+          if (length(ic_idx)>1L) {
+            xc_idx = xcols[ic_idx]
+            for (xb in xc_idx[which(vapply_1c(x[0L, xc_idx, with=FALSE], mergeType) == "double")]) {


nit: vapply_1c(.shallow(x, xc_idx), mergeType)? I'm wary of [ overhead even if the resulting table is tiny... I suspect it's easier to digest the .shallow() approach too at a glance

ditto below

MichaelChirico · 2024-11-27T23:56:11Z

inst/tests/tests.Rraw

@@ -15727,7 +15727,8 @@ DT = data.table(z = 1i)
 test(2069.33, DT[DT, on = 'z'], error = "Type 'complex' is not supported for joining/merging")

 # forder verbose message when !isReallyReal Date, #1738
-DT = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days"), 20, replace=TRUE))
+date_dbl = as.Date(as.double(seq(as.Date("2015-01-01"), as.Date("2015-01-05"), by="days")), origin="1970-01-01")


do we still need to edit this test? I'd've thought one effect of this PR is to make the old test keep working as intended...

MichaelChirico · 2024-11-27T23:57:04Z

inst/tests/tests.Rraw

+# coerce Dates to double if join on multiple columns, #6602
+x = data.table(a=1L)
+y = data.table(c=1L, d=1)
+test(2297.01, options=c(datatable.verbose=TRUE), y[x, on=.(c == a, d == a)], data.table(c=1L, d=1L), output="Coercing .*c to type double")


nit: WDYT about using verbose=TRUE argument to [ instead?

a also gets coerced, right? should that be reflected in the output= test?

(also, I might expect d to be the coerced column here...)

MichaelChirico · 2024-11-28T00:00:22Z

inst/tests/tests.Rraw

+test(2297.02, options=c(datatable.verbose=TRUE), y[x, on=.(d == a, c == a)], data.table(c=1L, d=1L), output="Coercing .*c to type double")
+x = data.table(a=1)
+y = data.table(c=1, d=1L)
+test(2297.03, options=c(datatable.verbose=TRUE), y[x, on=.(c == a, d == a)], data.table(c=1L, d=1L), output="Coercing double column x.c (which contains no fractions) to type integer")


Ditto about a's coercion appearing in the output

MichaelChirico · 2024-11-28T00:05:00Z

inst/tests/tests.Rraw

+# real double
+x = data.table(a=1)
+y = data.table(c=1.5, d=1L)
+test(2297.21, y[x, on=.(c == a, d == a)], data.table(c=1, d=1))


It might help to include a non-join column in these tests, here at first glance it's surprising, but this is the right-join-by-default effect at play (i.e. there'd be no rows if nomatch=NULL).

ben-schwen added 2 commits November 3, 2024 22:03

fix type coercion in bmerge

e04e67f

fix bracket

c55bfcc

ben-schwen added 2 commits November 3, 2024 22:24

add test cases

272fc59

fix lint

3aba1cc

fix old test case

1d7a8c2

ben-schwen marked this pull request as ready for review November 3, 2024 21:52

ben-schwen requested a review from MichaelChirico as a code owner November 3, 2024 21:52

iago-pssjd reviewed Nov 19, 2024

View reviewed changes

R/bmerge.R Outdated Show resolved Hide resolved

MichaelChirico reviewed Nov 22, 2024

View reviewed changes

MichaelChirico mentioned this pull request Nov 22, 2024

Emphasize S3-S4 difference in recommendation for how to fix class_equals_linter() r-lib/lintr#2688

Merged

ben-schwen added 4 commits November 24, 2024 15:58

rename x/i class

ff934cc

add minimal test

a75fb0e

indent loop

1bb60bf

add fix in one direction

7ee12aa

MichaelChirico reviewed Nov 24, 2024

View reviewed changes

ben-schwen marked this pull request as draft November 25, 2024 11:58

ben-schwen added 3 commits November 26, 2024 22:06

remove indent to cater for diff

562a9fd

Revert "remove indent to cater for diff"

49f849b

This reverts commit 562a9fd.

remove indent

cfa5d71

ben-schwen added 8 commits November 27, 2024 09:52

start modularize

ca6756f

fix cases

d41bbf2

ensure same types for test

33d7009

add test for codecov

91c53c5

simplify

6c68fb7

fix test on windows

5aaee92

simplify

7a64f22

add coerce function

69a7b20

ben-schwen marked this pull request as ready for review November 27, 2024 18:33

modularize more

3023b5c