Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Joins do not warn user when using POSc and Date comparisons #6605

Open
al-obrien opened this issue Nov 6, 2024 · 9 comments · May be fixed by #6610
Open

Joins do not warn user when using POSc and Date comparisons #6605

al-obrien opened this issue Nov 6, 2024 · 9 comments · May be fixed by #6610
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins

Comments

@al-obrien
Copy link

Issue

There is a possible silent warning/error that occurs when joining on POSc and Date columns. It is unclear if this is expected behaviour based on my readings.

The example below shows how the same tables (dt1 and dt2) either join or fail to (left) join based upon the format of the date/time columns in dt1 (either POSc or Date). Although the simple <= and >= comparisons on vectors pass a warning, there is no warning passed when performing a join, which led to confusion as to the cause of join difference.

Example

library(data.table)

# POSc
dt1_posc <- data.table(id = 1,
		date_time1 = as.POSIXct('2017-01-01'),
		date_time2 = as.POSIXct('2020-01-01'), 
		infocol = 'test')
dt1 <- data.table(id = 1,
		date_time1 = as.Date('2017-01-01'),
		date_time2 = as.Date('2020-01-01'), 
		infocol = 'test')

dt2 <- data.table(id = c(1, 1, 2),
		date_1=as.Date(c('2018-01-01', '2019-01-01', '2010-01-01')),
		date_2=as.Date(c('2018-06-01', '2020-11-01', '2010-01-01')))

# Works
dt1[dt2, on = .(id, date_time1 <= date_2, date_time2 >= date_1)]
#>      id date_time1 date_time2 infocol
#>   <num>     <Date>     <Date>  <char>
#>1:     1 2018-06-01 2018-01-01    test
#>2:     1 2020-11-01 2019-01-01    test
#>3:     2 2010-01-01 2010-01-01    <NA>

# Doesnt work
dt1_posc[dt2, on = .(id, date_time1 <= date_2, date_time2 >= date_1)]
#>      id date_time1 date_time2 infocol
#>   <num>     <Date>     <Date>  <char>
#>1:     1 2018-06-01 2018-01-01    <NA>
#>2:     1 2020-11-01 2019-01-01    <NA>
#>3:     2 2010-01-01 2010-01-01    <NA>

# Check simple comparison
dt1_posc$date_time1 <= dt2$date_2 & dt1_posc$date_time2 >= dt2$date_1
#> [1] FALSE FALSE FALSE
#> Warning message:
#> Incompatible methods ("Ops.POSIXt", "Ops.Date") for "<=" 
dt1$date_time1 <= dt2$date_2 & dt1$date_time2 >= dt2$date_1
#> [1]  TRUE  TRUE FALSE

Session info

I used the latest version of data.table on CRAN (1.16.2) in the following session

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_Canada.utf8  LC_CTYPE=English_Canada.utf8    LC_MONETARY=English_Canada.utf8 LC_NUMERIC=C                    LC_TIME=English_Canada.utf8    

time zone: America/Edmonton
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.16.2

loaded via a namespace (and not attached):
[1] compiler_4.3.1    tools_4.3.1       rstudioapi_0.15.0 
@r2evans
Copy link

r2evans commented Nov 6, 2024

I think your issue is not that it does or doesn't join, it's that base R's warning is either bypassed or suppressed, is that right? Unfortunately in R, the units for Date (day) and POSIXt (second) are not automatically adjusted, requiring user interaction to perform accurately. (Fixing this is on my wishlist-for-R.)

The lack of a warning does make me wonder (not quite enough yet to look under the hood at the source) how data.table is implementing a number-like join.

(Similarly, I'd really like data.table to stop warning about difference in "tzone" when comparing just times, since it is not about units, and they compare quite accurately. But I don't want to hijack this thread for that wishlist item :-)

@al-obrien
Copy link
Author

I think your issue is not that it does or doesn't join, it's that base R's warning is either bypassed or suppressed, is that right?

That's correct. Its understandable that joins on a POSc and Date column would behave differently but the warning to notify the user on the type difference, in this instance, seems to be bypassed (compared to when using those operators outside of a join context).

@r2evans
Copy link

r2evans commented Nov 6, 2024

My guess is that something internal "knows" that the underlying join keys are both numeric under the hood, and it just works regardless of other attributes of the numbers. (That's one way I might have approached it.)

@ben-schwen
Copy link
Member

This is somewhat related to #6602 and seems to be the same as #1200. Under the hood joins in data.table check if the join keys check are compatible and coerces, if needed (as @r2evans suspected). This can be seen nicely when turning on verbosity options(datatable.verbose=TRUE).

@r2evans
Copy link

r2evans commented Nov 7, 2024

This question seems to be a better structure for the "issue", suggest it might be useful to either keep this issue and close 1200 as a dupe, or at least update 1200 with a better reprex (either from the SO question it references or from this issue).

Further, do you (@ben-schwen) think it's a good idea and likely that having the data.table-join mechanics check for date/posix attributes to see if this is necessary? It can't have a "zero" effect on join performance, though I don't think it'd be much if even perceivable to mortals.

@ben-schwen
Copy link
Member

Yes, we should do some prechecks on compatible types. Ofc, this will never catch everything but it should at least catch common pitfalls (similar as we already do in rbindlist).

@r2evans
Copy link

r2evans commented Nov 9, 2024

@ben-schwen , I haven't studied the data.table-join source, but I can try taking a look at it ...

Some questions:

  • other than this issue, are there other class-checks that have been discussed elsewhere? it makes sense to roll all such prechecks into the same PR
  • for Date-vs-POSIXt, two options in order of simplicity and straight-forward-ness:
    1. Warn, coerce the date to Date 00:00:00 (early midnight), include a hint to that in the warning; or
    2. Based on LHS-vs-RHS, try to infer whether it should be 00:00:00 or 24:00:00 - 1e-9; I haven't thought fully about this yet, not sure if there are gotchas

Offhand, do you know exactly where this precheck would go in the code?

@r2evans
Copy link

r2evans commented Nov 9, 2024

@ben-schwen, here is one possible patch (I'm happy to provide as a PR). It works, but it is doing precisely one bespoke precheck: if both columns to join are in c("Date", "POSIXt", "ITime", "times") (the latter is from chron::) but they are not the same, then it warns and continues with the join.

dt1_posc[dt2, on = .(id, date_time1 <= date_2, date_time2 >= date_1)]
# Warning in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  :
#   Attempting to join column x.date_time1 (POSIXt) with column i.date_2 (Date), they are unlikely to be compatible numbers. Suggest you convert one to the other's class.
# Warning in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  :
#   Attempting to join column x.date_time2 (POSIXt) with column i.date_1 (Date), they are unlikely to be compatible numbers. Suggest you convert one to the other's class.
#       id date_time1 date_time2 infocol
#    <num>     <Date>     <Date>  <char>
# 1:     1 2018-06-01 2018-01-01    <NA>
# 2:     1 2020-11-01 2019-01-01    <NA>
# 3:     2 2010-01-01 2010-01-01    <NA>

With this patch:

--- a/R/bmerge.R
+++ b/R/bmerge.R
@@ -70,6 +70,14 @@ bmerge = function(i, x, icols, xcols, roll, rollends, nomatch, mult, ops, verbos
       }
       stopf("Incompatible join types: %s (%s) and %s (%s). Factor columns must join to factor or character columns.", xname, xclass, iname, iclass)
     }
+    timeclasses = c("Date", "POSIXt", "ITime", "times")
+    xclass_time = intersect(class(x[[xc]]), timeclasses)
+    iclass_time = intersect(class(i[[ic]]), timeclasses)
+    if (length(xclass_time) > 0L && length(iclass_time) > 0L &&
+          !identical(sort(xclass_time), sort(iclass_time))) {
+      warningf("Attempting to join column %s (%s) with column %s (%s), they are unlikely to be compatible numbers. Suggest you convert one to the other's class.",
+               xname, toString(xclass_time), iname, toString(iclass_time))
+    }
     if (xclass == iclass) {
       if (verbose) catf("%s has same type (%s) as %s. No coercion needed.\n", iname, xclass, xname)
       next

I'm making the assumption that the follow-on checks (using getClass(..)) will find issues if a time-class is joined against something else (that is not number-like). If you can think of other class incompatibilities that must be checked, I wonder if it would be better to define a function (perhaps check_merge_compatibility(class(x), class(i))), though that only really seems to scratch the itch of "declarative programming" and not adding much to efficiency/speed, and reduces the clutter only a little.

It could alternatively stopf instead of warn, I'll leave that decision up to the core devs here.

@ben-schwen
Copy link
Member

Looks good. I would only add nanotime to commonly used timeclasses.

stopf seems to hard and would probably break a lot of code.

r2evans pushed a commit to r2evans/data.table that referenced this issue Nov 10, 2024
@ben-schwen ben-schwen linked a pull request Nov 14, 2024 that will close this issue
@MichaelChirico MichaelChirico added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants