Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Break Ties when joining on timestamps #106

Open
colmsnowplow opened this issue Jun 29, 2021 · 2 comments
Open

Break Ties when joining on timestamps #106

colmsnowplow opened this issue Jun 29, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@colmsnowplow
Copy link
Collaborator

An edge case has come up where two things conspired to produce duplicates in the model:

  1. Some users' data have different session_ids at the same time (we think because of a race condition between tabs)
  2. Latency between device and collector results in exactly the same derived_tstamp for these events

These two factors mean that a user has two sessions with exactly the same start_tstamp, but different domain_sessionid's - and these happen to be the first tstamps for the user.

This produces duplicates in the users table when we join on start_tstamp:

The same issue may exist when we join on end_tstamp for aggregates.

This seems very rare, but we should introduce some means of breaking a tie in the case where derived_tstamps happen to evaluate to exactly the same thing.

event 1:
    "collector_tstamp": "2021-06-25 16:51:55.559 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:54.919 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:54.911 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

event 2:
    "collector_tstamp": "2021-06-25 16:51:55.557 UTC",
    "dvce_sent_tstamp": "2021-06-25 16:51:55.077 UTC",
    "dvce_created_tstamp": "2021-06-25 16:51:55.071 UTC",
    "derived_tstamp": "2021-06-25 16:51:55.551 UTC",

(first reported on ZD ticket 27522)

@colmsnowplow colmsnowplow added the bug Something isn't working label Jun 29, 2021
@awoehrl
Copy link

awoehrl commented Jul 9, 2021

Hi,

just to add another data point:
I just ran the model on 100 days of our data and this happened with two user_ids. Both have two session ids, but exactly the same start_time.

Cheers

@bill-warner
Copy link
Contributor

Raised the same issue on the dataform model. I have included a proposed solution there which I think we should also be able to use on the SQL runner version.

snowplow-archive/dataform-data-models#10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants