A rare edge case can occur with the current 'exclude all' duplicates strategy, when an event is processed and a subsequent run contains duplicates of that event along with other, legitimate events. For example:
A run contains a page view with event ID `123` - this event has a page view in session index of 1.

A subsequent run contains a duplicate of that event, along with another, legitimate page view event in the same session. The data from that session in this run will be:

- page view event - event ID: `123`
- page view event - event ID: `123`
- page view event - event ID: `456`
In this second run, the already-processed event `123` will be removed by deduplication, and the new event `456` will be assigned a page view in session index of 1. Page view `123` won't be removed from the table, so we will have a session with two page views that both have a page view in session index of 1.
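For concreteness, this is roughly what the 'exclude all' step does - a minimal sketch, assuming a hypothetical `events_this_run` staging table. Any `event_id` seen more than once in the run is dropped entirely, which is why both copies of `123` disappear and `456` alone gets ranked:

```sql
-- Sketch of an 'exclude all' dedup step (table name hypothetical):
-- any event_id that occurs more than once in this run is dropped entirely.
with counts as (
    select event_id, count(*) as n
    from events_this_run
    group by event_id
)

select e.*
from events_this_run as e
join counts as c on c.event_id = e.event_id
where c.n = 1  -- both copies of 123 fail this filter; 456 survives and gets index 1
```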
We might solve this by using `session_id` to update the table, but this feels somewhat fragile.
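For illustration, a hedged sketch of what the `session_id` approach could look like as a dbt incremental model - the model name and session column are assumptions, and it relies on an adapter that supports the delete+insert strategy:

```sql
-- Hypothetical sketch: key the derived table on the session rather than the
-- page view, so reprocessing a session replaces its stale rows (including 123).
{{
    config(
        materialized='incremental',
        unique_key='domain_sessionid',         -- session key, not page view key
        incremental_strategy='delete+insert'   -- delete matching sessions, re-insert
    )
}}

select *
from {{ ref('page_views_this_run') }}  -- hypothetical staging model
```

The fragility presumably comes from depending on every affected session being fully present in the run being processed.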
We can also solve it by implementing better deduplication logic: keep the first `event_id` (ordered by `collector_tstamp`).
The tricky part is that ideally we only keep the first event if the `collector_tstamp` is not also duplicated, and remove both otherwise (to avoid a cartesian join). However, if we remove both, we still have a chance of hitting this issue.
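A minimal sketch of that keep-first logic, again assuming a hypothetical `events_this_run` staging table: keep the earliest row per `event_id`, unless the earliest `collector_tstamp` is itself tied, in which case all copies are removed:

```sql
with ranked as (
    select
        *,
        row_number() over (
            partition by event_id
            order by collector_tstamp
        ) as rn,
        count(*) over (
            partition by event_id, collector_tstamp
        ) as tstamp_dupes
    from events_this_run
)

select *
from ranked
where rn = 1            -- keep only the earliest copy of each event_id...
  and tstamp_dupes = 1  -- ...and drop even that if its collector_tstamp is tied
```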
One way out of that is to implement a mechanism that applies the incremental logic to all relevant atomic tables (thereby creating deduplicated `_staged` tables for every join that might be involved in a customisation).
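As a rough illustration (the table and column names are assumptions; Redshift-style shredded context tables carry `root_id`/`root_tstamp`), each `_staged` model would repeat the same dedup over its atomic table so that every downstream join sees at most one row per event:

```sql
-- Hypothetical com_example_context_1_staged model: deduplicate the atomic
-- context table with the same keep-first rule before any joins happen.
{{ config(materialized='view') }}

with ranked as (
    select
        *,
        row_number() over (
            partition by root_id       -- context rows join back on root_id = event_id
            order by root_tstamp
        ) as rn
    from {{ source('atomic', 'com_example_context_1') }}
)

select * from ranked where rn = 1
```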