A rare edge case can occur with the current 'exclude all' duplicates strategy, when an event is processed and a subsequent run contains duplicates of that event along with other, legitimate events. For example:
A run contains a page view with event ID `123` - this event has a page view in session index of 1.

A subsequent run contains a duplicate of that event, along with another, legitimate page view event in the same session. The data from that session in this run will be:

- page view event - event ID: `123`
- page view event - event ID: `123`
- page view event - event ID: `456`
In this second run, the already-processed event `123` will be removed by deduplication, and the new event `456` will be assigned a page view in session index of 1. Page view `123` won't be removed from the table, so we will have a session with two page views that both have a page view in session index of 1.
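For concreteness, this is roughly what the 'exclude all' step does - a minimal sketch, assuming a hypothetical `events_this_run` staging table. Any `event_id` seen more than once in the run is dropped entirely, which is why both copies of `123` disappear and `456` alone gets ranked:

```sql
-- Sketch of an 'exclude all' dedup step (table name hypothetical):
-- any event_id that occurs more than once in this run is dropped entirely.
with counts as (
    select event_id, count(*) as n
    from events_this_run
    group by event_id
)

select e.*
from events_this_run as e
join counts as c on c.event_id = e.event_id
where c.n = 1  -- both copies of 123 fail this filter; 456 survives and gets index 1
```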
We might solve this by using `session_id` to update the table, but this feels somewhat fragile.
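For illustration, a hedged sketch of what the `session_id` approach could look like as a dbt incremental model - the model name and session column are assumptions, and it relies on an adapter that supports the delete+insert strategy:

```sql
-- Hypothetical sketch: key the derived table on the session rather than the
-- page view, so reprocessing a session replaces its stale rows (including 123).
{{
    config(
        materialized='incremental',
        unique_key='domain_sessionid',         -- session key, not page view key
        incremental_strategy='delete+insert'   -- delete matching sessions, re-insert
    )
}}

select *
from {{ ref('page_views_this_run') }}  -- hypothetical staging model
```

The fragility presumably comes from depending on every affected session being fully present in the run being processed.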
We can also solve it by implementing better deduplication logic: keep the first `event_id` (ordered by `collector_tstamp`).
The tricky part is that ideally we only keep the first event if the `collector_tstamp` is not also duplicated, and remove both otherwise (to avoid a cartesian join). However, if we remove both, we still have a chance of hitting this issue.
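A minimal sketch of that keep-first logic, again assuming a hypothetical `events_this_run` staging table: keep the earliest row per `event_id`, unless the earliest `collector_tstamp` is itself tied, in which case all copies are removed:

```sql
with ranked as (
    select
        *,
        row_number() over (
            partition by event_id
            order by collector_tstamp
        ) as rn,
        count(*) over (
            partition by event_id, collector_tstamp
        ) as tstamp_dupes
    from events_this_run
)

select *
from ranked
where rn = 1            -- keep only the earliest copy of each event_id...
  and tstamp_dupes = 1  -- ...and drop even that if its collector_tstamp is tied
```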
One way out of that is to implement a mechanism that applies the incremental logic to all relevant atomic tables (thereby creating deduplicated `_staged` tables for every join that might be involved in a customisation).
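As a rough illustration (the table and column names are assumptions; Redshift-style shredded context tables carry `root_id`/`root_tstamp`), each `_staged` model would repeat the same dedup over its atomic table so that every downstream join sees at most one row per event:

```sql
-- Hypothetical com_example_context_1_staged model: deduplicate the atomic
-- context table with the same keep-first rule before any joins happen.
{{ config(materialized='view') }}

with ranked as (
    select
        *,
        row_number() over (
            partition by root_id       -- context rows join back on root_id = event_id
            order by root_tstamp
        ) as rn
    from {{ source('atomic', 'com_example_context_1') }}
)

select * from ranked where rn = 1
```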