Add backfill functionality #117

Open
bill-warner opened this issue Dec 22, 2021 · 3 comments

@bill-warner
Contributor

Currently it is tricky to backfill new custom modules. The easiest path is to tear everything down and start again. This is inefficient, particularly when the custom module is completely independent of the 'core' derived tables, meaning those tables could in theory be left untouched.

The issue is that the manifest system used for incrementalisation has no insight into which modules have consumed which events, only that an event has been processed at some point.

Solution

It is hard to dynamically backfill data in the model, but we could assist the process by populating the events_staged table with all the events that have been processed up to the current point in time. This could then be consumed by the new custom module as part of a one-off job, after which we revert to the standard job (now including the newly backfilled custom module).

How this looks in practice, using BigQuery as an example:

Before running the backfill we should ensure all _staged tables are empty, i.e. all data has been consumed by the standard modules. Then:

  1. Calculate the limits for the run (we can skip steps 2-4 in the base module). In 05-batch-limits.sql:
CREATE OR REPLACE TABLE {{.scratch_schema}}.base_run_limits{{.entropy}}
AS(
  SELECT
    MIN(collector_tstamp) AS lower_limit,
    MAX(collector_tstamp) AS upper_limit

  FROM
    {{.output_schema}}.base_event_id_manifest{{.entropy}}
);
  2. Generate events_this_run. In 06-events-this-run, join on the event manifest rather than the session manifest to get all events previously processed, using the limits calculated in the previous step:
CREATE OR REPLACE TABLE {{.scratch_schema}}.events_this_run{{.entropy}}
AS(
  -- Without downstream joins, it's safe to dedupe by picking the first event_id found.
  SELECT
    ARRAY_AGG(e ORDER BY e.collector_tstamp LIMIT 1)[OFFSET(0)].*
  FROM (
    SELECT
        a.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[SAFE_OFFSET(0)].id AS page_view_id,
        a.* EXCEPT(contexts_com_snowplowanalytics_snowplow_web_page_1_0_0)

    FROM
      {{.input_schema}}.events a
    INNER JOIN
      {{.scratch_schema}}.base_event_id_manifest{{.entropy}} b
    ON a.event_id = b.event_id
    WHERE
      a.collector_tstamp >= LOWER_LIMIT
      AND a.collector_tstamp <= UPPER_LIMIT

      {{if eq (or .derived_tstamp_partitioned false) true}}

        AND a.derived_tstamp >= LOWER_LIMIT
        AND a.derived_tstamp <= UPPER_LIMIT

      {{end}}

  ) e
  GROUP BY
    e.event_id
);
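
Note that LOWER_LIMIT and UPPER_LIMIT above are scripting variables. A minimal sketch of how they could be declared and set from the base_run_limits table created in step 1 (this would sit at the top of the file), assuming the same BigQuery scripting pattern the standard base module uses:

DECLARE LOWER_LIMIT, UPPER_LIMIT TIMESTAMP;

SET (LOWER_LIMIT, UPPER_LIMIT) = (
  SELECT AS STRUCT
    lower_limit,
    upper_limit
  FROM
    {{.scratch_schema}}.base_run_limits{{.entropy}}
);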

The reason for inner joining the events table with the base_event_id_manifest, rather than simply processing all events between the lower_limit and upper_limit, is to ensure we don't feed the new custom module late-arriving events that haven't yet been consumed by the standard modules. Doing so could leave the modules out of sync.

  3. Commit this to events_staged using the standard step 8.
  4. For the cleanup steps of the base module, skip the manifest step.
  5. Let the custom module consume events_staged.
  6. Run 98-truncate-base-staged in the page views module to truncate events_staged.
  7. Revert to the standard job.

This alternative base module logic could be toggled on/off using a backfill boolean in the playbook.
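
As a sketch of how that toggle might look (the .backfill variable is hypothetical and not part of the current playbooks), the templated SQL could branch in the same way the model already does for derived_tstamp_partitioned:

{{if eq (or .backfill false) true}}

  -- Backfill variant: derive the run limits from the full event manifest
  CREATE OR REPLACE TABLE {{.scratch_schema}}.base_run_limits{{.entropy}}
  AS(
    SELECT
      MIN(collector_tstamp) AS lower_limit,
      MAX(collector_tstamp) AS upper_limit

    FROM
      {{.output_schema}}.base_event_id_manifest{{.entropy}}
  );

{{else}}

  -- Standard incremental limit logic as it exists today

{{end}}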

One potential problem: if the backfill is particularly large it may not be possible to process all the data in one go, in which case you would have to chunk the backfill into, say, n-month batches. This adds complication because sessions that straddle batches would need to be reprocessed in the subsequent batch. This could be solved but would require slightly more complex logic.

@colmsnowplow
Collaborator

colmsnowplow commented Dec 22, 2021

I like the plan!

> One potential problem: if the backfill is particularly large it may not be possible to process all the data in one go, in which case you would have to chunk the backfill into, say, n-month batches. This adds complication because sessions that straddle batches would need to be reprocessed in the subsequent batch. This could be solved but would require slightly more complex logic.

I wonder if we might be able to solve this with some kind of mechanism which subsets the session ID manifest and uses it to limit the data... If the goal is to chunk the job into batches it can be fairly naive, e.g.:

  • Copy the session manifest to a new backfill_session_manifest if it doesn't exist (so we don't re-create it if we're on chunk >1 of the backfill)
  • Take the top n rows from that manifest (since it seems this can be arbitrary - we don't seem to need any more nuanced logic than 'limit the amount of data going in')
  • Limit the data going into the base module to only those session IDs
  • Run model
  • At the end, delete those session IDs from the backfill manifest
  • If the backfill manifest is now empty, drop the table (so a new backfill can basically start again)

We would need to make sure that the initial limits of the run still cover the entire backfill period, I think, but once we've identified a specific chunk of data to process we can further constrain the limits.
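
A rough BigQuery sketch of that chunking mechanism; the table and column names here (base_session_id_manifest, backfill_session_manifest, session_id) and the chunk size are illustrative assumptions rather than taken from the model:

-- 1. Create the backfill manifest once, at the start of the backfill
CREATE TABLE IF NOT EXISTS {{.output_schema}}.backfill_session_manifest{{.entropy}}
AS(
  SELECT * FROM {{.output_schema}}.base_session_id_manifest{{.entropy}}
);

-- 2. Pick an arbitrary chunk of sessions to process in this run
CREATE OR REPLACE TABLE {{.scratch_schema}}.backfill_sessions_this_run{{.entropy}}
AS(
  SELECT session_id
  FROM {{.output_schema}}.backfill_session_manifest{{.entropy}}
  LIMIT 100000
);

-- 3. Run the model, limiting the base module's input to these session IDs.

-- 4. After the run, remove the processed sessions from the backfill manifest;
--    once it is empty, drop the table so a future backfill starts clean.
DELETE FROM {{.output_schema}}.backfill_session_manifest{{.entropy}}
WHERE session_id IN (
  SELECT session_id FROM {{.scratch_schema}}.backfill_sessions_this_run{{.entropy}}
);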

I'm hazy on specifics so not sure whether or not it would work, but perhaps it sparks some ideas. :)

@bill-warner
Contributor Author

Yep, agreed we could certainly use the data within the sessions manifest to assist with the batching. Rather than the approach outlined above, perhaps there is an alternative method making use of the entropy suffix on table names within the base module.

By changing the entropy only within the base module we can effectively create a temporary base module to handle the backfill, which can then be discarded entirely once completed. This would mean significantly fewer changes to the existing codebase. Downstream modules that reference events_staged would need to have all references updated to reflect this temporary entropy value. I haven't fully worked this through, but it is worth considering.
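
To make that concrete, a hedged sketch of the idea (the _backfill suffix is illustrative and the exact playbook wiring would need working out):

-- Standard run: the base module's entropy is empty, so downstream modules read
--   {{.scratch_schema}}.events_staged
-- Backfill run: only the base module's playbook sets entropy = "_backfill", so the
-- one-off base module writes a separate staging table, e.g. events_staged_backfill.
--
-- The custom module's one-off backfill playbook would then point its input at the
-- temporary table:
SELECT *
FROM {{.scratch_schema}}.events_staged{{.entropy}}  -- resolves to events_staged_backfill for this run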

@colmsnowplow
Collaborator

That could be a nice solution! I think there's likely some clever configuration approach that would resolve the referencing issue.

What I really like about this idea is that it gives the opportunity for human-in-the-loop (HITL) intervention during the recovery process - one could run the recovery base module, check that the results are as expected, then run the part which integrates it into the other modules, if desired.
