Add backfill functionality #117

Open
bill-warner opened this issue Dec 22, 2021 · 3 comments

@bill-warner
Contributor

Currently it is tricky to backfill new custom modules. The easiest path is to tear everything down and start again. This is inefficient, particularly when the custom module is completely independent of the 'core' derived tables, meaning those tables could in theory be left untouched.

The issue is that the manifest system used for incrementalisation has no insight into which modules have consumed which events, only that an event has been processed at some point.

Solution

It is hard to dynamically backfill data in the model, but we could assist the process by populating the events_staged table with all the events that have been processed up to the current point in time. This could then be consumed by the new custom module as part of a one-off job, after which we revert to the standard job (now including the newly backfilled custom module).

How this looks in practice, using BigQuery as an example:

Before running the backfill we should ensure all _staged tables are empty, i.e. all data has been consumed by the standard modules. Then:

  1. Calculate the limits for the run (we can skip steps 2-4 in the base module). In 05-batch-limits.sql:
CREATE OR REPLACE TABLE {{.scratch_schema}}.base_run_limits{{.entropy}}
AS(
  SELECT
    MIN(collector_tstamp) AS lower_limit,
    MAX(collector_tstamp) AS upper_limit

  FROM
    {{.output_schema}}.base_event_id_manifest{{.entropy}}
);
  2. Generate events_this_run. In 06-events-this-run, join on the event manifest rather than the session manifest to get all events previously processed, using the limits calculated in the previous step:
CREATE OR REPLACE TABLE {{.scratch_schema}}.events_this_run{{.entropy}}
AS(
  -- Without downstream joins, it's safe to dedupe by picking the first event_id found.
  SELECT
    ARRAY_AGG(e ORDER BY e.collector_tstamp LIMIT 1)[OFFSET(0)].*
  FROM (
    SELECT
        a.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0[SAFE_OFFSET(0)].id AS page_view_id,
        a.* EXCEPT(contexts_com_snowplowanalytics_snowplow_web_page_1_0_0)

    FROM
      {{.input_schema}}.events a
    INNER JOIN
      {{.scratch_schema}}.base_event_id_manifest{{.entropy}} b
    ON a.event_id = b.event_id
    WHERE
      a.collector_tstamp >= LOWER_LIMIT
      AND a.collector_tstamp <= UPPER_LIMIT

      {{if eq (or .derived_tstamp_partitioned false) true}}

        AND a.derived_tstamp >= LOWER_LIMIT
        AND a.derived_tstamp <= UPPER_LIMIT

      {{end}}

  ) e
  GROUP BY
    e.event_id
);
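
Note that LOWER_LIMIT and UPPER_LIMIT above are scripting variables. A minimal sketch of how they could be declared and set from the base_run_limits table created in step 1 (this would sit at the top of the file), assuming the same BigQuery scripting pattern the standard base module uses:

DECLARE LOWER_LIMIT, UPPER_LIMIT TIMESTAMP;

SET (LOWER_LIMIT, UPPER_LIMIT) = (
  SELECT AS STRUCT
    lower_limit,
    upper_limit
  FROM
    {{.scratch_schema}}.base_run_limits{{.entropy}}
);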

The reason for inner joining the events table with the base_event_id_manifest, rather than simply processing all events between the lower_limit and upper_limit, is to ensure we don't feed the new custom module late-arriving events that haven't yet been consumed by the standard modules. Doing so could leave the modules out of sync.

  3. Commit this to events_staged using the standard step 8.
  4. For the cleanup steps of the base module, skip the manifest step.
  5. Let the custom module consume events_staged.
  6. Run 98-truncate-base-staged in the page views module to truncate events_staged.
  7. Revert to the standard job.

This alternative base module logic could be toggled on/off using a backfill boolean in the playbook.
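
As a sketch of how that toggle might look (the .backfill variable is hypothetical and not part of the current playbooks), the templated SQL could branch in the same way the model already does for derived_tstamp_partitioned:

{{if eq (or .backfill false) true}}

  -- Backfill variant: derive the run limits from the full event manifest
  CREATE OR REPLACE TABLE {{.scratch_schema}}.base_run_limits{{.entropy}}
  AS(
    SELECT
      MIN(collector_tstamp) AS lower_limit,
      MAX(collector_tstamp) AS upper_limit

    FROM
      {{.output_schema}}.base_event_id_manifest{{.entropy}}
  );

{{else}}

  -- Standard incremental limit logic as it exists today

{{end}}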

One potential problem: if the backfill is particularly large it may not be possible to process all the data in one go, in which case you would have to chunk the backfill into, say, n-month batches. This adds complication because sessions that straddle batches would need to be reprocessed in the subsequent batch. This could be solved but would require slightly more complex logic.

@colmsnowplow
Collaborator

colmsnowplow commented Dec 22, 2021

I like the plan!

> One potential problem: if the backfill is particularly large it may not be possible to process all the data in one go, in which case you would have to chunk the backfill into, say, n-month batches. This adds complication because sessions that straddle batches would need to be reprocessed in the subsequent batch. This could be solved but would require slightly more complex logic.

I wonder if we might be able to solve this with some kind of mechanism which subsets the session ID manifest and uses it to limit the data... If the goal is to chunk the job into batches it can be fairly naive, e.g.:

  • Copy the session manifest to a new backfill_session_manifest if it doesn't exist (so we don't re-create it if we're on chunk >1 of the backfill)
  • Take the top n rows from that manifest (since it seems this can be arbitrary - we don't seem to need any more nuanced logic than 'limit the amount of data going in')
  • Limit the data going into the base module to only those session IDs
  • Run model
  • At the end, delete those session IDs from the backfill manifest
  • If the backfill manifest is now empty, drop the table (so a new backfill can basically start again)

We would need to make sure that the initial limits of the run still cover the entire backfill period, I think, but once we've identified a specific chunk of data to process we can further constrain the limits.
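
A rough BigQuery sketch of that chunking mechanism; the table and column names here (base_session_id_manifest, backfill_session_manifest, session_id) and the chunk size are illustrative assumptions rather than taken from the model:

-- 1. Create the backfill manifest once, at the start of the backfill
CREATE TABLE IF NOT EXISTS {{.output_schema}}.backfill_session_manifest{{.entropy}}
AS(
  SELECT * FROM {{.output_schema}}.base_session_id_manifest{{.entropy}}
);

-- 2. Pick an arbitrary chunk of sessions to process in this run
CREATE OR REPLACE TABLE {{.scratch_schema}}.backfill_sessions_this_run{{.entropy}}
AS(
  SELECT session_id
  FROM {{.output_schema}}.backfill_session_manifest{{.entropy}}
  LIMIT 100000
);

-- 3. Run the model, limiting the base module's input to these session IDs.

-- 4. After the run, remove the processed sessions from the backfill manifest;
--    once it is empty, drop the table so a future backfill starts clean.
DELETE FROM {{.output_schema}}.backfill_session_manifest{{.entropy}}
WHERE session_id IN (
  SELECT session_id FROM {{.scratch_schema}}.backfill_sessions_this_run{{.entropy}}
);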

I'm hazy on specifics so not sure whether or not it would work, but perhaps it sparks some ideas. :)

@bill-warner
Contributor Author

Yep, agreed we could certainly use the data within the sessions manifest to assist with the batching. Rather than the approach outlined above, perhaps there is an alternative method making use of the entropy suffix on table names within the base module.

By changing the entropy only within the base module we can effectively create a temporary base module to handle the backfill, which can then be discarded entirely once completed. This would mean significantly fewer changes to the existing codebase. Downstream modules that reference events_staged would need to have all references updated to reflect this temporary entropy value. I haven't fully worked this through, but it is worth considering.
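
To make that concrete, a hedged sketch of the idea (the _backfill suffix is illustrative and the exact playbook wiring would need working out):

-- Standard run: the base module's entropy is empty, so downstream modules read
--   {{.scratch_schema}}.events_staged
-- Backfill run: only the base module's playbook sets entropy = "_backfill", so the
-- one-off base module writes a separate staging table, e.g. events_staged_backfill.
--
-- The custom module's one-off backfill playbook would then point its input at the
-- temporary table:
SELECT *
FROM {{.scratch_schema}}.events_staged{{.entropy}}  -- resolves to events_staged_backfill for this run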

@colmsnowplow
Collaborator

That could be a nice solution! I think there's likely some clever configuration approach that would resolve the referencing issue.

What I really like about this idea is that it gives the opportunity for human-in-the-loop (HITL) intervention during the recovery process - one could run the recovery base module, check that the results are as expected, then run the part which integrates it into the other modules, if desired.
