Add backfill functionality #117
Comments
I like the plan!
I wonder if we might be able to solve this with some kind of mechanism which subsets the session ID manifest and uses it to limit the data... If the goal is to chunk the job into batches, it can be fairly naive, e.g.:
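Something naive like the following, perhaaps as a minimal sketch (the manifest table and column names `base_session_id_manifest`, `session_id` and `min_tstamp` are assumptions for illustration):

```sql
-- Take the Nth chunk of the session ID manifest, ordered by first-seen
-- timestamp, and use it to limit the sessions processed in this batch.
CREATE OR REPLACE TABLE scratch.backfill_session_ids AS
SELECT session_id
FROM scratch.base_session_id_manifest
ORDER BY min_tstamp
LIMIT 100000 OFFSET 0;  -- batch 1; bump OFFSET by the LIMIT for each subsequent batch
```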
We would need to make sure that the initial limits of the run still cover the entire backfill period, I think, but then once we've identified a specific chunk of data to process we can further constrain the limits. I'm hazy on the specifics, so I'm not sure whether or not it would work, but perhaps it sparks some ideas. :)
Yep, agreed we could certainly use the data within the sessions manifest to assist with the batching. Rather than the approach outlined above, perhaps there is an alternative method making use of the `entropy` variable. By changing the entropy only within the base module we can effectively create a temporary base module to handle the backfill, which can then be discarded entirely once completed. This would mean significantly fewer changes to the existing codebase. Downstream modules that reference the base module's tables would need some handling, though.
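Roughly, the effect would be something like this (a sketch only; the `_backfill` suffix and the table/column names are assumptions about how the entropy gets applied):

```sql
-- Standard run: the base module's templated SQL writes to the usual table.
CREATE TABLE IF NOT EXISTS scratch.events_staged AS
SELECT event_id, collector_tstamp
FROM atomic.events
WHERE FALSE;  -- schema only, for illustration

-- Backfill run: with the entropy set to '_backfill' in the base module only,
-- the same templated SQL writes to a parallel temporary table instead,
-- leaving the standard tables untouched.
CREATE TABLE IF NOT EXISTS scratch.events_staged_backfill AS
SELECT event_id, collector_tstamp
FROM atomic.events
WHERE FALSE;

-- Once the backfill has been consumed, the temporary table is discarded.
DROP TABLE IF EXISTS scratch.events_staged_backfill;
```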
That could be a nice solution! I think there's likely some clever configuration work that could resolve the referencing issue. What I really like about this idea is that it gives the opportunity for HITL intervention during the recovery process: one could run the recovery base module, check that the results are as expected, then run the part which integrates it into the other modules, if desired.
Currently it is tricky to backfill new custom modules. The easiest path is to tear everything down and start again. This is inefficient, particularly when the custom module is completely independent of the 'core' derived tables, meaning those tables could in theory be left untouched.
The issue is that the manifest system used for incrementalisation has no insight into which modules have consumed which events, only that an event has been processed at some point.
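For illustration, the event manifest has roughly the following shape (the exact schema here is an assumption), with no column recording which module consumed the event:

```sql
-- Assumed shape of the event ID manifest: one row per processed event.
-- Nothing here records WHICH downstream modules have consumed the event,
-- only that it has been processed at some point.
CREATE TABLE IF NOT EXISTS scratch.base_event_id_manifest (
  event_id STRING,
  collector_tstamp TIMESTAMP
);
```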
Solution
It is hard to dynamically backfill data in the model, but we could assist the process by populating the `events_staged` table with all the events that have been processed up until the current point in time. This could then be consumed by the new custom module as part of a one-off job, before reverting back to the standard job (now including the newly filled custom module).

How this looks in practice, using BQ as an example:
Before running the backfill we should ensure all `_staged` tables are empty, i.e. all data has been consumed by the standard modules. Then:

1. In `05-batch-limits.sql`, calculate limits which cover the entire range of previously processed events, rather than just the latest batch, for use in `events_this_run` (see the sketch below this list).
2. In `06-events-this-run`, join in the event manifest rather than the session manifest to get all events previously processed, while using the limits calculated in the last step. The reason for inner joining the `events` table with the `base_event_id_manifest`, rather than just processing all events between the `lower_limit` and `upper_limit`, is to ensure we don't process previously unseen late-arriving events into the new custom module that haven't previously been consumed by the standard modules, which could result in the modules becoming out of sync.
3. Populate `events_staged` using the standard step 8.
4. Consume `events_staged` in the new custom module as a one-off job.
5. Run `98-truncate-base-staged` in the page views module to truncate the staged events.
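A rough sketch of steps 1 and 2, assuming the manifest shape above (the contents of the real `05`/`06` steps will differ in detail):

```sql
-- Step 1 (05-batch-limits.sql, adjusted for backfill): derive limits that
-- span every event ever processed, rather than just the new data.
DECLARE lower_limit, upper_limit TIMESTAMP;

SET (lower_limit, upper_limit) = (
  SELECT AS STRUCT
    MIN(collector_tstamp),
    MAX(collector_tstamp)
  FROM scratch.base_event_id_manifest
);

-- Step 2 (06-events-this-run, adjusted): inner join on the event manifest
-- rather than the session manifest, so only events the standard modules
-- have already consumed get staged for the backfill.
CREATE OR REPLACE TABLE scratch.events_this_run AS
SELECT e.*
FROM atomic.events AS e
INNER JOIN scratch.base_event_id_manifest AS m
  ON e.event_id = m.event_id
WHERE e.collector_tstamp BETWEEN lower_limit AND upper_limit;
```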
This alternative base module logic could be toggled on/off using a `backfill` boolean in the playbook.

One potential problem is that if the backfill is particularly large, it may not be possible to process all the data in one go. In that case you would have to chunk the backfill into, say, n-month batches. This adds complication, due to sessions that straddle batches and therefore need to be reprocessed in the subsequent batch. This could be solved, but would require slightly more complex logic.