Describe the bug
Relevant issue (partially contains this issue): #808
See the following document: https://github.com/hasadna/anyway/blob/dev/docs/Architecture/CBS.md
See part 6:
Data Loading - Separate into multiple stages - see CBS ETL in-process refactoring. Make sure data is not loaded multiple times and that no duplicates are created.
Current flow: email -> s3 -> updated tables
- email -> s3: can be scheduled once a week, or even once a day.
- s3 -> Data Tables: needs to be scheduled once both accident type 1 and accident type 3 of that month are in s3.
Explanation: Currently we pull the data from the last 4 emails and insert it into s3 (after deleting the previous data). We need to pull only the emails we haven't saved to s3 yet, so we should keep track of the emails we have already read and not re-insert them.
Optional: We can add CBS data versioning in s3. Right now we delete the old data and insert the new data.
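A minimal sketch of the "both accident types are in s3" check for the s3 -> Data Tables stage. The key layout (`cbs/<year>/<month>/accidents_type_<n>/...`) and the function name `months_ready_to_load` are assumptions made up for illustration, not the project's actual layout; the function takes a plain list of keys, so it does not depend on any particular S3 client.

```python
import re
from collections import defaultdict

# Hypothetical key layout, e.g. "cbs/2023/05/accidents_type_1/data.zip".
KEY_PATTERN = re.compile(r"^cbs/(\d{4})/(\d{2})/accidents_type_(\d)/")

# The table-loading stage needs both of these accident types for a month.
REQUIRED_TYPES = {1, 3}


def months_ready_to_load(s3_keys):
    """Return sorted (year, month) pairs that have every required accident type."""
    types_by_month = defaultdict(set)
    for key in s3_keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            continue  # ignore keys that do not follow the assumed layout
        year, month, accident_type = (int(group) for group in match.groups())
        types_by_month[(year, month)].add(accident_type)
    return sorted(
        month for month, types in types_by_month.items() if REQUIRED_TYPES <= types
    )
```

With this shape, the table-loading job could run on any schedule and simply skip months that are still missing one of the two files.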
Expected behavior
Check email once a day. When a new email arrives that we haven't loaded to s3 yet (perhaps tracked in a data versioning table, as mentioned above), load its data to s3.
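One possible shape for the daily check, as a sketch only. It assumes each email exposes a unique message id, and it keeps the "already loaded" record in an in-memory set; in practice that record would more likely live in a database table or in s3. `CbsEmail`, `select_unprocessed`, `run_daily_check` and the `upload` callback are hypothetical names, not existing code in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CbsEmail:
    message_id: str   # unique id of the email (assumed to be available)
    received_at: str  # ISO 8601 timestamp, so string order matches time order


def select_unprocessed(emails, processed_ids):
    """Return the emails whose id has not been recorded yet, oldest first."""
    new_emails = (e for e in emails if e.message_id not in processed_ids)
    return sorted(new_emails, key=lambda e: e.received_at)


def run_daily_check(emails, processed_ids, upload):
    """Upload each new email once; record its id only after the upload returns."""
    loaded = []
    for email in select_unprocessed(emails, processed_ids):
        if email.message_id in processed_ids:
            continue  # the same id appeared twice in this batch
        upload(email)  # hypothetical callback that writes the attachment to s3
        processed_ids.add(email.message_id)
        loaded.append(email.message_id)
    return loaded
```

Recording the id only after `upload` returns means a failed upload is retried on the next run instead of being silently skipped.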