[Feature] CBS import data to s3 from new emails #2176

atalyaalon · 2022-04-23T17:42:07Z

Describe the bug
Relevant issue (partially contains this issue): #808

See the following document: https://github.com/hasadna/anyway/blob/dev/docs/Architecture/CBS.md
See part 6:
Data Loading: separate into multiple stages (see CBS ETL in process refactoring). Make sure data is not loaded multiple times and that no duplicates are created.

Current flow: email -> s3 -> updated tables
email -> s3: can be scheduled once a week, or even once a day.
s3 -> data tables: needs to be scheduled once both accident type 1 and accident type 3 data for that month are in s3.

Explanation: We currently pull the data from the last 4 emails and insert it into s3 (after deleting the previous data). Instead, we should pull only the emails we haven't yet saved to s3, which means tracking the emails we have already read and not re-inserting them.

Optional: Add CBS data versioning in s3. Right now we delete the old data and insert the new data.
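To illustrate the scheduling condition for the s3 -> data tables stage, here is a minimal sketch that reports which months have both accident type 1 and accident type 3 available. The function name `months_ready_for_load` and the `(year, month, accident_type)` tuple shape are assumptions for illustration; how those values would be derived from the actual s3 layout is not covered here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Both accident types must be present before a month is loaded into the tables.
REQUIRED_TYPES = frozenset({1, 3})


def months_ready_for_load(
    files: Iterable[Tuple[int, int, int]],
) -> List[Tuple[int, int]]:
    """Return the (year, month) pairs for which every required type exists.

    `files` is a hypothetical listing of what is already in s3, one
    (year, month, accident_type) tuple per stored data set.
    """
    types_by_month: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
    for year, month, accident_type in files:
        types_by_month[(year, month)].add(accident_type)
    return sorted(
        month
        for month, types in types_by_month.items()
        if REQUIRED_TYPES <= types  # subset test: types 1 and 3 both seen
    )
```

A scheduled job could call this after each email -> s3 run and trigger the table load only for the months it returns.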

Expected behavior
Check the email once a day. When a new email arrives that we haven't yet loaded to s3 (perhaps tracked in a data versioning table, as mentioned above), load its data to s3.
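One possible shape for this behavior, as a rough sketch only: keep a record of the emails already saved and skip them on later runs. The `CbsEmail` class, its fields, the `load_new_emails` function, and the `upload` callable are all hypothetical names; `processed_ids` stands in for whatever persistent tracking table is eventually chosen.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class CbsEmail:
    """Hypothetical minimal view of a CBS email; field names are assumptions."""
    message_id: str  # any stable unique identifier for the email
    subject: str


def load_new_emails(
    emails: Iterable[CbsEmail],
    processed_ids: Set[str],
    upload: Callable[[CbsEmail], None],
) -> List[str]:
    """Upload only the emails whose ID has not been recorded yet.

    An ID is recorded only after `upload` returns, so an upload that
    raises is retried on the next scheduled run.
    """
    loaded: List[str] = []
    for email in emails:
        if email.message_id in processed_ids:
            continue  # already saved to s3 on an earlier run
        upload(email)  # assumed to raise on failure
        processed_ids.add(email.message_id)
        loaded.append(email.message_id)
    return loaded
```

Running this once a day makes the job idempotent: a second run over the same inbox uploads nothing, so the existing s3 data does not need to be deleted first.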

atalyaalon added the bug label Apr 23, 2022

atalyaalon added this to the Future milestone Apr 23, 2022

ziv17 added the prio 1 label Jun 3, 2022
