Modify CBS weekly data collection script #2745

Open
3 of 6 tasks
atalyaalon opened this issue Dec 19, 2024 · 0 comments

The current Airflow configuration is here

  • Add CBS data to an S3 bucket - create a directory (prefix) per provider code and per year (see the upload sketch after this list).
  • Import CBS data from email and upload to AWS - currently the importmail process runs once a week and uploads to S3 the data from the last 2 CBS emails.
  • Trigger the CBS data load routinely, once a week.
    Command to delete a given year's data and reload starting from that year, for example 2019:
    python main.py process cbs --source s3 --load_start_year=2019
    The CBS parser is in anyway/parsers/cbs/executor.py.
    Delete only data starting from the year of the files that have just arrived.
  • Create a DB table to version the emails we load from email to S3, and load new data to S3 ONLY when new email data arrives (see the versioning-table sketch after this list).
  • While loading data from email to S3, detect which years are loaded, and use the earliest year as load_start_year when triggering the CBS load. You may use an additional table (see the trigger sketch after this list).
  • Modify the schedule from weekly back to daily, since CBS data will be loaded only when a new email arrives (see this PR that changed it from daily to weekly); a daily DAG sketch follows the note below.
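
A minimal upload sketch for the per-provider/per-year S3 layout, assuming boto3; the bucket name, key scheme, and local directory layout are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: upload CBS files to s3://<bucket>/cbs/<provider_code>/<year>/<filename>.
# BUCKET and the key layout are hypothetical; adjust to the real anyway-etl setup.
import os

import boto3

BUCKET = "anyway-cbs-data"  # hypothetical bucket name

def upload_cbs_files(local_dir: str, provider_code: int, year: int) -> None:
    """Upload every file in local_dir under a provider-code/year prefix."""
    s3 = boto3.client("s3")
    for filename in os.listdir(local_dir):
        key = f"cbs/{provider_code}/{year}/{filename}"
        s3.upload_file(os.path.join(local_dir, filename), BUCKET, key)

# Usage: upload_cbs_files("/tmp/cbs_2019", provider_code=1, year=2019)
```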
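For the email-versioning task, one possible shape, assuming SQLAlchemy: record a stable identifier per processed email (e.g. its Message-ID header) and skip the S3 upload when it is already recorded. All table and column names here are hypothetical.

```python
# Sketch of an email-versioning table; every name here is hypothetical.
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class ProcessedCbsEmail(Base):
    __tablename__ = "processed_cbs_emails"  # hypothetical table name
    id = Column(Integer, primary_key=True)
    message_id = Column(String, unique=True, nullable=False)  # email Message-ID header
    earliest_data_year = Column(Integer, nullable=False)      # earliest year in the attachments
    processed_at = Column(DateTime, default=datetime.utcnow)

def is_new_email(session: Session, message_id: str) -> bool:
    """Return True only if this email was never uploaded to S3 before."""
    return session.query(ProcessedCbsEmail).filter_by(message_id=message_id).first() is None
```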
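A sketch of deriving load_start_year from the new files and triggering the load with the exact command quoted above; detecting years from file names is an assumed convention.

```python
# Sketch: use the earliest year seen in the new CBS files as load_start_year,
# then trigger the load. Year-in-filename detection is an assumption.
import re
import subprocess

def extract_years(filenames: list) -> set:
    """Collect 4-digit years appearing in the CBS file names (assumed convention)."""
    return {int(m.group()) for name in filenames for m in re.finditer(r"(?:19|20)\d{2}", name)}

def trigger_cbs_load(filenames: list) -> None:
    years = extract_years(filenames)
    if not years:
        return  # nothing recognizable arrived; skip the load entirely
    load_start_year = min(years)  # the earliest year drives deletion + reload
    subprocess.run(
        ["python", "main.py", "process", "cbs",
         "--source", "s3", f"--load_start_year={load_start_year}"],
        check=True,
    )
```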

Note that the CBS processes are now in the anyway-etl repo - see the process repo here
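
A minimal sketch of the daily schedule with a guard that skips the load when no new email arrived, assuming Airflow 2.x; the DAG id, task ids, and stub callables are hypothetical.

```python
# Sketch: daily DAG that short-circuits when the versioning table shows no new email.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def new_email_arrived() -> bool:
    return False  # placeholder: query the email-versioning table

def load_cbs() -> None:
    pass  # placeholder: run `python main.py process cbs --source s3 --load_start_year=<year>`

with DAG(
    dag_id="cbs_daily",          # hypothetical DAG id
    schedule_interval="@daily",  # back to daily, as proposed in this issue
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    check = ShortCircuitOperator(task_id="check_new_email", python_callable=new_email_arrived)
    load = PythonOperator(task_id="load_cbs", python_callable=load_cbs)
    check >> load
```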
