[ETL] Issues with monitoring/testing #110
Comments
Joy's comment: The ACs look like they will be costly to implement, since we are leveraging Spring Batch here and the queuing and tables are managed by it. For AC1 (monitoring), we can rely on logs as the source of truth and ignore the DB.
Himesh's comment: In general, I agree with the issues we aim to resolve here, but I differ on the approach to resolving them.
Vivek's comment: We are using Quartz, by the way, not Spring Batch, for ETL.
Perhaps we should use new Date here.
… actual job execution and add a higher priority trigger for first run of ETL Sync job for an org
AC1 fixed as per Vivek's input above and seems to work well. AC2 (Metabase report) pending. Moving to code review ready so AC1 and AC3 can be tested.
AC2 Metabase Reports: https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes
Alerts can be enabled only after this change is promoted, because the start/end times in scheduled_job_run are inaccurate until then.
Made slight additions to the first report to filter by "SyncJobs" job_group and show OrgCategory and OrgStatus values in readable format. Code review didn't result in any other issues of concern.
Additionally, create the following reports for QA and others to determine ETL job status:
On debugging issue reported by Achala, observed the following in Prod environment:
Discussion thoughts:
Notes: Priorities are only compared when triggers have the same fire time. A trigger scheduled to fire at 10:59 will always fire before one scheduled to fire at 11:00. We therefore configured the first-run trigger with a higher priority and a start time in the past, with the misfire instruction set to fire now. @mahalakshme and @1t5j0y => Code changes are done in the ETL 10.2 branch, since the issue is easily testable only in the prod environment, and the code change is localized, easy to deploy and test in prod, and easy to revert if it causes issues.
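The ordering rule in the note above can be modelled in a few lines of plain Java. This is only a sketch of the rule as stated (fire time first, priority as a tie-breaker), not Quartz code; `TriggerOrderSketch`, `SketchTrigger`, and the sample priorities are hypothetical names and values chosen for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the ordering rule; this is not Quartz code.
final class TriggerOrderSketch {

    // Hypothetical stand-in for a trigger: a name, its next fire time, and a priority.
    static final class SketchTrigger {
        final String name;
        final Instant nextFireTime;
        final int priority;

        SketchTrigger(String name, Instant nextFireTime, int priority) {
            this.name = name;
            this.nextFireTime = nextFireTime;
            this.priority = priority;
        }
    }

    // Earlier fire time always wins; priority (higher first) only breaks ties.
    static int compare(SketchTrigger a, SketchTrigger b) {
        int byFireTime = a.nextFireTime.compareTo(b.nextFireTime);
        if (byFireTime != 0) {
            return byFireTime;
        }
        return Integer.compare(b.priority, a.priority);
    }

    // Returns trigger names in the order this model would fire them.
    static List<String> firingOrder(List<SketchTrigger> triggers) {
        List<SketchTrigger> sorted = new ArrayList<>(triggers);
        sorted.sort(TriggerOrderSketch::compare);
        List<String> names = new ArrayList<>();
        for (SketchTrigger trigger : sorted) {
            names.add(trigger.name);
        }
        return names;
    }
}
```

Under this model, a high-priority first-run trigger scheduled for 11:00 still loses to a low-priority trigger scheduled for 10:59, which is why the first run is given a start time in the past as well as a higher priority.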
@mahalakshme During dev-testing for this issue, we found that the ETL run for OrgGroups takes almost an hour during working hours. This severely delays ETL runs for the rest of the orgs, and our repeatInterval of 120 minutes in prod is not sufficient to accommodate this. We therefore recommend two action items:
@himeshr I couldn't find any related documentation and can't think of an easy way to test this, but will repeated misfires with the "fire now" misfire instruction cause any issues? Is there a limit to the number of times it will try to fire now for a misfire?
"Fire Now" should ideally be used for one-time triggers, which is the case with the FirstRun trigger.
This is a good article for understanding Quartz misfire scenarios and the implications of each choice.
The scenario I am worried about is:
Issues raised regarding ETL runs
Discussion notes with recommendation on how to resolve issues raised for ETL run management
Note: We would need to retrigger ETL jobs after a repeat-frequency config change, using Postman's Run Collection capability.
…in past" This reverts commit f50bed2.
Job chaining might also be viable for ETL to guarantee execution for all orgs and avoid misfires. Will require us to 'splice' newly scheduled jobs into the chain.
Issue:
ETL for rwbngos2023 completed in a minute, but the database shows it taking 15 minutes, which gives a misleading picture.
The image below also shows that the start times of some jobs are earlier than the end times of other jobs. So either the start time of the next job or the end time of the previous job is being recorded incorrectly. This makes it hard to monitor the ETL jobs.
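One way to spot this symptom programmatically is to sort the recorded runs by start time and flag any run that starts before the previous one ended. The sketch below assumes runs are expected to execute one after another; `JobRunOverlapSketch`, `JobRun`, and its field names are hypothetical and do not reflect the actual scheduled_job_run schema.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class JobRunOverlapSketch {

    // Hypothetical view of one recorded run; field names are illustrative only.
    static final class JobRun {
        final String jobName;
        final Instant startedAt;
        final Instant endedAt;

        JobRun(String jobName, Instant startedAt, Instant endedAt) {
            this.jobName = jobName;
            this.startedAt = startedAt;
            this.endedAt = endedAt;
        }
    }

    // Returns "previous -> current" for each consecutive pair (by start time)
    // where the current run starts before the previous run's recorded end.
    static List<String> findOverlaps(List<JobRun> runs) {
        List<JobRun> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparing((JobRun run) -> run.startedAt));
        List<String> overlaps = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            JobRun previous = sorted.get(i - 1);
            JobRun current = sorted.get(i);
            if (current.startedAt.isBefore(previous.endedAt)) {
                overlaps.add(previous.jobName + " -> " + current.jobName);
            }
        }
        return overlaps;
    }
}
```

If jobs really do run sequentially, any pair reported here points to a start or end time that was recorded at the wrong moment rather than to genuinely concurrent work.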
AC:
- When ETL fails for an organisation whose 'Organisation Category' is Production or UAT and whose 'Organisation Status' is Live.
- When one round of ETL takes more than 1.5 hours to complete.
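The second condition can be sketched as a simple duration check. This assumes a "round" spans from the earliest recorded start to the latest recorded end among the runs in one cycle, which is an interpretation rather than something the issue defines; `EtlRoundCheckSketch` and `Run` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

final class EtlRoundCheckSketch {

    // 1.5 hours, as stated in the AC.
    static final Duration ROUND_LIMIT = Duration.ofMinutes(90);

    // Hypothetical start/end pair for one org's ETL run.
    static final class Run {
        final Instant startedAt;
        final Instant endedAt;

        Run(Instant startedAt, Instant endedAt) {
            this.startedAt = startedAt;
            this.endedAt = endedAt;
        }
    }

    // Assumed definition of a round: earliest start to latest end across all runs.
    static Duration roundDuration(List<Run> runs) {
        if (runs.isEmpty()) {
            throw new IllegalArgumentException("a round needs at least one run");
        }
        Instant earliestStart = runs.get(0).startedAt;
        Instant latestEnd = runs.get(0).endedAt;
        for (Run run : runs) {
            if (run.startedAt.isBefore(earliestStart)) {
                earliestStart = run.startedAt;
            }
            if (run.endedAt.isAfter(latestEnd)) {
                latestEnd = run.endedAt;
            }
        }
        return Duration.between(earliestStart, latestEnd);
    }

    // True only when the round took strictly more than the limit.
    static boolean exceedsLimit(List<Run> runs) {
        return roundDuration(runs).compareTo(ROUND_LIMIT) > 0;
    }
}
```

A check like this is only as reliable as the recorded start and end times, so it depends on those timestamps being accurate.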
Technical analysis/suggestions:
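import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

// Build a trigger whose start time is the moment it is created.
Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("triggerName", "triggerGroup")
        .startNow()
        .build();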
scheduled_job_run table with qrtz_job_details
Ignore:
What:
Who: