[ETL] Issues with monitoring/testing #110

Open
Tracked by #109
mahalakshme opened this issue Sep 30, 2024 · 16 comments

@mahalakshme
Contributor

mahalakshme commented Sep 30, 2024

Issue:

ETL for rwbngos2023 completed in about a minute, but the database records suggest it took 15 minutes, which gives a misleading picture.

Image

Image

As the image below also shows, the start time of some jobs is earlier than the end time of other jobs. So either the start time of the next job or the end time of the previous job is being recorded incorrectly. This makes the ETL jobs hard to monitor.

Image
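
The symptom described above (a job's recorded start preceding an earlier job's recorded end) can be checked mechanically. What follows is a minimal sketch, assuming ETL runs execute one at a time and that each row carries a start and an end timestamp. The `RunWindow` class and its field names are hypothetical, not the actual `scheduled_job_run` schema.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one recorded run; not the real scheduled_job_run entity.
final class RunWindow {
    final String jobName;
    final Instant startedAt;
    final Instant endedAt;

    RunWindow(String jobName, Instant startedAt, Instant endedAt) {
        this.jobName = jobName;
        this.startedAt = startedAt;
        this.endedAt = endedAt;
    }
}

final class OverlapCheck {
    // Returns the names of runs whose recorded start precedes the latest
    // recorded end among the runs that started before them.
    static List<String> findOverlaps(List<RunWindow> runs) {
        List<RunWindow> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparing((RunWindow r) -> r.startedAt));
        List<String> suspicious = new ArrayList<>();
        Instant latestEnd = null;
        for (RunWindow run : sorted) {
            if (latestEnd != null && run.startedAt.isBefore(latestEnd)) {
                suspicious.add(run.jobName);
            }
            if (latestEnd == null || run.endedAt.isAfter(latestEnd)) {
                latestEnd = run.endedAt;
            }
        }
        return suspicious;
    }
}
```

If runs really are sequential, any name returned here points to a start or end time that was recorded at the wrong moment.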

AC:

  • started_at must be the time the ETL process started for that org, and ended_at must be the time the ETL job finished running.
  • Extend this report to send a notification mail (to Maha alone) in the following cases:
    - When ETL fails for an organisation with 'Organisation Category' production or UAT and 'Organisation Status' Live.
    - When one round of ETL takes more than 1.5 hours to complete.
  • Add a way to trigger the ETL of an org immediately, ahead of the other orgs in the queue. Currently, testing ETL stories is difficult and time-consuming: even after disabling and re-enabling the ETL, it usually takes around 20-30 minutes to trigger for that organisation. The immediate trigger should fire when a 'Trigger' button is clicked on the org page of the super admin. The usual enable/disable need not trigger immediately. This is to prevent accidental disabling and to leave the audit unchanged.

Image

Technical analysis/suggestions:

  • Recording the times in the appropriate callback methods (job listener events) should help fix the start and end times.
  • To trigger immediately: currently we trigger with a scheduler, so the trigger might not take effect immediately even when we specify a start time. Try triggering it once without the scheduler, as below, and then trigger it with the scheduler.
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;

    // One-off trigger: no schedule attached, so it fires once, as soon as possible.
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("triggerName", "triggerGroup")
        .startNow()
        .build();
  • One way to check whether all ETL jobs are triggered within an hour and a half is to cross-check the entries of the scheduled_job_run table against qrtz_job_details.
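
The cross-check in the last bullet could look roughly like the sketch below. It assumes the expected job names have already been read from `qrtz_job_details` and the latest start time per job from `scheduled_job_run`. The class, method and parameter names are hypothetical, and the database access itself is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RoundCheck {
    // Returns the expected jobs that have no recorded start within `window` before `now`.
    // An empty result means every job was triggered inside the window.
    static Set<String> jobsMissingFromRound(Set<String> expectedJobs,
                                            Map<String, Instant> lastStartByJob,
                                            Instant now,
                                            Duration window) {
        Instant cutoff = now.minus(window);
        Set<String> missing = new TreeSet<>();
        for (String job : expectedJobs) {
            Instant lastStart = lastStartByJob.get(job);
            if (lastStart == null || lastStart.isBefore(cutoff)) {
                missing.add(job);
            }
        }
        return missing;
    }
}
```

Calling it with `Duration.ofMinutes(90)` as the window corresponds to the 1.5 hour limit in the AC.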

Ignore:

What:

  • ETL failures
  • time taken for an ETL if it exceeds 15 mins
  • one run completes in 1:30 hours
  • disabled for unneeded orgs

Who:

  • can setup report - get alert - Maha to look into it
  • generate bundle from UAT - store ETL status in the bundle
  • check with implementation team
@mahalakshme mahalakshme converted this from a draft issue Sep 30, 2024
@mahalakshme mahalakshme mentioned this issue Nov 4, 2024
@mahalakshme mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Nov 5, 2024
@mahalakshme mahalakshme changed the title ETL taking long time issue [ETL] Issues with monitoring Nov 5, 2024
@mahalakshme mahalakshme moved this from In Analysis Review to Ready in Avni Product Nov 29, 2024
@mahalakshme mahalakshme changed the title [ETL] Issues with monitoring [ETL] Issues with monitoring/testing Nov 29, 2024
@mahalakshme mahalakshme moved this from Ready to In Analysis in Avni Product Dec 2, 2024
@himeshr
Contributor

himeshr commented Dec 2, 2024

Joy's Comment

The ACs look costly to implement, since we are leveraging Spring Batch here and the queuing and tables are managed by it.

For AC1 (monitoring), we can rely on the logs as the source of truth and ignore the DB.
For AC2, we don't really have a concept of a 'round' of ETL, so this too might be costly/complicated to determine. A per-org check should be easier.
For AC3, would it be sufficient to have an endpoint that disables ETL for all orgs, so we can focus on the org we want to test? Implementing priority within the queue is again going to be costly.

@himeshr
Contributor

himeshr commented Dec 2, 2024

Himesh's Comment

In general, I agree with the issues we aim to resolve here, but I differ on the approach to resolving them.

  • For AC1, I would recommend introducing an additional ETL-JOB-AUDIT table with columns such as ORGANISATION_UUID, ETL_TRIGGER_TYPE (org/orgGroup), ETL_START_TIME, ETL_END_TIME, ETL_JOB_STATUS, ETL_JOB_RUNTIME.
  • For AC2, create a Metabase alert on the ETL-JOB-AUDIT table as per the requirement.
  • For AC3, introduce an ad hoc ETL job that runs in parallel to the Quartz-scheduled ETL jobs. It would run only once, scheduled immediately or within a day based on the queue of ad hoc triggers, and would not change the Quartz-based periodic execution of ETL (there is precedent for this in Avni-Integration-service).
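
As a rough illustration of the audit row proposed for AC1, here is one possible in-memory shape. This is a sketch only: the class name, the enum constants and the decision to derive the runtime from the two timestamps are assumptions, and the thread does not specify the allowed status values.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical shape of one ETL-JOB-AUDIT row; fields follow the columns proposed above.
final class EtlJobAudit {
    enum TriggerType { ORG, ORG_GROUP }

    final String organisationUuid;
    final TriggerType triggerType;
    final Instant startTime;
    final Instant endTime;   // null while the job is still running
    final String jobStatus;  // status values are not specified in the thread, so kept as free text

    EtlJobAudit(String organisationUuid, TriggerType triggerType,
                Instant startTime, Instant endTime, String jobStatus) {
        this.organisationUuid = organisationUuid;
        this.triggerType = triggerType;
        this.startTime = startTime;
        this.endTime = endTime;
        this.jobStatus = jobStatus;
    }

    // ETL_JOB_RUNTIME derived from start and end (an assumption; it could also be stored).
    Duration runtime() {
        return endTime == null ? null : Duration.between(startTime, endTime);
    }
}
```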

@mahalakshme
Contributor Author

Vivek's comment:

We are using Quartz, by the way, not Spring Batch, for ETL.
I think this line may be the problem:

scheduledJobRun.startedAt = trigger.getNextFireTime();

We should perhaps use new Date() here.
We can try to figure out why we are getting this issue, because the database entries are managed by us through the JobListener.
If the job listener is not the best way, we can hook into the actual execution callback that we get and record the times there.
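
The idea above, stamping the time when the job actually starts and finishes instead of copying the trigger's next fire time, can be sketched without any Quartz dependency. The method names below mirror the `jobToBeExecuted` and `jobWasExecuted` callbacks of a Quartz `JobListener`, but the `ExecutionTimeRecorder` class itself is hypothetical and is not the project's listener.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

final class ExecutionTimeRecorder {
    private final Clock clock;
    private final Map<String, Instant> startedAt = new HashMap<>();
    private final Map<String, Instant> endedAt = new HashMap<>();

    // The clock is injected so tests can control "now".
    ExecutionTimeRecorder(Clock clock) {
        this.clock = clock;
    }

    // Call at the point where the job is about to execute.
    void jobToBeExecuted(String jobName) {
        startedAt.put(jobName, clock.instant());
    }

    // Call once the job has finished, whether it succeeded or failed.
    void jobWasExecuted(String jobName) {
        endedAt.put(jobName, clock.instant());
    }

    Instant startedAt(String jobName) {
        return startedAt.get(jobName);
    }

    Instant endedAt(String jobName) {
        return endedAt.get(jobName);
    }
}
```

In production code, `Clock.systemUTC()` would be passed in, which is equivalent in effect to the `new Date()` suggestion.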

@mahalakshme mahalakshme moved this from In Analysis to Ready in Avni Product Dec 4, 2024
@1t5j0y 1t5j0y moved this from Ready to QA Failed in Avni Product Dec 10, 2024
@1t5j0y 1t5j0y moved this from QA Failed to In Progress in Avni Product Dec 10, 2024
@1t5j0y 1t5j0y self-assigned this Dec 10, 2024
1t5j0y added a commit that referenced this issue Dec 13, 2024
… actual job execution and add a higher priority trigger for first run of ETL Sync job for an org
@1t5j0y
Contributor

1t5j0y commented Dec 13, 2024

AC1 fixed as per Vivek's input above and seems to work well.
AC3 fixed by adding an additional higher priority trigger for the first run after enabling ETL for an org.

AC2 (metabase report) pending. Moving to code review ready so AC1 and AC3 can be tested.

@1t5j0y 1t5j0y moved this from In Progress to Code Review Ready in Avni Product Dec 13, 2024
@1t5j0y
Contributor

1t5j0y commented Dec 13, 2024

AC2 Metabase Reports:
https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs-production-environment

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled only after this change is promoted, because the start/end times in scheduled_job_run are inaccurate until then.

@himeshr himeshr moved this from Code Review Ready to In Code Review in Avni Product Dec 13, 2024
@himeshr
Contributor

himeshr commented Dec 13, 2024

AC2 Metabase Reports: https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled only after this change is promoted, because the start/end times in scheduled_job_run are inaccurate until then.

Made slight additions to the first report to filter by "SyncJobs" job_group and show OrgCategory and OrgStatus values in readable format.

Code review didn't result in any other issues of concern.

@himeshr himeshr moved this from In Code Review to Code Review Ready in Avni Product Dec 13, 2024
@himeshr himeshr moved this from Code Review Ready to QA Ready in Avni Product Dec 13, 2024
@himeshr
Contributor

himeshr commented Dec 13, 2024

@AchalaBelokar AchalaBelokar moved this from QA Ready to In QA in Avni Product Dec 16, 2024
@AchalaBelokar AchalaBelokar moved this from In QA to Done in Avni Product Dec 16, 2024
@himeshr
Contributor

himeshr commented Jan 3, 2025

While debugging the issue reported by Achala, I observed the following in the Prod environment:

  • We are successfully creating a one-time higher-priority trigger for T+5 seconds.
  • We are also creating a repeating normal-priority trigger for T+RepeatIntervalMinutes (120).
  • Quartz has a lot of backed-up triggers that should have run more than 1 hour ago; these get recycled for later under the misfire time limit config (1 hour).
  • Our higher-priority trigger is still stuck in the queue behind jobs with earlier trigger times; its priority matters only if the trigger time is the same or later.

Discussion thoughts:

  • There is no point in optimizing the trigger time of our higher-priority trigger through some logic, as the backed-up triggers, in combination with the misfire threshold, will most likely result in our trigger being discarded.
  • We should rather focus on a way to give priority precedence over fire time, so that our trigger is picked up next.

@himeshr himeshr reopened this Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Done to Triaged in Avni Product Jan 3, 2025
@himeshr himeshr moved this from Triaged to Ready in Avni Product Jan 3, 2025
@himeshr himeshr moved this from Ready to In Progress in Avni Product Jan 6, 2025
@himeshr himeshr assigned himeshr and unassigned 1t5j0y Jan 6, 2025
@himeshr
Contributor

himeshr commented Jan 6, 2025

Notes

Priorities are only compared when triggers have the same fire time. A trigger scheduled to fire at 10:59 will always fire before one scheduled to fire at 11:00. I have therefore configured the first-run trigger with a higher priority and a start time in the past, with the misfire instruction set to fire now.
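
The ordering rule in the note above can be modelled in a few lines. This is a simplified model of the rule only (earlier fire time first, priority breaking ties); it does not reproduce Quartz's misfire handling, and the `PendingTrigger` and `FireOrder` names are hypothetical.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a queued trigger.
final class PendingTrigger {
    final String name;
    final Instant fireTime;
    final int priority; // higher value means more important

    PendingTrigger(String name, Instant fireTime, int priority) {
        this.name = name;
        this.fireTime = fireTime;
        this.priority = priority;
    }
}

final class FireOrder {
    // Earlier fire time wins; priority is consulted only when fire times are equal.
    static final Comparator<PendingTrigger> ORDER = (a, b) -> {
        int byTime = a.fireTime.compareTo(b.fireTime);
        if (byTime != 0) {
            return byTime;
        }
        return Integer.compare(b.priority, a.priority);
    };

    // Returns trigger names in the order they would be picked up.
    static List<String> order(List<PendingTrigger> triggers) {
        List<PendingTrigger> sorted = new ArrayList<>(triggers);
        sorted.sort(ORDER);
        List<String> names = new ArrayList<>();
        for (PendingTrigger t : sorted) {
            names.add(t.name);
        }
        return names;
    }
}
```

Under this model, a high-priority trigger set for T+5 seconds still sorts behind a backlog trigger that was due two hours ago. It moves to the front only if its fire time is no later than the backlog's, which is why the first run was given a start time in the past.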

@mahalakshme and @1t5j0y => Code changes are done in the ETL 10.2 branch, as the issue is easily testable only in the prod environment. The code change is localized and easy to deploy and test in the prod env, and easy to revert if it causes issues.

@himeshr himeshr moved this from In Progress to Code Review Ready in Avni Product Jan 6, 2025
@himeshr
Contributor

himeshr commented Jan 6, 2025

@mahalakshme During dev-testing for this issue, we found that the ETL run for OrgGroups takes almost an hour during working hours. This severely delays ETL runs for the rest of the orgs, and our repeatInterval of 120 minutes in prod is not sufficient to accommodate it.

Therefore, I recommend 2 action items:

  1. Increase the repeatInterval to 4 hours and recreate the triggers with the same repeat interval using the ETL Job APIs.
  2. Optimize the ETL run itself, to speed up ETL for OrgGroups and orgs in general.
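
Whether a given repeat interval is large enough can be estimated by adding up the per-run durations of one pass. The sketch below assumes runs execute strictly one after another; the `RoundBudget` name is hypothetical, and real durations would have to come from measured data.

```java
import java.time.Duration;
import java.util.List;

final class RoundBudget {
    // Total time for one sequential pass over all runs.
    static Duration roundDuration(List<Duration> runDurations) {
        Duration total = Duration.ZERO;
        for (Duration d : runDurations) {
            total = total.plus(d);
        }
        return total;
    }

    // True when one full pass finishes within the repeat interval.
    static boolean fitsWithin(List<Duration> runDurations, Duration repeatInterval) {
        return roundDuration(runDurations).compareTo(repeatInterval) <= 0;
    }
}
```

For example, with purely illustrative figures of one 60-minute OrgGroup run followed by forty 2-minute org runs, the pass takes 140 minutes: it does not fit in a 120-minute interval but does fit in a 4-hour one.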

@1t5j0y
Contributor

1t5j0y commented Jan 6, 2025

@himeshr I couldn't find any related documentation and can't think of an easy way to test this, but will repeated misfires with the 'fire now' misfire instruction cause any issues? Is there a limit to the number of times it will try to fire now for a misfire?

@himeshr
Contributor

himeshr commented Jan 6, 2025

> @himeshr I couldn't find any related documentation and can't think of an easy way to test this, but will repeated misfires with the 'fire now' misfire instruction cause any issues? Is there a limit to the number of times it will try to fire now for a misfire?

"Fire Now" should ideally be used for one-time triggers, which is the case with the FirstRun trigger.

Screenshot 2025-01-06 at 4 35 04 PM

@himeshr
Contributor

himeshr commented Jan 6, 2025

This is a good article for understanding Quartz misfire scenarios and the implications of each choice:
https://nurkiewicz.com/2012/04/quartz-scheduler-misfire-instructions.html

@1t5j0y
Contributor

1t5j0y commented Jan 6, 2025

> "Fire Now" should ideally be used for one-time triggers, which is the case with the FirstRun trigger.

The scenario I am worried about is:

  • ETL is executing for org A.
  • ETL is enabled for org B, and its triggers are created.
  • Org B's 'First Run' trigger comes due but misfires because org A's ETL is still running, and it is 'fired now' again.
  • The above step keeps repeating.

@himeshr
Contributor

himeshr commented Jan 6, 2025

Issues raised regarding ETL runs

  • ETL does not run immediately after disabling/enabling analytics for an org
  • ETL in general takes a long time to reflect changes (days rather than hours)
  • Org Group ETL runs take hours to complete
  • Any other?

Discussion notes, with recommendations on how to resolve the issues raised for ETL run management

  • Improve ETL processing performance (optimize the invocation of queries at least, if not the queries themselves)
  • Reduce the frequency of ETL org runs (increase the interval)
  • Configure the misfire tolerance to a small value (10 mins) rather than the current large one (8 hours)
  • Change the misfire logic for RepeatJobs scheduling to "withMisfireHandlingInstructionNowWithRemainingCount" instead of the default "withMisfireHandlingInstructionNextWithRemainingCount"
  • Make it multi-threaded (2 threads) to get ad hoc ETL requests served immediately
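
To make the effect of the misfire tolerance concrete, the sketch below models the basic comparison: a trigger counts as misfired once it is late by more than the threshold. This is a simplified model, not Quartz's own code, and the `MisfireCheck` name is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

final class MisfireCheck {
    // A trigger is treated as misfired when its lateness exceeds the threshold.
    static boolean isMisfired(Instant scheduledFireTime, Instant now, Duration threshold) {
        Duration lateness = Duration.between(scheduledFireTime, now);
        return lateness.compareTo(threshold) > 0;
    }
}
```

Under this model, a trigger that is 30 minutes late is a misfire with a 10-minute tolerance, so its misfire instruction applies. With an 8-hour tolerance it is not, and it simply stays in the queue as an overdue trigger.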

Note: We would need to retrigger the ETL jobs after the repeat frequency config change, using Postman's Run Collection capability.

@himeshr himeshr moved this from Code Review Ready to In Analysis in Avni Product Jan 6, 2025
himeshr added a commit that referenced this issue Jan 6, 2025
himeshr added a commit that referenced this issue Jan 6, 2025
himeshr added a commit that referenced this issue Jan 6, 2025
himeshr added a commit that referenced this issue Jan 6, 2025
@1t5j0y
Contributor

1t5j0y commented Jan 6, 2025

Job chaining might also be viable for ETL to guarantee execution for all orgs and avoid misfires. Will require us to 'splice' newly scheduled jobs into the chain.
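
A rough sketch of what chaining with splicing could look like is given below, as a plain in-memory queue. The `EtlChain` class is hypothetical, and placing a newly enabled org at the front of the chain is one possible choice, not something decided in this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class EtlChain {
    private final Deque<String> chain = new ArrayDeque<>();

    EtlChain(List<String> orgs) {
        chain.addAll(orgs);
    }

    // Called when the previous org's ETL finishes: returns the next org
    // and re-queues it at the tail so every org runs once per round.
    String next() {
        String org = chain.pollFirst();
        if (org != null) {
            chain.addLast(org);
        }
        return org;
    }

    // Splices a newly enabled org into the chain so that it runs next.
    void spliceNext(String org) {
        chain.addFirst(org);
    }
}
```

Because each job is started by the completion of the previous one instead of by a clock time, there is no fire time to miss, which is the property the comment above is after.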

Labels
None yet
Projects
Status: In Analysis
Development

No branches or pull requests

4 participants