[ETL] Issues with monitoring/testing #110

mahalakshme · 2024-09-30T08:19:37Z

Issue:

The ETL for rwbngos2023 completed in about a minute, but the database shows it as having taken 15 minutes, which gives a misleading picture.

As the image below also shows, the start times of some jobs are earlier than the end times of other jobs. So either the start time of the next job or the end time of the previous job is being recorded incorrectly. This makes the ETL jobs hard to monitor.

AC:

started_at needs to be the time when the ETL process started for that org, and ended_at needs to be the time the ETL job finished running.
Extend this report to send a notification mail (to Maha alone) in the cases below:
- When the ETL fails for an organisation with 'Organisation Category' production or UAT and 'Organisation Status' Live.
- When one round of ETL takes more than 1.5 hours to complete.
Add a way to trigger the ETL of an org immediately, overriding the other orgs in the queue. Currently, testing ETL stories is difficult and time-consuming, since even after disabling and re-enabling the ETL it usually takes around 20-30 minutes to trigger for that organisation. This should work when a 'Trigger' button is clicked on the org page of the super admin. The usual enable/disable need not trigger immediately. This is to prevent accidental disabling and to keep the audit unchanged.

Technical analysis/suggestions:

Recording the times in the appropriate callback methods (job listener events) should help fix the start and end times.
To trigger immediately: currently we trigger with a scheduler, so the trigger might not take effect right away even when we specify a start time. Try triggering it once without a scheduler, as below, and then trigger it with a scheduler.
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("triggerName", "triggerGroup")
        .startNow()
        .build();
One way to check whether all ETL jobs are triggered within one and a half hours is to cross-check entries in the scheduled_job_run table against qrtz_job_details.
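
The cross-check in the last point can be sketched as a plain set comparison, independent of the database. This is only an illustration: `RoundCheck` and `jobsMissingFromRound` are hypothetical names, and it assumes that job names can be read from qrtz_job_details and that the latest end time per job can be read from scheduled_job_run. The real column names may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of avni-etl.
final class RoundCheck {
    private RoundCheck() {}

    /**
     * Returns the scheduled job names that have no completed run ending inside
     * the window [now - window, now]. An empty result means every scheduled
     * job finished at least once within the window.
     *
     * @param scheduledJobs job names known to the scheduler (assumed to come from qrtz_job_details)
     * @param lastEndedAt   latest end time per job name (assumed to come from scheduled_job_run)
     */
    static Set<String> jobsMissingFromRound(Set<String> scheduledJobs,
                                            Map<String, Instant> lastEndedAt,
                                            Instant now,
                                            Duration window) {
        Instant cutoff = now.minus(window);
        Set<String> missing = new TreeSet<>();
        for (String job : scheduledJobs) {
            Instant ended = lastEndedAt.get(job);
            // A job is missing if it never finished, or last finished before the cutoff.
            if (ended == null || ended.isBefore(cutoff)) {
                missing.add(job);
            }
        }
        return missing;
    }
}
```

Calling this with a 90-minute window and alerting when the result is non-empty would correspond to the "one round takes more than 1.5 hours" condition in the AC.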

Ignore:

What:

ETL failures
time taken for an ETL if it exceeds 15 mins
one run completes in 1:30 hours
disabled for unneeded orgs

Who:

can setup report - get alert - Maha to look into it
generate bundle from UAT - store ETL status in the bundle
check with implementation team


himeshr · 2024-12-02T06:20:59Z

Joy's Comment

The ACs look like they will be costly to implement, since we are leveraging Spring Batch here and the queuing and the tables are managed by it.

For AC1 (monitoring), we can rely on logs as the source of truth and ignore the DB.
For AC2, we don't really have a concept of a 'round' of ETL, so this might again be costly or complicated to determine. Doing it per org should be easier.
For AC3, would it be sufficient to have an endpoint that disables ETL for all orgs, so we can focus on the org we want to test? Implementing priority within the queue is again going to be costly.

himeshr · 2024-12-02T06:33:13Z

Himesh's Comment

In general, I agree with the issues that we aim to resolve here, but I differ on the approach to resolving them.

For AC1, I would recommend introducing an additional ETL-JOB-AUDIT table with info like ORGANISATION_UUID, ETL_TRIGGER_TYPE (org/orgGroup), ETL_START_TIME, ETL_END_TIME, ETL_JOB_STATUS, ETL_JOB_RUNTIME.
For AC2, create a Metabase alert on the ETL-JOB-AUDIT table as per the requirement.
For AC3, introduce an adhoc ETL job that runs in parallel to the Quartz-scheduled ETL jobs. It would run only once, scheduled immediately or within a day based on the queue of adhoc triggers, and would not change the Quartz-based periodic execution of ETL (a precedent for this exists in Avni-Integration-service).
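
As a rough illustration of the AC1 proposal, one row of the suggested ETL-JOB-AUDIT table could be modelled as below. The field names mirror the columns listed above. The class itself, the `JobStatus` values, and the choice to derive the runtime from start and end times rather than store it are assumptions made for this sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Objects;
import java.util.UUID;

// Hypothetical in-memory shape of one ETL-JOB-AUDIT row.
final class EtlJobAudit {
    enum TriggerType { ORG, ORG_GROUP }

    // Status values are assumed; the proposal does not enumerate them.
    enum JobStatus { RUNNING, SUCCESS, FAILURE }

    final UUID organisationUuid;
    final TriggerType etlTriggerType;
    final Instant etlStartTime;
    final Instant etlEndTime;   // null while the job is still running
    final JobStatus etlJobStatus;

    EtlJobAudit(UUID organisationUuid, TriggerType etlTriggerType,
                Instant etlStartTime, Instant etlEndTime, JobStatus etlJobStatus) {
        this.organisationUuid = Objects.requireNonNull(organisationUuid);
        this.etlTriggerType = Objects.requireNonNull(etlTriggerType);
        this.etlStartTime = Objects.requireNonNull(etlStartTime);
        if (etlEndTime != null && etlEndTime.isBefore(etlStartTime)) {
            throw new IllegalArgumentException("etlEndTime must not be before etlStartTime");
        }
        this.etlEndTime = etlEndTime;
        this.etlJobStatus = Objects.requireNonNull(etlJobStatus);
    }

    /** ETL_JOB_RUNTIME derived from start and end, or null while the job is still running. */
    Duration etlJobRuntime() {
        return etlEndTime == null ? null : Duration.between(etlStartTime, etlEndTime);
    }
}
```

Rejecting an end time earlier than the start time at construction would surface the kind of inconsistent timestamps described in the issue as soon as they are written.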

mahalakshme · 2024-12-03T09:37:46Z

Vivek's comment:

We are using Quartz btw, not spring batch for ETL.
I think this line may be the problem:

avni-etl/src/main/java/org/avniproject/etl/domain/quartz/ScheduledJobRun.java

Line 38 in 19591ac

scheduledJobRun.startedAt = trigger.getNextFireTime();

We should perhaps use new Date() here.
We can try to figure out why we are getting this issue, because the database entries are managed by us using the JobListener.
If the job listener is not the best way, we can hook into the actual execution callback that we get and record the times there.
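
The idea of recording wall-clock time inside the execution callbacks can be sketched without the Quartz dependency. A trigger's next fire time describes the schedule, not the moment the job body starts, so the two can differ when jobs queue behind one another. `JobRun` and `JobRunRecorder` below are hypothetical stand-ins, not the real ScheduledJobRun class, and the clock is injected as a `Supplier<Instant>` only so the behaviour can be tested.

```java
import java.time.Instant;
import java.util.function.Supplier;

// Simplified stand-in for one scheduled_job_run row (hypothetical).
final class JobRun {
    final String jobName;
    final Instant startedAt;
    private Instant endedAt;

    JobRun(String jobName, Instant startedAt) {
        this.jobName = jobName;
        this.startedAt = startedAt;
    }

    Instant endedAt() { return endedAt; }

    void finish(Instant at) { this.endedAt = at; }
}

// Hypothetical recorder whose two methods would be called from the
// "job is about to execute" and "job finished" callbacks respectively.
final class JobRunRecorder {
    private final Supplier<Instant> clock;

    JobRunRecorder(Supplier<Instant> clock) {
        this.clock = clock;
    }

    /** Records the current clock time as the start, not the trigger's scheduled time. */
    JobRun onJobStarted(String jobName) {
        return new JobRun(jobName, clock.get());
    }

    /** Records the current clock time as the end of the run. */
    void onJobFinished(JobRun run) {
        run.finish(clock.get());
    }
}
```

In production code the recorder would be created as `new JobRunRecorder(Instant::now)`.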


1t5j0y · 2024-12-13T04:54:18Z

AC1 fixed as per Vivek's input above and seems to work well.
AC3 fixed by adding an additional higher priority trigger for the first run after enabling ETL for an org.

AC2 (metabase report) pending. Moving to code review ready so AC1 and AC3 can be tested.

1t5j0y · 2024-12-13T07:38:13Z

AC2 Metabase Reports:
https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs-production-environment

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled only after this change is promoted, because the start/end times in scheduled_job_run are inaccurate until then.

himeshr · 2024-12-13T09:12:08Z

AC2 Metabase Reports: https://reporting.avniproject.org/question/4840-latest-etl-run-failures-for-live-uat-and-prod-orgs

https://reporting.avniproject.org/question/4841-etl-round-completed-in-90-minutes

Alerts can be enabled only after this change is promoted, because the start/end times in scheduled_job_run are inaccurate until then.

Made slight additions to the first report to filter by the "SyncJobs" job_group and to show OrgCategory and OrgStatus values in a readable format.

Code review didn't raise any other issues of concern.

himeshr · 2024-12-13T09:48:39Z

Additionally, create the following reports for QA and others to determine ETL job status:

himeshr · 2025-01-03T06:21:11Z

On debugging the issue reported by Achala, we observed the following in the Prod environment:

We are successfully creating a one-time higher-priority trigger for T+5 seconds.
We are also creating a repeating normal-priority trigger for T+RepeatIntervalMinutes (120).
Quartz has a lot of triggers backed up that should have run more than 1 hour ago; these get rescheduled for later as per the misfire time limit config (1 hour).
Our higher-priority trigger is still stuck in the queue behind jobs with earlier trigger times, since its priority matters only if the trigger time is the same or later.

Discussion thoughts:

There is no point in optimizing the trigger time of our higher-priority trigger through some logic, as the backed-up triggers, in combination with the misfire threshold, will most likely result in our trigger being discarded.
We should rather focus on a way to give priority precedence over time, in order to get our trigger picked up next.

himeshr · 2025-01-06T07:33:19Z

Notes

Priorities are only compared when triggers have the same fire time. A trigger scheduled to fire at 10:59 will always fire before one scheduled to fire at 11:00. Therefore, the first-run trigger is now configured with a higher priority and a start time in the past, with the misfire instruction set to fire now.

@mahalakshme and @1t5j0y => Code changes were made in the ETL 10.2 branch, since the issue is easily testable only in the prod environment and the change is localized, easy to deploy and test in prod, and easy to revert if it causes issues.
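
The ordering rule described in this note can be illustrated with a small model: sort by fire time first, and use priority (higher first) only to break ties. `ModelTrigger` is a hypothetical simplification with just the two fields that matter here, not Quartz's own trigger type, and the priority values 5 and 10 in the usage below are illustrative.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of a trigger (hypothetical): only fire time and priority.
final class ModelTrigger {
    final String name;
    final Instant fireTime;
    final int priority;

    ModelTrigger(String name, Instant fireTime, int priority) {
        this.name = name;
        this.fireTime = fireTime;
        this.priority = priority;
    }

    // Earlier fire time wins; priority (higher first) only breaks ties on fire time.
    static final Comparator<ModelTrigger> FIRING_ORDER =
            Comparator.comparing((ModelTrigger t) -> t.fireTime)
                    .thenComparing(Comparator.comparingInt((ModelTrigger t) -> t.priority).reversed());

    /** Returns trigger names in the order this model would fire them. */
    static List<String> firingOrder(List<ModelTrigger> triggers) {
        List<ModelTrigger> sorted = new ArrayList<>(triggers);
        sorted.sort(FIRING_ORDER);
        List<String> names = new ArrayList<>();
        for (ModelTrigger t : sorted) {
            names.add(t.name);
        }
        return names;
    }
}
```

With two backed-up repeat triggers at 10:00 and 10:30 (priority 5), a first-run trigger at 11:00:05 with priority 10 still sorts last. Giving it a fire time of 09:59 moves it to the front, which is the effect the start-time-in-the-past change aims for.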

himeshr · 2025-01-06T08:52:31Z

@mahalakshme During dev-testing for this issue, we found that the ETL run for OrgGroups takes almost an hour during working hours. This severely delays ETL runs for the rest of the orgs, and our repeatInterval of 120 minutes in prod is not sufficient to accommodate this.

We therefore recommend 2 action items:

Increase the repeatInterval to 4 hours and recreate the triggers with the same repeat interval using the ETL Job APIs.
Optimize the ETL run itself, to speed up ETL for OrgGroups and for orgs in general.

1t5j0y · 2025-01-06T10:06:37Z

@himeshr I couldn't find any related documentation and can't think of an easy way to test this, but will repeated misfires with the "fire now" misfire instruction cause any issues? Is there a limit to the number of times it will try to fire now after a misfire?

himeshr · 2025-01-06T11:04:22Z

> @himeshr I couldn't find any related documentation and can't think of an easy way to test this, but will repeated misfires with the "fire now" misfire instruction cause any issues? Is there a limit to the number of times it will try to fire now after a misfire?

"Fire Now" should ideally be used for one-time triggers, which is the case with the FirstRun trigger.

himeshr · 2025-01-06T11:13:17Z

This is a good article for understanding Quartz misfire scenarios and the implications of each choice:
https://nurkiewicz.com/2012/04/quartz-scheduler-misfire-instructions.html

1t5j0y · 2025-01-06T11:13:35Z

> "Fire Now" should ideally be used for one-time triggers, which is the case with the FirstRun trigger.

The scenario I am worried about is:

ETL is executing for org A.
ETL is enabled for org B and its triggers are created.
The org B 'First Run' trigger executes and misfires because the org A ETL is still running, so it is 'fired now' again.
The above step keeps repeating.

himeshr · 2025-01-06T11:36:16Z

Issues raised regarding ETL runs

ETL does not run immediately after disabling and re-enabling analytics for an org
ETL in general takes a long time to reflect changes (days rather than hours)
Org Group ETL runs take hours to complete
Any other?

Discussion notes with recommendations on how to resolve the issues raised for ETL run management

Improve ETL processing performance (optimize the invocation of queries at least, if not the queries themselves)
Reduce the frequency of ETL org runs (increase the interval)
Configure the misfire tolerance to be a small value (10 mins) rather than the large one now (8 hours)
Change the misfire logic for RepeatJobs scheduling to "withMisfireHandlingInstructionNowWithRemainingCount" instead of the default "withMisfireHandlingInstructionNextWithRemainingCount"
Make it multi-threaded (2 threads) so that adhoc ETL requests are served immediately

Note: We would need to retrigger the ETL jobs after the repeat frequency config change, using the Postman Run Collection capability.
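
To make the misfire-related recommendations concrete, here is a deliberately simplified model of how a misfire threshold and the two policies interact. It is not Quartz code: `MisfireModel`, its `Policy` enum, and the rule "a trigger counts as misfired once it is later than the threshold" are assumptions that only approximate the behaviour the recommendations refer to.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical, simplified model of misfire handling for a repeating trigger.
final class MisfireModel {
    enum Policy { FIRE_NOW, SKIP_TO_NEXT }

    private MisfireModel() {}

    /** A trigger is treated as misfired once it is late by more than the threshold. */
    static boolean isMisfired(Instant scheduled, Instant now, Duration threshold) {
        return Duration.between(scheduled, now).compareTo(threshold) > 0;
    }

    /**
     * Returns the fire time this model assigns to a trigger that is being looked at
     * at time 'now'. Within the threshold the original scheduled time is kept.
     * Beyond it, FIRE_NOW moves the trigger to 'now', while SKIP_TO_NEXT drops the
     * missed runs and moves it to the first scheduled slot strictly after 'now'.
     */
    static Instant effectiveFireTime(Instant scheduled, Duration repeatInterval,
                                     Instant now, Duration threshold, Policy policy) {
        if (repeatInterval.isZero() || repeatInterval.isNegative()) {
            throw new IllegalArgumentException("repeatInterval must be positive");
        }
        if (!isMisfired(scheduled, now, threshold)) {
            return scheduled;
        }
        if (policy == Policy.FIRE_NOW) {
            return now;
        }
        long elapsedSlots = Duration.between(scheduled, now).toMillis() / repeatInterval.toMillis();
        return scheduled.plus(repeatInterval.multipliedBy(elapsedSlots + 1));
    }
}
```

For a trigger scheduled at 08:00 with a 120-minute interval and examined at 09:00, a 10-minute threshold marks it as misfired: FIRE_NOW yields 09:00 and SKIP_TO_NEXT yields 10:00. With an 8-hour threshold it is not considered misfired and keeps its 08:00 fire time.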


1t5j0y · 2025-01-06T12:03:15Z

Job chaining might also be viable for ETL, to guarantee execution for all orgs and avoid misfires. It will require us to 'splice' newly scheduled jobs into the chain.

https://blog.harveydelaney.com/implementing-job-chaining-in-quartz-net/
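
A minimal sketch of the chaining idea, assuming a single in-memory chain where each org's ETL starts only after the previous one finishes. `EtlChain`, `append`, and `spliceNext` are hypothetical names. In a real scheduler the chain would be driven by job-completion callbacks rather than a loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of a job chain: orgs run strictly one after another.
final class EtlChain {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Adds an org to the end of the chain. */
    void append(String org) {
        pending.addLast(org);
    }

    /** Splices an org in so that it runs right after the currently running job. */
    void spliceNext(String org) {
        pending.remove(org); // avoid running it twice if it was already queued
        pending.addFirst(org);
    }

    /**
     * Runs every pending org in chain order and returns the order actually executed.
     * The job callback may call spliceNext or append while the chain is running.
     */
    List<String> runAll(Consumer<String> job) {
        List<String> executed = new ArrayList<>();
        String org;
        while ((org = pending.pollFirst()) != null) {
            job.accept(org);
            executed.add(org);
        }
        return executed;
    }
}
```

Because no org has its own fire time in this model, nothing can misfire. The cost is that one slow org delays every org behind it in the chain.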

mahalakshme added this to Avni Product Sep 30, 2024

mahalakshme converted this from a draft issue Sep 30, 2024

mahalakshme mentioned this issue Nov 4, 2024

ETL issues #109

Closed

mahalakshme moved this from In Analysis to In Analysis Review in Avni Product Nov 5, 2024

mahalakshme changed the title ~~ETL taking long time issue~~ [ETL] Issues with monitoring Nov 5, 2024

mahalakshme moved this from In Analysis Review to Ready in Avni Product Nov 29, 2024

mahalakshme changed the title ~~[ETL] Issues with monitoring~~ [ETL] Issues with monitoring/testing Nov 29, 2024

mahalakshme moved this from Ready to In Analysis in Avni Product Dec 2, 2024

mahalakshme moved this from In Analysis to Ready in Avni Product Dec 4, 2024

1t5j0y moved this from Ready to QA Failed in Avni Product Dec 10, 2024

1t5j0y moved this from QA Failed to In Progress in Avni Product Dec 10, 2024

1t5j0y self-assigned this Dec 10, 2024

1t5j0y added a commit that referenced this issue Dec 13, 2024

#110 | Fix start_date logging in scheduled_job_run to be in line with…

3acfb02

… actual job execution and add a higher priority trigger for first run of ETL Sync job for an org

1t5j0y moved this from In Progress to Code Review Ready in Avni Product Dec 13, 2024

himeshr moved this from Code Review Ready to In Code Review in Avni Product Dec 13, 2024

himeshr moved this from In Code Review to Code Review Ready in Avni Product Dec 13, 2024

himeshr moved this from Code Review Ready to QA Ready in Avni Product Dec 13, 2024

AchalaBelokar moved this from QA Ready to In QA in Avni Product Dec 16, 2024

AchalaBelokar moved this from In QA to Done in Avni Product Dec 16, 2024

vinayvenu closed this as completed Dec 16, 2024

mahalakshme mentioned this issue Dec 30, 2024

Release 10.2.0 avniproject/avni-product#1674

Open

himeshr reopened this Jan 3, 2025

github-project-automation bot moved this from Done to Triaged in Avni Product Jan 3, 2025

himeshr moved this from Triaged to Ready in Avni Product Jan 3, 2025

himeshr moved this from Ready to In Progress in Avni Product Jan 6, 2025

himeshr assigned himeshr and unassigned 1t5j0y Jan 6, 2025

himeshr added a commit that referenced this issue Jan 6, 2025

#110 | Setup First Run to run now on Misfire with start time in past

f50bed2

himeshr moved this from In Progress to Code Review Ready in Avni Product Jan 6, 2025

himeshr added a commit that referenced this issue Jan 6, 2025

#110 | set first run trigger 30 mins in the past

7f783e7

himeshr added a commit that referenced this issue Jan 6, 2025

#110 | set first run trigger startTime to now

4a29cee

himeshr added a commit that referenced this issue Jan 6, 2025

#110 | Keep logs similar for job start end to ease analysis

e70d36e

himeshr moved this from Code Review Ready to In Analysis in Avni Product Jan 6, 2025

himeshr added a commit that referenced this issue Jan 6, 2025

Revert "#110 | Keep logs similar for job start end to ease analysis"

698b195

This reverts commit e70d36e.

himeshr added a commit that referenced this issue Jan 6, 2025

Revert "#110 | set first run trigger startTime to now"

4dbd555

This reverts commit 4a29cee.

himeshr added a commit that referenced this issue Jan 6, 2025

Revert "#110 | set first run trigger 30 mins in the past"

29f3596

This reverts commit 7f783e7.

himeshr added a commit that referenced this issue Jan 6, 2025

Revert "#110 | Setup First Run to run now on Misfire with start time …

b0a73a2

…in past" This reverts commit f50bed2.
