
[Core feature] Allow setting of limit for how many executions the catch up system can launch #4538

Open · 2 tasks done
rxraghu opened this issue Dec 6, 2023 · 9 comments
Assignees
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


rxraghu commented Dec 6, 2023

Motivation: Why do you think this is important?

If for some reason Flyte is down and the components are not able to communicate with each other to launch executions, executions get backed up in the pipeline; when the system comes back online, the catch-up system executes all the pending requests at once. This can put undue stress on components like the database or the API, because those systems usually have a limit on how many requests they can serve at a time. This caused an incident in our production system: there was an issue with Kubernetes, so the executions created by our launch plan were not being executed, and when the system became healthy again, all of the executions were launched at once, which brought down the database instance. We had to restart the database to make it drop the excess requests.

Goal: What should the final outcome look like, ideally?

Ideally, the number of executions that the catch-up system can launch would be configurable, so that we can set it based on available resources and it does not bring the other systems down. It would also be nice to be able to specify whether the executions at the beginning of the queue (earliest) or at the end of the queue (latest) should be launched first, but this is a nice-to-have.

Describe alternatives you've considered

We tried archiving the LP with flytectl while the system was down, but it did not work, i.e., the LP did not stop creating executions. We archived the LP after the system came back online, but by then hundreds of executions had already been created.

Propose: Link/Inline OR Additional context

Discussion on Slack: https://flyte-org.slack.com/archives/CP2HDHKE1/p1701084955940149

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
rxraghu added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Dec 6, 2023

dosubot bot commented Dec 6, 2023

🤖

Hello @rxraghu,

I'm Dosu, a bot here to assist you while we wait for a human maintainer to join us. I can help you with resolving bugs, answering questions, and guiding you through your journey as a contributor. Let's work together to make your experience with Flyte smoother!

Thank you for raising this issue. It's clear that the current catch-up system can cause problems when there's a backlog of executions due to system downtime.

The catch-up system in Flyte is handled by a component called "CatchupAll-System". This component runs at bootup and catches up all the schedules to the current time (time.Now()). The scheduler does not run until all the schedules have been caught up, and the current design does not take a snapshot until all the schedules are caught up. This can be a drawback if catch-up runs for a long time without a snapshot being taken. You can find more details in the flyteadmin/scheduler/doc.go file.
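
A rough sketch of that bootup flow is shown below; every name here is hypothetical and only illustrates the behavior described above, not the actual flyteadmin identifiers:

// Illustrative sketch only; Schedule, nextRunAfter, and launch are made-up names.
package scheduler

import (
    "context"
    "time"
)

type Schedule struct {
    LastRun      time.Time
    nextRunAfter func(t time.Time) time.Time
    launch       func(ctx context.Context, t time.Time) error
}

// catchupAllSchedules advances every schedule to "now" before the regular
// scheduler loop starts; no snapshot is taken until all schedules are caught up,
// which is the drawback noted above when the backlog is large.
func catchupAllSchedules(ctx context.Context, schedules []*Schedule, now time.Time) error {
    for _, s := range schedules {
        for t := s.nextRunAfter(s.LastRun); !t.After(now); t = s.nextRunAfter(t) {
            if err := s.launch(ctx, t); err != nil {
                return err
            }
            s.LastRun = t
        }
    }
    return nil
}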

As for the queue of pending executions, it's managed by the QualityOfServiceAllocator. This allocator determines the Quality of Service (QoS) for an execution based on various factors. The QoS can be specified at different levels of specificity: at the time of the CreateExecution request, in the LaunchPlan spec, in the Workflow spec, or as an overridable MatchableResource for the underlying workflow. If no QoS is specified at any of these levels, the allocator checks if an override exists in the matchable attributes resource table. If there is still no QoS defined, it uses the default values from the admin application config. The relevant code can be found in the flyteadmin/pkg/manager/impl/executions/quality_of_service.go file.
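
The fallback order described above can be pictured roughly as follows; the type and function names are made up for illustration, not flyteadmin's actual API:

// Illustrative only: the QoS resolution order described above, with hypothetical types.
package qos

type QualityOfService struct{ Tier string }

// resolve returns the first QoS that was explicitly specified, walking from the
// most specific level (the CreateExecution request) down to the admin config default.
func resolve(request, launchPlan, workflow, matchableResource, adminDefault *QualityOfService) *QualityOfService {
    for _, qos := range []*QualityOfService{request, launchPlan, workflow, matchableResource} {
        if qos != nil {
            return qos
        }
    }
    return adminDefault
}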

Given this, it seems feasible to introduce a limit to the number of executions that the catch-up system can launch at once. This could be done by modifying the CatchupAll-System to check the number of pending executions and limit the number of executions it launches based on a configurable limit. This limit could be set based on the available resources to prevent overloading the system.
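
For example, such a cap could look roughly like the sketch below; the config field, error, and function names are assumptions for illustration and do not exist in flyteadmin today:

// Hypothetical sketch of a configurable catch-up cap; none of these identifiers exist today.
package scheduler

import (
    "context"
    "errors"
)

var errCatchUpLimitExceeded = errors.New("catch-up execution limit exceeded")

type CatchUpConfig struct {
    // MaxCatchUpExecutions <= 0 keeps today's behavior (no limit).
    MaxCatchUpExecutions int
}

// launchWithLimit launches backlogged executions until the configured cap is hit,
// leaving the remaining ones unlaunched instead of overwhelming downstream systems.
func launchWithLimit(ctx context.Context, cfg CatchUpConfig, backlog []func(context.Context) error) error {
    for i, launch := range backlog {
        if cfg.MaxCatchUpExecutions > 0 && i >= cfg.MaxCatchUpExecutions {
            return errCatchUpLimitExceeded
        }
        if err := launch(ctx); err != nil {
            return err
        }
    }
    return nil
}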

As for the order of execution (earliest vs latest), this would require changes to the way the queue of pending executions is managed. Currently, it seems that the order of execution is not explicitly managed, but it could be possible to introduce this feature by modifying the QualityOfServiceAllocator or the CatchupAll-System.

I hope this provides some insight into how this feature could be implemented. If you have any further questions or need more clarification, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

eapolinario added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels and removed the untriaged label on Dec 8, 2023
@Future-Outlier (Member)

@austin362667, would you like to work on it?

@shubhscoder

@Future-Outlier I would like to work on it!

@shubhscoder

#self-assign

@davidmirror-ops (Contributor)

@shubhscoder let us know if you need any help

@peterxcli (Contributor)

@Future-Outlier This issue seems to have gone quite a while without any progress. May I take it on?

@davidmirror-ops (Contributor)

@peterxcli sure, let us know if you have questions. Thanks!

@peterxcli (Contributor) commented Oct 30, 2024

Hi @davidmirror-ops,
My current thought is to add a config named MaxCatchUpLimit to WorkflowExecutorConfig, defaulting to infinite. Then, once the FlyteAdmin scheduler starts, check whether the jobs in the job store exceed the configured MaxCatchUpLimit.

But I have one question: what should FlyteAdmin do if GoCronScheduler#CatchupAll returns an ErrExceedCatchUpLimit error?

af := futures.NewAsyncFuture(ctx, func(ctx context.Context) (interface{}, error) {
    return gcronScheduler.CatchupAll(ctx, currTime), nil
})
isCatchupSuccess, err := af.Get(ctx)
if err != nil {
    logger.Errorf(ctx, "failed to get future value for catchup due to %v", err)
    return err
}

Should users just restart the scheduler so it can take a newer snapshot of the DB, or could we simply skip the jobs that exceed the limit?
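
If we went with skipping, it could look roughly like the sketch below. This is only a sketch against the snippet above: it assumes CatchupAll were changed to also return an error, and ErrExceedCatchUpLimit / MaxCatchUpLimit are just the names I proposed here, not existing code.

// Sketch only: handle a hypothetical ErrExceedCatchUpLimit by skipping the excess
// jobs rather than failing scheduler startup. Assumes CatchupAll returns (bool, error)
// and that the standard "errors" package is imported.
af := futures.NewAsyncFuture(ctx, func(ctx context.Context) (interface{}, error) {
    ok, catchupErr := gcronScheduler.CatchupAll(ctx, currTime)
    return ok, catchupErr
})
isCatchupSuccess, err := af.Get(ctx)
if err != nil {
    if errors.Is(err, ErrExceedCatchUpLimit) {
        // Snapshot whatever was caught up and continue; the skipped jobs are dropped.
        logger.Infof(ctx, "catch-up stopped at MaxCatchUpLimit, skipping remaining jobs")
    } else {
        logger.Errorf(ctx, "failed to get future value for catchup due to %v", err)
        return err
    }
}

The other option raised above, failing startup so an operator can restart with a newer snapshot of the DB, would simply keep the existing "return err" path for this error.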

Future-Outlier removed the good first issue (Good for newcomers) label on Nov 11, 2024