
[Core feature] Allow setting of limit for how many executions the catch up system can launch #4538

Open · 2 tasks done
rxraghu opened this issue Dec 6, 2023 · 9 comments
Assignees
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments


rxraghu commented Dec 6, 2023

Motivation: Why do you think this is important?

If for some reason Flyte is down and the components are not able to communicate with each other to launch executions, executions get backed up in the pipeline; when the system comes back online, the catch-up system executes all the pending requests at once. This can put undue stress on components like the database or the API, because those systems usually have a limit on how many requests they can serve at a time. This caused an incident in our production system: there was an issue with Kubernetes, so the executions created by our launch plan were not being executed, and when the system became healthy again, all of the executions were launched at once, which brought down the database instance. We had to restart the database to make it drop the excess requests.

Goal: What should the final outcome look like, ideally?

Ideally, the number of executions that the catch-up system can launch would be configurable, so that we can set it based on available resources and it does not bring the other systems down. It would also be nice to be able to specify whether the executions at the beginning of the queue (earliest) or at the end of the queue (latest) should be launched first, but this is a nice-to-have.

Describe alternatives you've considered

We tried archiving the LP with flytectl while the system was down, but it did not work, i.e., the LP did not stop creating executions. We archived the LP after the system came back online, but by then hundreds of executions had already been created.

Propose: Link/Inline OR Additional context

Discussion on Slack: https://flyte-org.slack.com/archives/CP2HDHKE1/p1701084955940149

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
rxraghu added the enhancement (New feature or request) and untriaged (This issue has not yet been looked at by the Maintainers) labels on Dec 6, 2023

dosubot bot commented Dec 6, 2023

🤖

Hello @rxraghu,

I'm Dosu, a bot here to assist you while we wait for a human maintainer to join us. I can help you with resolving bugs, answering questions, and guiding you through your journey as a contributor. Let's work together to make your experience with Flyte smoother!

Thank you for raising this issue. It's clear that the current catch-up system can cause problems when there's a backlog of executions due to system downtime.

The catch-up system in Flyte is handled by a component called "CatchupAll-System". This component runs at bootup and catches up all the schedules to the current time (time.Now()). The scheduler does not run until all the schedules have been caught up, and the current design does not take a snapshot until all the schedules are caught up. This can be a drawback if catch-up runs for a long time without a snapshot being taken. You can find more details in the flyteadmin/scheduler/doc.go file.
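
A rough sketch of that bootup flow is shown below; every name here is hypothetical and only illustrates the behavior described above, not the actual flyteadmin identifiers:

// Illustrative sketch only; Schedule, nextRunAfter, and launch are made-up names.
package scheduler

import (
    "context"
    "time"
)

type Schedule struct {
    LastRun      time.Time
    nextRunAfter func(t time.Time) time.Time
    launch       func(ctx context.Context, t time.Time) error
}

// catchupAllSchedules advances every schedule to "now" before the regular
// scheduler loop starts; no snapshot is taken until all schedules are caught up,
// which is the drawback noted above when the backlog is large.
func catchupAllSchedules(ctx context.Context, schedules []*Schedule, now time.Time) error {
    for _, s := range schedules {
        for t := s.nextRunAfter(s.LastRun); !t.After(now); t = s.nextRunAfter(t) {
            if err := s.launch(ctx, t); err != nil {
                return err
            }
            s.LastRun = t
        }
    }
    return nil
}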

As for the queue of pending executions, it's managed by the QualityOfServiceAllocator. This allocator determines the Quality of Service (QoS) for an execution based on various factors. The QoS can be specified at different levels of specificity: at the time of the CreateExecution request, in the LaunchPlan spec, in the Workflow spec, or as an overridable MatchableResource for the underlying workflow. If no QoS is specified at any of these levels, the allocator checks if an override exists in the matchable attributes resource table. If there is still no QoS defined, it uses the default values from the admin application config. The relevant code can be found in the flyteadmin/pkg/manager/impl/executions/quality_of_service.go file.
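
The fallback order described above can be pictured roughly as follows; the type and function names are made up for illustration, not flyteadmin's actual API:

// Illustrative only: the QoS resolution order described above, with hypothetical types.
package qos

type QualityOfService struct{ Tier string }

// resolve returns the first QoS that was explicitly specified, walking from the
// most specific level (the CreateExecution request) down to the admin config default.
func resolve(request, launchPlan, workflow, matchableResource, adminDefault *QualityOfService) *QualityOfService {
    for _, qos := range []*QualityOfService{request, launchPlan, workflow, matchableResource} {
        if qos != nil {
            return qos
        }
    }
    return adminDefault
}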

Given this, it seems feasible to introduce a limit to the number of executions that the catch-up system can launch at once. This could be done by modifying the CatchupAll-System to check the number of pending executions and limit the number of executions it launches based on a configurable limit. This limit could be set based on the available resources to prevent overloading the system.
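
For example, such a cap could look roughly like the sketch below; the config field, error, and function names are assumptions for illustration and do not exist in flyteadmin today:

// Hypothetical sketch of a configurable catch-up cap; none of these identifiers exist today.
package scheduler

import (
    "context"
    "errors"
)

var errCatchUpLimitExceeded = errors.New("catch-up execution limit exceeded")

type CatchUpConfig struct {
    // MaxCatchUpExecutions <= 0 keeps today's behavior (no limit).
    MaxCatchUpExecutions int
}

// launchWithLimit launches backlogged executions until the configured cap is hit,
// leaving the remaining ones unlaunched instead of overwhelming downstream systems.
func launchWithLimit(ctx context.Context, cfg CatchUpConfig, backlog []func(context.Context) error) error {
    for i, launch := range backlog {
        if cfg.MaxCatchUpExecutions > 0 && i >= cfg.MaxCatchUpExecutions {
            return errCatchUpLimitExceeded
        }
        if err := launch(ctx); err != nil {
            return err
        }
    }
    return nil
}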

As for the order of execution (earliest vs latest), this would require changes to the way the queue of pending executions is managed. Currently, it seems that the order of execution is not explicitly managed, but it could be possible to introduce this feature by modifying the QualityOfServiceAllocator or the CatchupAll-System.

I hope this provides some insight into how this feature could be implemented. If you have any further questions or need more clarification, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

eapolinario added the good first issue (Good for newcomers) and help wanted (Extra attention is needed) labels and removed the untriaged label on Dec 8, 2023
@Future-Outlier (Member)

@austin362667, would you like to work on it?

@shubhscoder

@Future-Outlier I would like to work on it!

@shubhscoder

#self-assign

@davidmirror-ops (Contributor)

@shubhscoder let us know if you need any help

@peterxcli (Contributor)

@Future-Outlier This issue seems to have gone quite a while without any progress. May I take it on?

@davidmirror-ops (Contributor)

@peterxcli sure, let us know if you have questions. Thanks!

@peterxcli (Contributor) commented Oct 30, 2024

Hi @davidmirror-ops,
My current thought is to add a config named MaxCatchUpLimit to WorkflowExecutorConfig, defaulting to infinite. Then, once the FlyteAdmin scheduler starts, check whether the jobs in the job store exceed the configured MaxCatchUpLimit.

But I have one question: what should FlyteAdmin do if GoCronScheduler#CatchupAll returns an ErrExceedCatchUpLimit error?

af := futures.NewAsyncFuture(ctx, func(ctx context.Context) (interface{}, error) {
    return gcronScheduler.CatchupAll(ctx, currTime), nil
})
isCatchupSuccess, err := af.Get(ctx)
if err != nil {
    logger.Errorf(ctx, "failed to get future value for catchup due to %v", err)
    return err
}

Should users just restart the scheduler so it can take a newer snapshot of the DB, or could we simply skip the jobs that exceed the limit?
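
If we went with skipping, it could look roughly like the sketch below. This is only a sketch against the snippet above: it assumes CatchupAll were changed to also return an error, and ErrExceedCatchUpLimit / MaxCatchUpLimit are just the names I proposed here, not existing code.

// Sketch only: handle a hypothetical ErrExceedCatchUpLimit by skipping the excess
// jobs rather than failing scheduler startup. Assumes CatchupAll returns (bool, error)
// and that the standard "errors" package is imported.
af := futures.NewAsyncFuture(ctx, func(ctx context.Context) (interface{}, error) {
    ok, catchupErr := gcronScheduler.CatchupAll(ctx, currTime)
    return ok, catchupErr
})
isCatchupSuccess, err := af.Get(ctx)
if err != nil {
    if errors.Is(err, ErrExceedCatchUpLimit) {
        // Snapshot whatever was caught up and continue; the skipped jobs are dropped.
        logger.Infof(ctx, "catch-up stopped at MaxCatchUpLimit, skipping remaining jobs")
    } else {
        logger.Errorf(ctx, "failed to get future value for catchup due to %v", err)
        return err
    }
}

The other option raised above, failing startup so an operator can restart with a newer snapshot of the DB, would simply keep the existing "return err" path for this error.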

Future-Outlier removed the good first issue (Good for newcomers) label on Nov 11, 2024