[Core feature] Allow setting of limit for how many executions the catch up system can launch #4538
🤖 Hello @rxraghu, I'm Dosu, a bot here to assist while we wait for a human maintainer. Thank you for raising this issue. It's clear that the current catch-up system can cause problems when there's a backlog of executions due to system downtime.

The catch-up system in Flyte is handled by a component called "CatchupAll-System". It runs at bootup and catches up all schedules to the current time (time.Now()). The scheduler does not run until all schedules have been caught up, and the current design does not snapshot until then either, which can be a drawback if catch-up runs for a long time without being snapshotted. You can find more details in the flyteadmin/scheduler/doc.go file.

The queue of pending executions is managed by the QualityOfServiceAllocator. This allocator determines the Quality of Service (QoS) for an execution, which can be specified at several levels of specificity: in the CreateExecution request, in the LaunchPlan spec, in the Workflow spec, or as an overridable MatchableResource for the underlying workflow. If no QoS is specified at any of these levels, the allocator checks whether an override exists in the matchable attributes resource table; if there is still no QoS defined, it falls back to the default values from the admin application config. The relevant code can be found in the flyteadmin/pkg/manager/impl/executions/quality_of_service.go file.

Given this, it seems feasible to introduce a limit on the number of executions the catch-up system can launch at once. This could be done by modifying CatchupAll-System to check the number of pending executions and cap the number it launches at a configurable limit. The limit could be set based on the available resources to prevent overloading the system.

As for the order of execution (earliest vs. latest), this would require changes to how the queue of pending executions is managed. Currently the order does not appear to be explicitly managed, but this feature could be introduced by modifying the QualityOfServiceAllocator or the CatchupAll-System. I hope this provides some insight into how this feature could be implemented. If you have any further questions or need more clarification, please let me know.
@austin362667, would you like to work on it?
@Future-Outlier I would like to work on it!
#self-assign
@shubhscoder let us know if you need any help
@Future-Outlier The issue seems to have had plenty of time without any progress. May I take it on?
@peterxcli sure, let us know if you have questions. Thanks!
Hi @davidmirror-ops, I have one question about the snapshot logic in flyte/flyteadmin/scheduler/schedule_executor.go (lines 82 to 89 in 553a702): what should flyteadmin do with the executions that exceed the limit?
Should users just restart the scheduler so it can take a newer snapshot of the DB, or could we simply skip the jobs that exceed the limit?
Motivation: Why do you think this is important?
If for some reason Flyte is down and the components cannot communicate with each other to launch executions, executions back up in the pipeline, and when the system comes back online the catch-up system executes all of the pending requests. This can put undue stress on components like the database or the API, because these systems usually have a limit on how many requests they can serve at a time. This caused an incident in our production system: an issue with Kubernetes meant that the executions created by our launch plan were not being run, and when the system became healthy again, all of the executions were launched at once, which brought down the database instance. We had to restart the database to make it drop the excess requests.
Goal: What should the final outcome look like, ideally?
Ideally, the number of executions that the catch-up system can launch would be configurable based on available resources, so that catch-up does not bring other systems down. It would also be great to be able to specify whether executions from the beginning of the queue (earliest) or from the end of the queue (latest) are executed first, but this is a nice-to-have.
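The earliest-vs-latest nice-to-have could amount to truncating the list of missed schedule points from one end or the other before launching. A sketch under assumed names (`selectForCatchup` is hypothetical, not part of flyteadmin):

```go
package main

import "fmt"

// selectForCatchup keeps at most n entries from missed (assumed oldest-first):
// the first n when latest is false (earliest), or the last n when latest is true.
func selectForCatchup(missed []string, n int, latest bool) []string {
	if n >= len(missed) {
		return missed // backlog fits within the cap; keep everything
	}
	if latest {
		return missed[len(missed)-n:]
	}
	return missed[:n]
}

func main() {
	missed := []string{"t1", "t2", "t3", "t4", "t5"}
	fmt.Println(selectForCatchup(missed, 2, false)) // [t1 t2]
	fmt.Println(selectForCatchup(missed, 2, true))  // [t4 t5]
}
```

The "latest" mode matches the incident described above: after a long outage, launching only the most recent missed runs is often more useful than replaying the oldest ones.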
Describe alternatives you've considered
We tried archiving the LP with flytectl while the system was down, but it did not work, i.e., the LP did not stop creating executions. We archived the LP after the system came back online, but by then hundreds of executions had already been created.
Propose: Link/Inline OR Additional context
Discussion on slack - https://flyte-org.slack.com/archives/CP2HDHKE1/p1701084955940149