how to restore cortex operator normally when too many jobs are requested #2394

nellaG · 2021-10-01T12:55:06Z

Hello.

I'm currently using Cortex 0.40.0.

Occasionally I submit thousands of jobs to a Cortex API by mistake.
When that happens, the cortex CLI becomes barely usable (responses take very long or just hang), and I suspect the Cortex operator is overloaded by those requests.
(The operator-controller-manager pod keeps cycling through OOMKilled -> CrashLoopBackOff.)

To resolve this, I have tried the following so far, but none of it helped:

Deleting the thousands of AWS SQS queues
Deleting all the enqueuer and worker jobs created by mistake
Deleting the affected Cortex API and re-deploying it

In the end I just brought the cluster down and up again (and re-deployed all APIs) to get Cortex working.
If this happens again, how can I restore Cortex without bringing the cluster down and up?

I would appreciate your support. Thank you so much.

miguelvr · 2021-10-04T02:33:38Z

The operator-controller-manager is responsible for cleaning up all the resources, so if it starts failing, recovery requires a lot of manual intervention.

If the operator-controller-manager is getting OOMKilled, the first thing I would try is increasing its memory limits.
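One way to raise the limit is `kubectl set resources`. The snippet below is only a minimal sketch: it assumes the operator runs as a Deployment named `operator-controller-manager` in the `default` namespace (neither is confirmed in this thread), and `1Gi` is a placeholder value, not a recommendation. By default it only prints the command.

```python
import subprocess

# Assumptions (not confirmed in this thread): the operator is a Deployment
# with this name, living in this namespace. Adjust both for your cluster.
DEPLOYMENT = "operator-controller-manager"
NAMESPACE = "default"


def memory_limit_command(memory_limit, deployment=DEPLOYMENT, namespace=NAMESPACE):
    """Build the kubectl argv that sets the memory limit on the deployment."""
    return [
        "kubectl", "-n", namespace,
        "set", "resources", "deployment", deployment,
        f"--limits=memory={memory_limit}",
    ]


def raise_memory_limit(memory_limit, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = memory_limit_command(memory_limit)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "1Gi" is an example value; pick one that fits your nodes.
    raise_memory_limit("1Gi")
```

Changing the pod template triggers a rollout, so the operator pod restarts with the new limit. A manual change like this may be overwritten the next time the cluster is reconfigured.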

If that doesn't work, there are ways to "fix" that weird state, but they still require a lot of manual intervention, or eventually an automated script.

When you create a BatchAPI job, this happens:

A BatchJob Kubernetes resource is created.
The operator-controller-manager creates / updates / deletes the required resources referring to that BatchJob resource.
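To see how many BatchJob resources are waiting on the operator, you could count them from `kubectl get ... -o json` output. This is a rough sketch; it assumes the custom resource is exposed to kubectl under the name `batchjobs`, which you should verify with `kubectl api-resources`.

```python
import json
from collections import Counter

# Assumption: the BatchJob custom resource is listed by kubectl as "batchjobs".
# Run this yourself and pass its stdout to count_by_namespace().
LIST_COMMAND = ["kubectl", "get", "batchjobs", "--all-namespaces", "-o", "json"]


def count_by_namespace(kubectl_json):
    """Count the items of a `kubectl get ... -o json` list, per namespace."""
    items = json.loads(kubectl_json).get("items", [])
    return Counter(item["metadata"].get("namespace", "") for item in items)


# Usage (requires kubectl access to the cluster):
#   import subprocess
#   out = subprocess.run(LIST_COMMAND, check=True, capture_output=True, text=True).stdout
#   print(count_by_namespace(out))
```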

To fix that weird state, you have to:

Delete the created BatchJob resources from the cluster using kubectl delete with the --force flag.
Delete all the created SQS queues, manually or with a script.
Delete any S3 resources that might have been created for that BatchJob resource.
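The first two steps could be scripted roughly as follows. This is a sketch under assumptions, not a tested cleanup tool: the resource name `batchjobs`, the `default` namespace, and the queue-name prefix are all hypothetical and must be checked against your cluster. `sqs` is expected to be a boto3 SQS client (boto3 is a third-party package), and the S3 step is left out because the bucket and key layout are not given here.

```python
def delete_batchjobs_command(namespace="default", resource="batchjobs"):
    """Build the kubectl argv for step 1 (force-delete all BatchJob resources).

    Both default values are assumptions; verify them before running.
    """
    return ["kubectl", "-n", namespace, "delete", resource, "--all", "--force"]


def find_queue_urls(sqs, name_prefix):
    """Return the URLs of all SQS queues whose name starts with name_prefix."""
    urls = []
    pages = sqs.get_paginator("list_queues").paginate(QueueNamePrefix=name_prefix)
    for page in pages:
        # "QueueUrls" is absent from the response when nothing matches.
        urls.extend(page.get("QueueUrls", []))
    return urls


def delete_queues(sqs, name_prefix, dry_run=True):
    """Step 2: delete matching queues; with dry_run=True only print them."""
    urls = find_queue_urls(sqs, name_prefix)
    for url in urls:
        print(("would delete " if dry_run else "deleting ") + url)
        if not dry_run:
            sqs.delete_queue(QueueUrl=url)
    return urls


if __name__ == "__main__":
    print(" ".join(delete_batchjobs_command()))
    # To clean up queues (needs boto3 and AWS credentials):
    #   import boto3
    #   delete_queues(boto3.client("sqs"), "your-queue-prefix", dry_run=True)
```

Run with `dry_run=True` first and check the printed URLs, since a prefix that is too short would also match queues that belong to healthy jobs.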

nellaG added the "question" label (Further information is requested) on Oct 1, 2021
