
[BUG] Operator doesn't manage the metric exporter sidecar #72

Open
eduardchernomaz opened this issue Dec 7, 2022 · 11 comments

@eduardchernomaz

Describe the bug
After the LocustTest custom resource has been applied and the test has run for the specified duration, the run does not complete cleanly: the worker job completes, while the master job continues to run.

To Reproduce
Steps to reproduce the behavior:

  1. Apply the LocustTest manifest to start the test
  2. Once the test has run for the specified duration, list the available jobs and pods
  3. You should see that the worker job and worker pods have completed. However, the master job has not completed and its pod is in a NotReady state.

Expected behavior
Once the test has completed, both the worker and the master pods should be in a Completed state and eventually removed.

Screenshots
Pods status after the test has completed.
[screenshot]

Jobs status after the test has completed.
[screenshot]

Additional context
I suspect that the problem is that on the master node, the locust-metrics-exporter container never stops and continues to run, failing to signal job completion.

@AbdelrhmanHamouda
Owner

Hello @eduardchernomaz,
your assessment of the root cause is on point. Indeed, this behavior is caused by the metrics exporter. This is a known issue that I am aware of and intend to solve.

Problem explanation:
The exporter is a sidecar container whose existence the native Locust image is not aware of. Also, Kubernetes doesn't provide native support for defining sidecar container behavior, e.g. "shut down after container x exits". This is important to understand because even when I solve this for the metrics exporter container, the same issue will occur if your organisation has a cluster configuration that injects other sidecars, e.g. an Istio sidecar.

In any case, until I push the fix, here are 2 simple workarounds to employ in the meantime.

Workaround:

  • Option 1: Use a custom image that calls the recently added /quitquitquit endpoint on the exporter to make it exit. It would be something like this: curl -fsI -XPOST http://localhost:9646/quitquitquit
  • Option 2: At the end of your test, call the /quitquitquit endpoint. In Locust terms, this translates to making the call in the function/method decorated with @events.quitting.add_listener (see the sketch below).
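
A minimal sketch of Option 2, assuming the exporter listens on localhost:9646 (as in the curl above) and the test file uses Locust's Python API:

import requests
from locust import events

@events.quitting.add_listener
def shutdown_metrics_exporter(environment, **kwargs):
    # Ask the exporter sidecar to exit so the master pod can complete.
    # Best-effort: if no exporter is running (e.g. in a local run), swallow
    # the connection error so shutdown still proceeds cleanly.
    try:
        requests.post("http://localhost:9646/quitquitquit", timeout=5)
    except requests.RequestException:
        pass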

@AbdelrhmanHamouda AbdelrhmanHamouda added the Bug Something isn't working label Dec 12, 2022
@AbdelrhmanHamouda AbdelrhmanHamouda changed the title from "[BUG]" to "[BUG] Operator doesn't manage the metric exporter sidecar" Dec 12, 2022
@AbdelrhmanHamouda
Owner

More info on the proposed fix (still under investigation): the idea is to extend the Operator's operation to include container management as a secondary resource of the custom resource. Meaning that after the Operator creates the main cluster resources, it needs to register a reconciler to manage the secondary resources created by k8s itself. Doing so, the Operator can start reacting to events coming from specific containers within specific pods.
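
For illustration only, here is a rough Python sketch of that watch-and-react idea using the kopf operator framework (the operator itself is not built on kopf, and the label and container names below are hypothetical):

import kopf

# Hypothetical label the Operator would stamp on the pods it creates.
MANAGED_LABEL = "locust.io/managed"

@kopf.on.event('', 'v1', 'pods', labels={MANAGED_LABEL: kopf.PRESENT})
def on_pod_event(status, logger, **kwargs):
    # React to container-level state inside managed pods: once the main
    # locust container terminates, the sidecar could be told to quit.
    for cs in status.get('containerStatuses', []):
        if cs.get('name') == 'locust' and cs.get('state', {}).get('terminated'):
            logger.info("locust container exited; trigger sidecar shutdown here")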

From a security perspective, I also need to investigate any additional cluster privileges (if any) that may be required by such a solution.

@eduardchernomaz
Author

I wonder if we can also just add a PreStop container hook to the master deployment which would call the /quitquitquit endpoint.

@AbdelrhmanHamouda
Owner

AbdelrhmanHamouda commented Dec 13, 2022

It is an interesting idea, one that makes a lot of sense. I will dig into it and see what is needed to put this in place.

@AbdelrhmanHamouda
Owner

After some investigation, the PreStop hook won't fit this use case. According to the documentation, it is only invoked when container termination is triggered from outside, not when a container exits gracefully because of its internal process.

I am moving the investigation to assess whether the liveness probe can be used to achieve the desired effect without marking the job as "error".

@AbdelrhmanHamouda
Owner

Possible implementation approach: change the metrics_exporter liveness probe to ping the locust container every 10 seconds and, on failure, send a curl to /quitquitquit.

@AbdelrhmanHamouda AbdelrhmanHamouda added the Duplicate This issue or pull request already exists label Feb 16, 2023
@AbdelrhmanHamouda
Owner

This will be solved with the fix for #50

@AbdelrhmanHamouda AbdelrhmanHamouda self-assigned this Feb 16, 2023
@S-mishina

S-mishina commented Sep 26, 2024

@AbdelrhmanHamouda
Hello
Do you think there’s any chance this issue could be resolved?

I’d like to think about what I can do if there’s no chance of resolving this issue.

@S-mishina

Ideas I'm thinking about now

I’m not using locust-metrics-exporter, so how about adding a flag to LocustTest to disable it, providing an option to let the Job complete successfully for now?

@AbdelrhmanHamouda
Owner

AbdelrhmanHamouda commented Oct 4, 2024

Hello @eduardchernomaz, @S-mishina, thank you for your patience regarding this request. The fact is that I never had the time to get around to this, since it needed some prerequisite code changes (which were thankfully done), and in itself it was never a priority since there are a lot of ways to get around it. What made it a bit worse is that in companies (the most likely place for the operator to be running), there is usually a lot of sidecar injection going on and the exporter becomes just another sidecar. This is what's happening in my company, for example, and in a few others that I'm aware the operator is running in.

That being said, I am starting to get some time back and I believe I will be able to give this request the attention it deserves.

Regarding adding a flag: while it is true that it would be easy, I don't feel comfortable with this approach, since removing it later would be a breaking change, not to mention that when running in a cloud environment, observability is extremely important.

Workaround

However, one workaround that I employ personally is to send a quit request to the exporter. I do this from within the custom hosted image of Locust that we run, but I believe you can perfectly well do it from within the test itself as part of the cleanup stage. I can provide more info on the workaround if you desire.

# Shutdown the metrics exporter container
echo "INFO: Shutting down metrics exporter container."
curl -fsI -XPOST http://localhost:9646/quitquitquit
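
The equivalent call from inside the test's cleanup stage, as a sketch (requests ships as a Locust dependency; the port is assumed from the curl above):

import requests

# Same effect as the curl above; place this in a quitting listener or any
# other cleanup code that runs at the end of the test.
requests.post("http://localhost:9646/quitquitquit", timeout=5)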

@S-mishina

@AbdelrhmanHamouda

Thank you for your response!

Regarding adding a flag, while it is true that it would be easy, I don't feel comfortable with this approach since removing it will be a breaking change, not to mention that when running in a cloud environment, observability is extremely important.

I completely agree with this point! Thank you for the feedback!

However, one workaround that I employ personally is to send a quit request to the exporter. I do this from within the custom hosted image of Locust that we run, but I believe you can perfectly well do it from within the test itself as part of the cleanup stage. I can provide more info on the workaround if you desire.

I would like to use the workaround as well, but is it implemented as a separate monitoring resource to trigger the corresponding curl command?

I can provide more info on the workaround if you desire.

If possible, I would appreciate any additional information you can share!
