
Automatic switch to emergency mode when metrics unavailable #424

Open · randytqwjp wants to merge 35 commits into main
Conversation

randytqwjp
Collaborator

What this PR does / why we need it:

This PR adds a check on the HPA status conditions and switches Tortoise to emergency mode whenever the HPA is unable to get resource metrics.

Which issue(s) this PR fixes:

#422

Fixes #

Special notes for your reviewer:

return true
}

if conditions[1].Type == "ScalingActive" && conditions[1].Status == "False" && conditions[1].Reason == "FailedGetResourceMetric" {
Collaborator

FailedGetResourceMetric

What about other failures? e.g., the container resource metrics

Collaborator Author

@sanposhiho should we only switch to emergency mode when main container resource metrics are missing? e.g. if istio-proxy metrics are unavailable for some reason but the main container metrics are still available, do we stay in auto mode since scaling is still active?

Collaborator

I think we should fall back to emergency only if all metrics are dead? Maybe it's too aggressive to do a fallback if some, but not all, metrics are dead.

Collaborator

e.g., let's say a user deploys a new version of the app container which unfortunately contains a bug, and all main containers crash because of it. In this case, switching to emergency mode doesn't help; it just becomes an unnecessary cost increase.

So, ideally we have to detect which issues are solvable by increasing the replicas, and which aren't.
I know it's super difficult, though. But if two containers' metrics are missing (the metrics server is dead, the service is overwhelmed and entirely down, etc.), we can say with more confidence that increasing the replicas might help solve the issue.

Collaborator

@sanposhiho sanposhiho left a comment

You must write a test.

@randytqwjp randytqwjp marked this pull request as draft November 28, 2024 07:44
@randytqwjp
Collaborator Author

@sanposhiho Seems like in the controller tests, HPAs are created without status conditions, and that causes a nil pointer dereference in CheckHpaMetricStatus. Is it possible to add status conditions to the HPAs created in the tests? Or is there another way to solve the nil pointer dereference?

@sanposhiho
Collaborator

You should change your code to handle HPA without status.

@sanposhiho
Collaborator

But you also need to add a controller-level test, in addition to the unit tests.

@randytqwjp
Collaborator Author

randytqwjp commented Dec 4, 2024

You should change your code to handle HPA without status.

@sanposhiho is there a specific reason for that?

@sanposhiho
Collaborator

The same reason that you have the currenthpa.Status.Conditions == nil check. Basically, we expect that the HPA's Status (or even Conditions) won't be nil if everything works correctly, but we program the function defensively.

@randytqwjp
Collaborator Author

@sanposhiho is vpa-monitor.yaml supposed to be missing from this test case? Seems like it's causing one of the controller tests to fail; I'm not sure why it passed in the previous kubebuilder upgrade.
[screenshot of the failing test: Screenshot 2024-12-04 at 17 56 00]

@sanposhiho
Collaborator

sanposhiho commented Dec 4, 2024

You mean, at /before? I don't exactly remember whether it's intentional or I just forgot, but I guess it's intentional because the reconciliation should create the monitor VPA at the initialization phase, no?

scalingActive := r.HpaService.CheckHpaMetricStatus(ctx, hpa)
if scalingActive == false && tortoise.Spec.UpdateMode == autoscalingv1beta3.UpdateModeAuto && tortoise.Status.TortoisePhase == autoscalingv1beta3.TortoisePhaseWorking {
//switch to emergency mode only when auto tortoise and already working
tortoise.Spec.UpdateMode = autoscalingv1beta3.UpdateModeEmergency
Collaborator Author

@sanposhiho somehow this line doesn't change the actual spec that k8sClient.Get retrieves, so in the test, while TortoisePhase changes to emergency, spec.UpdateMode is still auto. If this is the case, I think we need to create a new TortoisePhase for auto scale-up. What do you think?

Collaborator

somehow this line doesn't change the actual spec

because UpdateTortoiseStatus below only updates .status (literally), not .spec.
And, thinking about it, we probably shouldn't directly change the spec. It breaks the boundary between the user's intent and the controller.

So,

If this is the case, i think we need to create a new tortoisephase for auto scale up.

yeah. I think, instead of changing the updatemode on spec, we should change something at .status. TortoisePhase makes sense to me.

Collaborator

Can we just use the existing TortoisePhaseEmergency, not a new one?

Collaborator Author

if we use the existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working, right? We can only use the emergency phase if we disallow emergency from being input by the user. If we want users to be able to manually switch to emergency, we need a new phase.

Collaborator Author

Actually, I am not sure what other scenarios would require emergency mode. If missing metrics is the only scenario, then it would make sense to disallow the manual switch. What do you think?

Collaborator

if we use the existing emergency phase, when tortoise is in auto mode it will reconcile and switch it back to working, right?

No, it shouldn't. That depends on how you implement it. The controller should keep putting TortoisePhaseEmergency as long as the HPA has a missing metric, even if the mode is auto.

Actually, I am not sure what other scenarios would require emergency mode. If missing metrics is the only scenario, then it would make sense to disallow the manual switch. What do you think?

There could be many scenarios right now, e.g.,

  • Mercari is going to have a Black Friday sale, and we want to pre-autoscale the number of replicas. (could be solved by init commit for scheduled scaling #370)
  • A big upstream microservice is down, and the downstream microservice is getting much less traffic than usual. The dynamic minReplicas feature keeps the replica number to some extent, but we want to make it big enough to prepare for the big upstream microservice coming back fully.
  • We're getting more requests than usual, unexpectedly (DDoS, etc.), and the platform needs to scale up all the services.

I agree that we should eventually enhance tortoise so that users don't have to use emergency mode at all, but for now, we should keep it as a last resort.

@@ -256,7 +256,8 @@ const (
// TortoisePhaseBackToNormal means tortoise was in the emergency mode, and now it's coming back to the normal operation.
// During TortoisePhaseBackToNormal, the number of replicas of workloads are gradually reduced to the usual value.
// - Emergency → BackToNormal
TortoisePhaseBackToNormal TortoisePhase = "BackToNormal"
TortoisePhaseAutoEmergency TortoisePhase = "AutoEmergency"
Collaborator

Again, I do not agree adding a new phase. It'll increase our maintenance effort.

Collaborator Author

I will look at it again later; I didn't see your comment earlier.

Collaborator Author

@sanposhiho Changed to use only emergency mode. Could you review the new test cases as well? I am also using currentMetric.ContainerResource.Current.Value.IsZero(), which checks for the default value, though I am not sure what is supposed to be returned when metrics are down. I assume it's the default value for the type, but I'm not too familiar with golang.

Collaborator

@sanposhiho sanposhiho Dec 10, 2024

I am not sure what is supposed to be returned when metrics are down.

I guess it's correct to assume the HPA returns zero when something goes wrong with the metric. Or maybe they just keep the previous values (= they don't clean up the list with zero values when the HPA reconciliation fails).
You have to check the HPA controller implementation to be 100% sure.

HPA *v2.HorizontalPodAutoscaler
result bool
}{
{
Collaborator

You also need a positive pattern: the HPA is correctly working, and the function returns true.

@@ -765,3 +766,54 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP

return newHPA
}

func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {
Collaborator

nit

Suggested change
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {
func (c *Service) IsHpaMetricAvailable(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {

Collaborator

Actually, can we do it like this instead? It would eliminate the logic from the controller layer.

func (c *Service) UpdateTortoisePhaseIfHPAIsUnhealthy(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler, tortoise *v1beta1.Tortoise) error {
...

			if condition.Type == "ScalingActive" && condition.Status == "False" && condition.Reason == "FailedGetResourceMetric" {
				//switch to Emergency mode since no metrics
				logger.Info("HPA failed to get resource metrics, switch to emergency mode")
				tortoise.Status.TortoisePhase = v1beta3.TortoisePhaseEmergency
				return nil
			}

}

Then it'd probably make more sense to move it inside the tortoise service, instead of the hpa service.

Collaborator Author

Won't this add extra responsibility to the tortoise service if it has to manage HPA conditions as well? Would it be better to add an UpdateTortoisePhaseIfHPAIsUnhealthy function in the tortoise service but still perform the check in the hpa service? So we only pass the scalingActive bool into the tortoise service instead of the entire HPA. I'm not sure which will be cleaner.

Collaborator

Won't this add extra responsibility to the tortoise service if it has to manage HPA conditions as well?

Hmm, that's understandable. So,

Would it be better to add an UpdateTortoisePhaseIfHPAIsUnhealthy function in the tortoise service but still perform the check in the hpa service? So we only pass the scalingActive bool into the tortoise service instead of the entire HPA.

this direction looks good.

Comment on lines 773 to 791
if currenthpa == nil {
logger.Info("empty HPA passed into status check, ignore")
return true
}

if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) {
logger.Info("HPA empty status, switch to emergency mode")
return false
}

if currenthpa.Status.Conditions == nil {
logger.Info("HPA empty conditions, switch to emergency mode")
return false
}

if currenthpa.Status.CurrentMetrics == nil {
logger.Info("HPA no metrics, switch to emergency mode")
return false
}
Collaborator

Let's add an error to the return value and return it in those cases.

Collaborator

and, when there's an error, we should abort the reconciliation, because the HPA object doesn't look valid at all.

Collaborator Author

This will cause an issue with the controller test that initializes the HPA, because in the test environment, creating a new HPA without manually updating the status gives an empty status. But we cannot force a status in the controller itself because that might affect an actual HPA in a real environment.

Collaborator

Right, it looks like it's expected that if the HPA was just created, Status etc. could be empty (since the HPA controller might not have noticed the new HPA yet).
Can we early-return (or not run this function) when the tortoise's phase is Initializing?

Collaborator Author

Yes, I handled it so it only runs when the tortoise is working. But there is a test case where the tortoise is working with no HPA, and the tortoise reconcile should create one. This will always fail, and I'm not sure of a workaround for this specific test case.

Collaborator Author

Yes, but because of that it will cause an error, since it generates an HPA with an empty status while TortoisePhase is still working. A rather hackish way around it: if a new HPA is created from that function, it returns a bool so we skip the HPA status check on that reconcile loop and check it on future reconcile loops.

Collaborator

Hmmm, that's complicated.
Okay, I now agree that we don't return an error from this function even when the HPA has an empty condition etc.; we just ignore it and do nothing in those cases.

Collaborator Author

Regarding the missing-metrics concern as well, would skipping the check if there's a change in policy be a good workaround? As long as the metrics are populated by the next reconcile loop, it should work?

Collaborator

@sanposhiho sanposhiho Dec 11, 2024

There are two cases coming from UpdateHPASpecFromTortoiseAutoscalingPolicy, IIUC:

  1. A new HPA is created: in this case, the status is empty and hence we skip. So, no issue.
  2. A new metric is added: in this case, the status doesn't have anything about the new metric added to the HPA spec, but this function doesn't check for missing metrics by comparing spec and status, so no issue.

So, I don't see any issues. WDYT?

Collaborator Author

I don't see issues either, just that it may not be the cleanest way to code this. But I can't think of anything better at the moment.

@randytqwjp
Collaborator Author

@sanposhiho Ready for review

@randytqwjp randytqwjp marked this pull request as ready for review December 17, 2024 13:50
@@ -497,32 +498,33 @@ func (c *Service) UpdateHPASpecFromTortoiseAutoscalingPolicy(
givenHPA *v2.HorizontalPodAutoscaler,
replicaNum int32,
now time.Time,
) (*autoscalingv1beta3.Tortoise, error) {
) (*autoscalingv1beta3.Tortoise, bool, error) {
Collaborator

What I was suggesting at the end of the thread is a bit different.

Okay, I now agree that we don't return an error from this function even when the HPA has an empty condition etc.; we just ignore it and do nothing in those cases.
#424 (comment)

So, we don't need hpaCreated, and we can just ignore (= return true) when the HPA has an empty condition in CheckHpaMetricStatus. (I meant CheckHpaMetricStatus doesn't need an error in its return value anymore.)

@@ -765,3 +768,51 @@ func (c *Service) excludeExternalMetric(ctx context.Context, hpa *v2.HorizontalP

return newHPA
}

func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) {
Collaborator

So, I don't think we have to return an error anymore.

Suggested change
func (c *Service) CheckHpaMetricStatus(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) (bool, error) {
// IsHPAScalingActive returns whether HPA is correctly working or not.
func (c *Service) IsHPAScalingActive(ctx context.Context, currenthpa *v2.HorizontalPodAutoscaler) bool {

Comment on lines 775 to 790
if currenthpa == nil {
logger.Info("empty HPA passed into status check, ignore")
return true, nil
}

if reflect.DeepEqual(currenthpa.Status, v2.HorizontalPodAutoscalerStatus{}) {
return false, fmt.Errorf("HPA empty status, switch to emergency mode")
}

if currenthpa.Status.Conditions == nil {
return false, fmt.Errorf("HPA empty conditions, switch to emergency mode")
}

if currenthpa.Status.CurrentMetrics == nil {
return false, fmt.Errorf("HPA no metrics, switch to emergency mode")
}
Collaborator

And let's return true in all those cases, based on the assumption that the HPA was just created and the HPA controller hasn't handled it yet.

Collaborator

Actually, this might be a general issue: maybe the HPA hasn't been reconciled yet when UpdateHPASpecFromTortoiseAutoscalingPolicy has just created it.
In such cases, the Tortoise controller cannot (or shouldn't) calculate the recommendation.

Can we verify those empty fields at GetHPAOnTortoise, and if they're empty, abort the reconciliation? Then, at the next reconciliation, the HPA status should be filled by the HPA controller, and we can reconcile.

func (c *Service) GetHPAOnTortoise(ctx context.Context, tortoise *autoscalingv1beta3.Tortoise) (*v2.HorizontalPodAutoscaler, bool, error) {
...
	hpa := &v2.HorizontalPodAutoscaler{}
	if err := c.c.Get(ctx, types.NamespacedName{Namespace: tortoise.Namespace, Name: tortoise.Status.Targets.HorizontalPodAutoscaler}, hpa); err != nil {
		return nil, false, fmt.Errorf("failed to get hpa on tortoise: %w", err)
	}

+	if hpa.Status == nil || hpa.Status.Conditions == nil || .... {
+		// Most likely, HPA is just created and not yet handled by HPA controller.
+		return nil, false, nil
+	}

...
}
	hpa, isReady, err = r.HpaService.GetHPAOnTortoise(ctx, tortoise)
	if err != nil {
		logger.Error(err, "failed to get HPA", "tortoise", req.NamespacedName)
		return ctrl.Result{}, err
	}
+	
+	if !isReady {
+		// HPA is correctly fetched, but looks like not ready yet. We won't be able to calculate things correctly, and hence stop the reconciliation here.
+		return ctrl.Result{}, nil
+	}

Collaborator Author

Just a quick question @sanposhiho: how should we handle the test case where the HPA is created? In the current test case, the tortoise conditions are updated to the correct recommendation after one reconcile loop. Now it should just be the same tortoise, since the reconcile was aborted and the tests only cover one reconcile loop? Would this be lacking from a testing perspective?

Collaborator

Here, we change the expectation of the reconciliation loop that creates the HPA: it stops the reconciliation at that point (if the HPA hasn't yet been updated by the HPA controller), and we expect the next reconciliation to handle things appropriately.

So, we can just change the test case: change the test case's tortoise expectation not to get any update around the recommendation.

@randytqwjp
Collaborator Author

@sanposhiho Fixed the above comments. Could you look through the create-HPA test case too, in case I missed something?
