
KEDA Unable to Retrieve correct Kafka Metrics from ScaledObject on GKE #5730

Open · converge opened this issue Apr 23, 2024 · 14 comments
Labels: bug (Something isn't working)

@converge

Report

KEDA is unable to retrieve metrics correctly from a ScaledObject/ScaleTarget using a Kafka trigger when deployed to a GKE cluster (it works locally).

Expected Behavior

When the HPA calculates the current metric value, it should report a valid Kafka lag, not -9223372036854775808m.

Actual Behavior

When the Kafka ScaledObject is deployed to GKE:

  • The HorizontalPodAutoscaler reports a negative metric value (-9223372036854775808m)

Steps to Reproduce the Problem

  1. Deploy KEDA in a GKE cluster
  2. Create a ScaledObject with a KafkaSource as the scale target
  3. Run kubectl get hpa -A (it will show the metric with the negative value)

Logs from KEDA operator

There are no errors or warnings in the KEDA operator logs.

KEDA Version

2.13.1

Kubernetes Version

1.27

Platform

Google Cloud

Scaler Details

Kafka

Anything else?

No response

@converge added the bug label Apr 23, 2024
@SpiritZhou
Contributor

Could you change the log level to debug in the operator and send the operator logs?

@dttung2905
Contributor

In addition to the logs, could you also provide the ScaledObject YAML config?

@converge
Author

converge commented Jun 5, 2024

Sorry for the delay. I still have this issue and created a new GCP/GKE cluster specifically to debug it. I was able to reproduce the issue and captured the logs from the exact moment the HPA's current metric value switched from <unknown> to -9223372036854775808m.

The logs from the controller when this switch happened:

2024-06-05T19:43:42Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:43:42Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:43:42Z","value":"1"}], "activity": true, "scalerError": null}

2024-06-05T19:43:42Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:43:42Z	INFO	scaleexecutor	Successfully updated ScaleTarget	{"scaledobject.Name": "kafka-scaledobject", "scaledObject.Namespace": "app", "scaleTarget.Name": "kafka-source", "Original Replicas Count": 0, "New Replicas Count": 1}

2024-06-05T19:43:42Z	DEBUG	events	Scaled sources.knative.dev/v1beta1.kafkasource app/kafka-source from 0 to 1	{"type": "Normal", "object": {"kind":"ScaledObject","namespace":"app","name":"kafka-scaledobject","uid":"01537a1d-1543-48ea-adce-e7c3069fff86","apiVersion":"keda.sh/v1alpha1","resourceVersion":"314645"}, "reason": "KEDAScaleTargetActivated"}
2024-06-05T19:44:12Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000a1b400]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:12Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:12Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:12Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:42Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000ae9400]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:42Z	DEBUG	scale_handler	Getting metrics and activity from scaler	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:42Z","value":"1"}], "activity": true, "scalerError": null}
2024-06-05T19:44:42Z	DEBUG	scale_handler	Scaler for scaledObject is active	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "scaler": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics"}
2024-06-05T19:44:52Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000460050]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:44:52Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}

2024-06-05T19:44:52Z	DEBUG	scale_handler	Getting metrics from trigger	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:44:52Z","value":"1"}], "scalerError": null}
2024-06-05T19:44:52Z	DEBUG	grpc_server	Providing metrics	{"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:44:52.805581005 +0000 UTC m=+3545.120570672,WindowSeconds:nil,Value:{{1000 -3} {<nil>}  DecimalSI},},},}"}
2024-06-05T19:45:08Z	DEBUG	kafka_scaler	with topic name [knative-demo-topic] the list of topic metadata is [0xc000550c30]	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z	DEBUG	kafka_scaler	Kafka scaler: Providing metrics based on totalLag 1, topicPartitions 1, threshold 60	{"type": "ScaledObject", "namespace": "app", "name": "kafka-scaledobject"}
2024-06-05T19:45:08Z	DEBUG	scale_handler	Getting metrics from trigger	{"scaledObject.Namespace": "app", "scaledObject.Name": "kafka-scaledobject", "trigger": "kafkaScaler", "metricName": "s0-kafka-knative-group-topics", "metrics": [{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-05T19:45:08Z","value":"1"}], "scalerError": null}
2024-06-05T19:45:08Z	DEBUG	grpc_server	Providing metrics	{"scaledObjectName": "kafka-scaledobject", "scaledObjectNamespace": "app", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-knative-group-topics,MetricLabels:map[string]string{},Timestamp:2024-06-05 19:45:08.16711765 +0000 UTC m=+3560.482107310,WindowSeconds:nil,Value:{{1000 -3} {<nil>}  DecimalSI},},},}"}
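
For reference, the Value in those grpc_server lines is a Kubernetes resource.Quantity dumped with its internal fields: {1000 -3} with format DecimalSI means 1000 × 10^-3, i.e. the lag of 1 that the scaler reported. A minimal sketch of the same representation, using the k8s.io/apimachinery resource package:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Same internal representation as in the log: value 1000 at scale -3 (milli), format DecimalSI.
	q := resource.NewMilliQuantity(1000, resource.DecimalSI)
	fmt.Println(q.String(), q.Value()) // the canonical form of 1000m and its integer value, 1
}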

The ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: app
  annotations:
    scaledobject.keda.sh/transfer-hpa-ownership: "true"
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: notify-friends
  scaleTargetRef:
    apiVersion: sources.knative.dev/v1beta1
    name: kafka-source
    kind: kafkasource
  pollingInterval: 30
  minReplicaCount: 1
  maxReplicaCount: 11
  idleReplicaCount: 0
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: pkc-<...>.gcp.confluent.cloud:9092
        consumerGroup: knative-group
        topic:
        lagThreshold: '60'
        offsetResetPolicy: earliest
        scaleToZeroOnInvalidOffset: "false"
        tls: enable
        sasl: plaintext
      authenticationRef:
        name: keda-kafka-credentials

@converge
Author

converge commented Jun 5, 2024

Found people with a similar issue: https://kubernetes.slack.com/archives/CKZJ36A5D/p1709761505122509

@converge
Author

converge commented Jun 7, 2024

@SpiritZhou @dttung2905 Yesterday I created an EKS cluster with the same setup/versions as on GCP, and it works perfectly. Can you think of anything that could be different between GCP and AWS? Any kind of blocker or anything else that could be causing the issue?

I have also watched the Kubernetes events and couldn't find anything suspicious on GCP.
I wanted to look at the horizontal pod autoscaler logs, but couldn't find them.

@converge
Author

I did some more tests and I believe the Kafka connection is OK: I was able to produce and consume messages inside the pod using the Go Sarama library (the same library the Knative Kafka extension uses).

I created a Debian pod on Kubernetes/GCP/GKE, attached to it, and ran:

./kafka-consumer-groups.sh --bootstrap-server pkc-<...>.<...>.gcp.confluent.cloud:9092 --command-config config.properties --describe --group testing-group

Consumer group 'testing-group' has no active members.

# no messages produced

GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          1               1               0               -               -               -
testing-group   knative-demo-topic 8          1               1               0               -               -               -
testing-group   knative-demo-topic 22         15              15              0               -               -               -
testing-group   knative-demo-topic 29         1               1               0               -               -               -

# no messages consumed
GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          1               30              29              -               -               -
testing-group   knative-demo-topic 8          1               36              35              -               -               -
testing-group   knative-demo-topic 22         15              54              39              -               -               -
testing-group   knative-demo-topic 29         1               38              37              -               -               -

# after messages being consumed
GROUP           TOPIC              PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
testing-group   knative-demo-topic 1          30              30              0               -               -               -
testing-group   knative-demo-topic 8          36              36              0               -               -               -
testing-group   knative-demo-topic 22         54              54              0               -               -               -
testing-group   knative-demo-topic 29         38              38              0               -               -               -
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/app/s0-kafka-knative-group-topics?labelSelector=scaledobject.keda.sh%2Fname%3Dkafka-scaledobject"
{"kind":"ExternalMetricValueList","apiVersion":"external.metrics.k8s.io/v1beta1","metadata":{},"items":[{"metricName":"s0-kafka-knative-group-topics","metricLabels":null,"timestamp":"2024-06-13T12:26:02Z",
---> "value":"1009"}]} <---

kubectl get hpa -n app
NAME                  REFERENCE                                   TARGETS                          MINPODS   MAXPODS      REPLICAS   AGE
notify-friends        kafkasource/kafka-source                    -9223372036854775808m/60 (avg)   1         11           11         3m1s

Another thing I noticed (not sure if it's relevant) is that the metric is not listed here:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq                                                                                                  
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "externalmetrics",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

But I can query it, as shown above.

@zroubalik
Member

You should try to query the specific metric for the ScaledObject; see the examples here: https://keda.sh/docs/2.14/operate/metrics-server/#querying-metrics-exposed-by-keda-metrics-server

@converge
Author

Thanks @zroubalik, I'm able to query the metric, and I can also see the metric value via OpenTelemetry/Datadog. The metric value is correct (the expected value), not the unexpected -9223372036854775808m.

@pstibrany

pstibrany commented Jul 24, 2024

The -9223372036854775808m value is the result of converting +Inf to int64.

This code (run inside a main function, with "fmt" and "math" imported):

	fmt.Println("+Inf", int64(math.Inf(1)))
	fmt.Println("-Inf", int64(math.Inf(-1)))
	fmt.Println("NaN", int64(math.NaN()))

prints

+Inf 9223372036854775807
-Inf -9223372036854775808
NaN 0

on an Apple M1. On the same Apple M1, but when compiling with GOARCH=amd64, it prints:

+Inf 9223372036854775807
-Inf -9223372036854775808
NaN -9223372036854775808

@pstibrany

pstibrany commented Jul 24, 2024

There are places in the HPA controller where such a conversion happens. One case I was investigating is using KEDA to target a custom resource, with target type AverageValue. In that case, the following method is responsible for computing usage:

https://github.com/kubernetes/kubernetes/blob/7f805dc9354b5209136d72359e3c019885135ce7/pkg/controller/podautoscaler/replica_calculator.go#L351C1-L376C2

If statusReplicas (read from the /status subresource of the target resource) is 0, then the computed usage will be wrong (-9223372036854775808m): https://github.com/kubernetes/kubernetes/blob/a145f1508d2180c93ba292849f0bf4bf6120fae8/pkg/controller/podautoscaler/replica_calculator.go#L374 (notice the conversion from float64 to int64 there).
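
A minimal sketch of that failure mode (not the actual controller code; the numbers are only illustrative):

package main

import "fmt"

func main() {
	metricsTotal := 1009.0 // total external metric value reported by KEDA
	statusReplicas := 0.0  // what the HPA read from the target's /status subresource

	// Average value per replica: dividing a positive total by zero replicas
	// yields +Inf (and 0/0 would yield NaN).
	perReplica := metricsTotal / statusReplicas

	// Converting +Inf (or NaN) to int64 produces the sentinel value seen in the
	// HPA status; the exact result is platform-dependent (see the outputs in the
	// previous comment).
	fmt.Println(perReplica, int64(perReplica))
}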

This usage value is then stored in the HPA's metric status: https://github.com/kubernetes/kubernetes/blob/985ef265511370655452157fb5c583d5bf7a45e4/pkg/controller/podautoscaler/horizontal.go#L687-L703

On EKS, once we fixed status.replicas on the target resource, things started working correctly for us. On GKE, we still see the -9223372036854775808m value and don't understand why.

@Zurvarian

I've followed up on this issue.

We've confirmed that the scaler's actual calculation is correct. However, the HPA displays the wrong value when queried.

@pstibrany As you pointed out, the issue seems to be how the usage is calculated. I still have to try using Value instead of AverageValue, since the former doesn't do any calculation with the metric fetched from the Kafka cluster. If the value is then displayed correctly, we will have found the reason.

Looking at the source code of that method, it turns out that even if the number of replicas is 0, the replica calculation is still correct, because in the end statusReplicas is not used to compute the new replica count; it is only used to avoid changing the replica count when the usage ratio stays within the HPA tolerance.
And since the tolerance check effectively computes 1 - (-9223372036854775808), the result is a huge positive number that is always above the tolerance threshold 🤣.
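
A rough sketch of that tolerance check (not the real controller code; 0.1 is the kube-controller-manager default for --horizontal-pod-autoscaler-tolerance, and lagThreshold 60 comes from the ScaledObject above):

package main

import (
	"fmt"
	"math"
)

func main() {
	const tolerance = 0.1 // default HPA tolerance

	usage := float64(math.MinInt64) // the bogus -9223372036854775808 usage
	targetPerPod := 60.0            // lagThreshold from the ScaledObject

	// The HPA only skips scaling when the usage ratio is within tolerance of 1.0;
	// with the bogus usage the ratio is hugely negative, so the check never
	// short-circuits and the (otherwise correct) replica calculation proceeds.
	usageRatio := usage / targetPerPod
	withinTolerance := math.Abs(1.0-usageRatio) <= tolerance
	fmt.Println(usageRatio, withinTolerance) // withinTolerance is false
}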

Knowing this is the root cause granted me peace of mind, thank you!

Now the remaining question is why scale.Status.Replicas is 0 in GCP, but that's another path to investigate.

That's assuming GCP did not deploy its own custom version of the autoscaler that introduces something else that could be interfering.


stale bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale bot added the stale label Nov 7, 2024
@zroubalik removed the stale label Nov 11, 2024
@zroubalik
Member

I think this issue might be worth documenting

@MaxWinterstein

I opened a case for this / my related discussion #6375

Case 55415916: HPA scaler has huge delay when coming from 0 replica deployment

Let's see if this gets us somewhere...
