
query-frontend performance #716

Closed
josunect opened this issue Dec 14, 2023 · 9 comments

@josunect

Installing Tempo with the operator and the following resources:

  resources:
    total:
      limits:
        memory: 1Gi
        cpu: 2000m
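
For reference, a minimal sketch of where this fragment sits in a full TempoStack CR; the CR name and namespace follow the service names used later in this thread, and the storage secret name is a placeholder:

apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: tempo-cr
  namespace: tempo
spec:
  storage:
    secret:
      name: tempo-storage-s3   # placeholder secret holding the object storage credentials
      type: s3
  resources:
    total:
      limits:
        memory: 1Gi
        cpu: 2000m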

With the following query, the pod is OOMKilled:

curl -G -s http://localhost:3200/api/search --data-urlencode 'q={ .service.name = "productpage.bookinfo"} && { } | select("status", ".service_name", ".node_id", ".component", ".upstream_cluster", ".http.method", ".response_flags")' --data-urlencode 'spss=10' --data-urlencode 'limit=100' --data-urlencode 'start=1701948096' --data-urlencode 'end=1702552896' | jq


The system seems stable when increasing the resources:

  resources:
    total:
      limits:
        memory: 4Gi
        cpu: 8000m

But that seems like a lot for a development environment?

Tested in minikube following https://grafana.com/docs/tempo/latest/setup/operator/.

@andreasgerstmayr
Collaborator

hi @josunect!

Can you give more details, e.g. which command is used to generate the traces, and after what timeframe Tempo runs into the OOM?

I'd like to test this with a basic Tempo setup (https://github.com/grafana/tempo/blob/main/example/docker-compose/s3/docker-compose.yaml) to see if it's an issue of the operator or Tempo itself.

@josunect
Author

Hi @andreasgerstmayr!

We are configuring Istio to send traces to Tempo using Zipkin (https://istio.io/latest/docs/tasks/observability/distributed-tracing/zipkin/), with a sampling rate of 100%, using the following configuration:

--set values.meshConfig.defaultConfig.tracing.zipkin.address=tempo-cr-distributor.tempo:9411
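
For context, a sketch of a full istioctl install invocation built around that flag; the sampling flag here is only an assumption about how a 100% rate could be set, and the Kiali hack script may configure it differently:

istioctl install -y \
  --set values.meshConfig.defaultConfig.tracing.zipkin.address=tempo-cr-distributor.tempo:9411 \
  --set values.meshConfig.defaultConfig.tracing.sampling=100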

There is a Kiali hack script that we use to easily create the environment:

https://github.com/kiali/kiali/tree/master/hack/istio/tempo

We follow these steps (example with minikube): the script installs Tempo and Istio (configured to send the traces) in different namespaces.
Then port-forward the service:

kubectl port-forward svc/tempo-cr-query-frontend 16686:16686 -n tempo

Then run this query every 10 seconds to get traces from the last hour, with a limit of 200:

curl 'http://localhost:16686/api/traces?end=1709135961250000&limit=200&service=productpage.bookinfo&start=1709132361250000'
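
A sketch of that polling loop, assuming GNU date; the Jaeger API start/end parameters are microseconds since the epoch and are recomputed on each iteration:

while true; do
  end=$(($(date +%s) * 1000000))     # now, in microseconds
  start=$((end - 3600 * 1000000))    # one hour ago
  curl -s "http://localhost:16686/api/traces?end=${end}&limit=200&service=productpage.bookinfo&start=${start}" > /dev/null
  sleep 10
done
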
With roughly 30 minutes of traces, after running the query 4 times, the query-frontend was killed by OOM.


I've also seen the issue with the Tempo API.

@andreasgerstmayr
Collaborator

Thank you for the detailed instructions! I'll test that in the coming days.

@andreasgerstmayr
Collaborator

andreasgerstmayr commented Mar 4, 2024

I can reproduce the OOM of the tempo container in the query-frontend pod with the instructions above. The container gets 409 MB of memory allocated (5% [1] of the 8 GB allocated to the TempoStack).

The query fetches the full content of up to 200 traces, resulting in a response of roughly 2 MB. The Jaeger UI/API (wrapped with tempo-query) is not optimized for Tempo: for every search query, it fetches a list of matching trace IDs and then fetches each entire trace in a serial loop [2]. Fetching an entire trace is an expensive operation.

As far as I can see, the intended usage is to run a TraceQL query to find interesting traces (traces with errors, high latency, etc.) and then fetch the entire trace only for those (few) matching traces.

Running the same query with TraceQL should improve performance. Tempo recommends using scoped attributes, i.e. { resource.service.name = "productpage.bookinfo" }:

curl -s -G http://tempo-cr-query-frontend.tempo.svc.cluster.local:3200/api/search --data-urlencode 'q={ resource.service.name = "productpage.bookinfo" }' --data-urlencode start=$(date -d "1 hour ago" +%s) --data-urlencode end=$(date +%s) --data-urlencode limit=200

This query still OOMs on my machine after a while with the resource limits mentioned above.

However, when I increase the resources of the query-frontend pod [3]:

spec:
  template:
    queryFrontend:
      component:
        resources:
          limits:
            cpu: "2"
            memory: 2Gi

I can run the above curl command in an endless loop and the tempo container of the query-frontend pod uses about 60% CPU and 1.1 GiB memory (and does not run out of memory 😃).

Edit: The query via the Jaeger API also works now (with 2 GB of memory), albeit slowly, because the Jaeger API is not optimized for Tempo as described above.
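
A sketch of that load loop, plus a way to watch the container's usage (kubectl top needs metrics-server, e.g. the minikube metrics-server addon); the search query is the one from above:

# hammer the search endpoint in an endless loop (run from inside the cluster,
# or port-forward port 3200 and use localhost instead)
while true; do
  curl -s -G http://tempo-cr-query-frontend.tempo.svc.cluster.local:3200/api/search \
    --data-urlencode 'q={ resource.service.name = "productpage.bookinfo" }' \
    --data-urlencode start=$(date -d "1 hour ago" +%s) \
    --data-urlencode end=$(date +%s) \
    --data-urlencode limit=200 > /dev/null
done

# in a second terminal: per-container CPU/memory of the query-frontend pod
watch kubectl top pod -n tempo --containers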

[1] "query-frontend": {cpu: 0.09, memory: 0.05},

[2] https://github.com/grafana/tempo/blob/v2.3.1/cmd/tempo-query/tempo/plugin.go#L270-L281
[3] This feature is already merged in the main branch, but not in a released version yet.

@josunect
Author

josunect commented Mar 4, 2024


Thanks for all the information, @andreasgerstmayr! That is really useful.

I think option [3] would be ideal, as in the end using TraceQL still has some issues, and it will help to allocate the resources where they are really needed.

@andreasgerstmayr
Collaborator

@josunect we just released version 0.9.0 of the operator, which allows overriding resource requests and limits per component.

Could you test whether this resolves the issue? We'll soon deprecate the current formula for allocating resources to components and switch to T-shirt sizes (#845) instead.
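
One way to apply the per-component override from the earlier comment, assuming the TempoStack is named tempo-cr in the tempo namespace:

kubectl patch tempostack tempo-cr -n tempo --type merge -p '
{
  "spec": {
    "template": {
      "queryFrontend": {
        "component": {
          "resources": {
            "limits": {"cpu": "2", "memory": "2Gi"}
          }
        }
      }
    }
  }
}'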

@josunect
Author


Thank you, @andreasgerstmayr! I have done some initial testing and it looks like performance has improved. I will do some more testing and update the issue.

Thanks!

@josunect
Author

josunect commented Apr 1, 2024

Hi, @andreasgerstmayr

In the tests I've done, I didn't find any issues. This resource allocation configuration seems more appropriate.
I think the issue can be closed.

Thank you!

@andreasgerstmayr
Collaborator


Thanks for the follow-up! I'll close this issue then.
