[BUG] Cortex v1.18.0 Upgrade Causing OOMKills and CPU Spikes in Store-Gateway #6259
Comments
Hi @dpericaxon, Thanks for filing the issue. I was looking at the pprof attached in the issue and noticed that
Something that changed in between 1.17.1 and 1.18.0 is this
I don't see this flag being set in your values file for the queriers that enabled it before the upgrade:
querier.query-ingesters-within: 8h
querier.max-fetched-data-bytes-per-query: "2147483648"
querier.max-fetched-chunks-per-query: "1000000"
querier.max-fetched-series-per-query: "200000"
querier.max-samples: "50000000"
blocks-storage.bucket-store.bucket-index.enabled: true
I have a feeling that since it is always enabled, the label values are being returned for the entire time range instead of just the instant that the query was run. Could you try setting |
It could indeed be because of that flag. Good catch, @CharlieTLe. Maybe we should default the series/label names APIs to query the last 24 hours if the time range is not specified? |
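To make the proposal above concrete, here is a minimal sketch of defaulting a missing time range to the last 24 hours. It is not Cortex code: the function name `resolve_label_query_range` and the constant `DEFAULT_LOOKBACK` are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 24 hours, the lookback suggested in the comment above.
DEFAULT_LOOKBACK = timedelta(hours=24)


def resolve_label_query_range(start=None, end=None, now=None, lookback=DEFAULT_LOOKBACK):
    """Return a (start, end) pair, filling in whichever bound is missing."""
    if now is None:
        now = datetime.now(timezone.utc)
    if end is None:
        # No end given: query up to the current time.
        end = now
    if start is None:
        # No start given: look back a bounded window instead of all blocks.
        start = end - lookback
    if start > end:
        raise ValueError("start must not be after end")
    return start, end
```

With a default like this, a label names or values request that omits `start` and `end` would only touch blocks overlapping the last day, while callers that pass an explicit range keep the current behavior.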
I think we should be able to set a limit for how many label values can be queried so that even if a long time range is specified, it doesn't cause the store-gateway to use too much memory. |
There is an effort to limit this, but it may not be straightforward, as this limit can only be applied after querying the index (and for those particular APIs, querying the index is all the work). |
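The trade-off in the two comments above can be sketched as follows. This is an illustration only, not the store-gateway implementation: `collect_label_values` and `LabelValuesLimitError` are hypothetical names, and each block's label values are modelled as a plain list of strings that has already been read from the index.

```python
class LabelValuesLimitError(Exception):
    """Raised when a query would return more distinct label values than allowed."""


def collect_label_values(per_block_values, limit):
    """Merge label values across blocks, failing fast once `limit` is exceeded.

    `per_block_values` is an iterable of iterables of strings, one per block.
    """
    seen = set()
    for values in per_block_values:
        for value in values:
            if value in seen:
                continue
            seen.add(value)
            if len(seen) > limit:
                # Stop merging early so the merged result stays bounded.
                raise LabelValuesLimitError(
                    f"label values limit of {limit} exceeded"
                )
    return sorted(seen)
```

Note what the sketch does not do: the limit bounds the size of the merged result, but each block's values have already been read before the check runs, which is exactly the caveat raised above about the limit only applying after the index has been queried.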
Should we add the flag to restore the previous behavior until a limit can be set on the maximum number of label values that could be fetched? Or perhaps set an execution time limit on the fetching so that it can be cancelled if it takes longer than a specified duration? I think this specific API call is mostly used by query builders to make autocomplete possible? |
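The execution time limit idea could look roughly like the sketch below. It is a simplified, cooperative model rather than anything Cortex does today: `fetch_with_deadline` and `FetchTimeout` are hypothetical names, and each per-block fetch is represented by a zero-argument callable.

```python
import time


class FetchTimeout(Exception):
    """Raised when fetching exceeds the configured time budget."""


def fetch_with_deadline(fetchers, timeout_s, clock=time.monotonic):
    """Run each fetcher in turn, giving up once `timeout_s` seconds have elapsed.

    The deadline is only checked between fetchers, so a single slow fetch
    is not interrupted; it just prevents further work from starting.
    """
    deadline = clock() + timeout_s
    results = []
    for fetch in fetchers:
        if clock() >= deadline:
            raise FetchTimeout(f"label fetch exceeded {timeout_s}s")
        results.extend(fetch())
    return results
```

The `clock` parameter is injectable so the behavior can be checked with a fake clock. A real implementation would more likely propagate a cancellable request context into the fetch itself, so that in-flight work can also be abandoned.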
I don't think the increased heap usage was caused by label values requests. If you look at the heap profile, the memory was used by the binary index header, which is expected, as the Store Gateway caches blocks' symbols and some postings. Also, the heap profile provided may not capture what took the memory, as it only shows 600MB. I recommend taking another heap dump from a Store Gateway where you observe high memory usage. |
Thank you @CharlieTLe and @yeya24 for your suggestions
CPU and memory spike after setting to false and upgrade to v1.18.0
Here’s a quick PPROF of the Store Gateway during one of these OOM incidents:
(pprof) top
Showing nodes accounting for 975.10MB, 95.68% of 1019.09MB total
Dropped 206 nodes (cum <= 5.10MB)
Showing top 10 nodes out of 66
flat flat% sum% cum cum%
464.82MB 45.61% 45.61% 464.82MB 45.61% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init.func3
178.29MB 17.50% 63.11% 683.35MB 67.05% github.com/thanos-io/thanos/pkg/block/indexheader.(*BinaryReader).init
129.10MB 12.67% 75.78% 129.10MB 12.67% github.com/thanos-io/thanos/pkg/pool.NewBucketedBytes.func1
76.84MB 7.54% 83.32% 76.84MB 7.54% github.com/thanos-io/thanos/pkg/cacheutil.NewAsyncOperationProcessor
64.77MB 6.36% 89.67% 65.77MB 6.45% github.com/bradfitz/gomemcache/memcache.parseGetResponse
40.23MB 3.95% 93.62% 40.23MB 3.95% github.com/prometheus/prometheus/tsdb/index.NewSymbols
13.94MB 1.37% 94.99% 13.94MB 1.37% github.com/klauspost/compress/s2.NewWriter.func1
4.10MB 0.4% 95.39% 687.45MB 67.46% github.com/thanos-io/thanos/pkg/block/indexheader.newFileBinaryReader
1.50MB 0.15% 95.54% 5.55MB 0.54% github.com/thanos-io/thanos/pkg/store.(*blockSeriesClient).nextBatch
1.50MB 0.15% 95.68% 35.02MB 3.44% github.com/thanos-io/thanos/pkg/store.populateChunk |
Hi @elliesaber, Unfortunately, setting We could bring the flag back by reverting #5984. I'm not really sure why we decided to remove this flag instead of setting its default to true. Adding the flag back could help with users that are looking to upgrade to 1.18.0 without querying the store gateway for labels. |
Thank you @CharlieTLe for the suggestion. I agree that being able to set querier.query-store-for-labels-enabled manually instead of relying on the default behavior would be helpful for us. Reverting the flag and allowing users to control whether or not to query the store gateway for labels would give us more flexibility. This would likely prevent the significant CPU and memory spikes that are leading to OOMKills and help smooth the upgrade process to v1.18.0. We’d appreciate this addition as it would enable us to upgrade without running into these memory issues. |
I don't think the heap dump above shows the issue was label values touching store gateway. The heap dump was probably not at the right time as your memory usage showed that it could go to 48GB. For the memory usage metric, are you using the Another thing that might help with the issue is setting |
This message seems pretty telling that it is caused by the behavior controlled by the flag
If we ignored the heap dump, it does seem possible that there is a label with a very high cardinality. If there is no limit to how many label values could be queried, I could imagine that the store-gateway could be overwhelmed with fetching all of the values possible for a label. |
@yeya24 we used |
Thanks and sorry for the late response. @elliesaber How does metric If you confirmed that the OOM kill was caused by |
Hey @yeya24, we believe it's related to that flag. This is what the |
@dpericaxon I don't think the graph shows that the flag is related. It looks more related to a deployment. Do you have any API requests that ask for label names/values at the time of the spikes? The flag is related to the labels APIs, so we need evidence that those API calls caused the memory increase. You can reproduce this by calling the API manually yourself. |
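For anyone wanting to try the manual reproduction suggested above, the following sketch builds a label values request URL. The path `/api/v1/label/<name>/values` and the `start`/`end` parameters come from the Prometheus HTTP API; the host, port, `/prometheus` prefix, and label name used in the example are assumptions about a particular deployment, and `label_values_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def label_values_url(base_url, label, start=None, end=None):
    """Build a Prometheus-style label values URL, optionally time-bounded.

    `start` and `end` are Unix timestamps (or RFC 3339 strings).
    """
    path = f"{base_url.rstrip('/')}/api/v1/label/{quote(label, safe='')}/values"
    params = {}
    if start is not None:
        params["start"] = str(start)
    if end is not None:
        params["end"] = str(end)
    return f"{path}?{urlencode(params)}" if params else path


# Assumed deployment details: adjust host, port, and prefix to your setup.
url = label_values_url(
    "http://localhost:8080/prometheus", "pod", start=1727740800, end=1727827200
)
```

Requesting the URL with and without `start`/`end` while watching Store Gateway memory would show whether unbounded label queries are responsible for the spikes. With multi-tenancy enabled, Cortex also expects a tenant ID in the `X-Scope-OrgID` header.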
Hey @yeya24, I observed the issue immediately after updating Cortex to
Key Observations:
Memory Spikes and Metrics:
Questions:
Let me know if you need further details or additional logs to debug this issue. |
Describe the bug
Following the upgrade of Cortex from v1.17.1 to v1.18.0, the Store Gateway Pods are frequently encountering OOMKills. These events appear random, occurring approximately every 5 minutes, and have continued beyond the upgrade. Before the upgrade, memory usage consistently hovered around 4GB, with CPU usage under 1 core. However, after the upgrade, both CPU and memory usage have spiked to over 10 times their typical levels. Even after increasing the memory limit for the Store Gateway to 30GB, the issue persists. (see graph below)
We initially suspected the issue might be related to the sharding ring configurations, so we attempted to disable the following flags:
However, this did not resolve the problem.
CPU Graph: The far left shows usage before the upgrade, the middle represents usage during the upgrade, and the far right illustrates the rollback, where CPU usage returns to normal levels.
Memory Graph: The far left shows memory usage before the upgrade, the middle represents usage during the upgrade, and the far right reflects the rollback, where memory usage returns to normal levels.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The Store Gateway shouldn't be getting OOMKilled.
Environment:
Additional Context
Helm Chart Values Passed
Quick PPROF of Store GW