Compactor: upgrading to Prometheus 2.47.0 breaks the compactor #6723
Comments
We've been on Prometheus 2.47 and Thanos 0.32 since yesterday and have not yet experienced this issue; I've just checked our logs. We are trying to figure out why the compactor is suddenly using a LOT of disk I/O and running constantly, though.
There might be a difference between the flags we add to the compactor. We run the compactor with these:
We are also on 2.47.0 and the compactor has the same error messages on one Prometheus (but not on another one).
On this Prometheus we do federation, and not on the one that's working.
When you rolled back the versions, did you also delete the problematic chunks, which halted the compaction?
We run the compactor with "--compact.skip-block-with-out-of-order-chunks", so the blocks are marked for no compaction; we did not delete anything, to prevent data loss. It's a rollback, not a solution :)
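For readers less familiar with that flag: with `--compact.skip-block-with-out-of-order-chunks`, the compactor marks an offending block for no compaction and moves on instead of halting. The underlying problem is a per-series invariant: chunks must be sorted by time and must not overlap. Below is a minimal, self-contained Go sketch of that kind of ordering check, using a simplified `chunkMeta` type of my own rather than the real Prometheus `chunks.Meta`; treat it as an illustration of the invariant, not the actual compactor code path.

```go
package main

import "fmt"

// chunkMeta is a simplified stand-in for a chunk's time range;
// the real TSDB metadata also carries a chunk reference and encoding.
type chunkMeta struct {
	MinTime, MaxTime int64 // inclusive bounds, milliseconds since epoch
}

// outOfOrder reports whether any chunk of a series starts at or before
// the end of the previous chunk, i.e. the chunks are unsorted or overlap.
func outOfOrder(chks []chunkMeta) bool {
	for i := 1; i < len(chks); i++ {
		if chks[i].MinTime <= chks[i-1].MaxTime {
			return true
		}
	}
	return false
}

func main() {
	series := []chunkMeta{
		{MinTime: 0, MaxTime: 1000},
		{MinTime: 900, MaxTime: 2000}, // overlaps the previous chunk
	}
	fmt.Println("out-of-order:", outOfOrder(series)) // out-of-order: true
}
```

Blocks failing a check like this are the ones that end up marked for no compaction when the flag is set.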
We've encountered the same problem on a few deployments using AWS S3.
Got the same issue:
It happened after we upgraded our Prometheus instances to v2.47.0 and turned on native-histograms.
We do not have "native-histograms" in our config.
No, I do not do federation.
Same issue for us. We did try running the sidecar with Thanos v0.31. The streams we get from the Thanos receiver get compacted without problems.
Just encountered the same in our setup (object storage: S3). The Thanos compactor is failing due to out-of-order chunks; we tried the skip-block-with-out-of-order-chunks option, but it hit this with almost every chunk, so we rolled back Prometheus to 2.46.0.
Same issue for me, so I added log prints to find out what the problem was, and found that the problem had already been solved in prometheus/prometheus#12874, but it hasn't been released yet.
I don't want to delete blocks because of this problem; I want a tool that fixes the problem, so now I'm going to add a command to the thanos tool to handle it.
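On the idea of a repair tool: conceptually, fixing such a block means re-sorting each series' chunks by time, merging samples where chunks overlap, and then rewriting the index and chunk files as a new block. A rough, hypothetical Go sketch of just the reordering step, again using a simplified stand-in type rather than the real TSDB structures:

```go
package main

import (
	"fmt"
	"sort"
)

// chunkRef is a simplified, hypothetical chunk descriptor; a real block
// also stores the encoded samples and their offsets in the chunk files.
type chunkRef struct {
	MinTime, MaxTime int64
}

// reorder returns the series' chunks sorted by start time, which is the
// order the index expects. Overlapping chunks would still need their
// samples merged, so a full repair must re-encode chunk data too.
func reorder(chks []chunkRef) []chunkRef {
	out := append([]chunkRef(nil), chks...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].MinTime != out[j].MinTime {
			return out[i].MinTime < out[j].MinTime
		}
		return out[i].MaxTime < out[j].MaxTime
	})
	return out
}

func main() {
	fixed := reorder([]chunkRef{
		{MinTime: 2000, MaxTime: 3000},
		{MinTime: 0, MaxTime: 1000},
	})
	fmt.Println(fixed) // [{0 1000} {2000 3000}]
}
```

A real fix also has to write the repaired data out as a new block (with its own ULID and meta.json) and delete or deprecate the bad one, which is why a dedicated tool, or the upstream Prometheus fix, is the safer route than hand-editing blocks.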
Can we upgrade Prometheus without waiting for a release and tag a patch version of Thanos? This is a serious issue that merits action. Mimir already did the upgrade: grafana/mimir#6107.
CC @saswatamcode I guess we can do a v0.32.4 release for this fix prometheus/prometheus#12874 and previous fixes.
Ack! I think the upgrade will be a bit more involved tho, will do it on
@saswatamcode After taking another look, this seems to be a Prometheus-only issue. Thanos is still using an older version of Prometheus, so it is not affected by this bug. It is just the bad blocks created by Prometheus causing the compactor to fail.
@saswatamcode Is that the fix? I made some code changes in my branch and am testing them. I'm on vacation here; I'll go back after the holiday to see the actual situation. If it's fixed, I'll drop the code I changed.
@mickeyzzc No, since the bad blocks are created by Prometheus, we need a tool to repair the bad blocks. Thanos doesn't cause those blocks to be corrupted.
@saswatamcode @yeya24 I'm testing in my environment, and the anomalous block has passed with the fix.
You mean the faulty blocks got compacted?
It's possible in this case.
Hello,
Sounds like you have a different issue. The issue here was not that blocks are not pushed to the store, but that the blocks in the store were out of order and could not be compacted.
@rouke-broersma You are right. My apologies. We assumed this issue resulted in unavailability of data in the storage. But as I see now, these are two different issues.
Thanos, Prometheus and Golang version used:
Prometheus: 2.47.0
Thanos: 0.32.2
Object Storage Provider:
Azure Blob
What happened:
After upgrading Prometheus to version 2.47.0, the compactor stopped working. All blocks created after the upgrade have out-of-order chunks.
The compactor fails when it reaches a new block with error:
What you expected to happen:
The compactor succeeds in compacting the blocks.
How to reproduce it (as minimally and precisely as possible):
Run Prometheus 2.47.0 with the compactor enabled.
Full logs to relevant components:
Anything else we need to know:
After reverting back to 2.46.0, the newly created blocks don't give the error anymore and the issue seems to be solved.
Already discussed this problem on the CNCF slack: https://cloud-native.slack.com/archives/CK5RSSC10/p1694681247238809