Inconsistent Prometheus reporting when zdbs are unreachable #118
Comments
How did you block the traffic? I'm struggling to reproduce this locally.
I used an iptables directive to
# ip6tables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP all anywhere 2a10:b600:1:0:9cc7:4bff:fe83:4fbd
I'm able to reproduce the behavior of only removing the metadata zdb from metrics pretty consistently. After I deployed this fresh instance, I saw the expected behavior of removing both zdbs only the first time I added the iptables rule. Every subsequent time, only the metadata zdb was removed. My test cycle is like this:
This error should be fixed by #126 as well (assuming that the root cause is the same). I'll try to reproduce it and see whether #126 fixes the issue.
Amazing. Thanks @iwanbk, well done. @scottyeager, please check on your end whether it helps.
I'll test ASAP.
I have verified that the core of this issue is solved. The backends are now consistently removed from Prometheus metrics when traffic is interrupted. One last question for you though @iwanbk. I'm still seeing a lot of logs with
It is not normal, and I didn't see it.
OK, I can reproduce it; I have to refresh my browser to see it. I've fixed it in #130. Thanks for the amazing test 👍
@iwanbk nicely done. Thanks to both of you and @scottyeager for the work here. We are getting close to having qsfs set up.
I created a zstor setup with four metadata nodes and four data backend nodes. These are distributed across four physical nodes, such that each physical node hosts one metadata zdb and one data zdb.
To simulate a failure, I blocked network traffic to one of the physical nodes. As expected, the zstor status command shows that both of the associated zdbs are unreachable:
On the other hand, if I fetch Prometheus metrics from the /metrics endpoint, I see that only the metadata zdb has been removed from the list of supplied data points:
Based on my reading of #72, the expected behavior seems to be that the metric for any unreachable backend is removed from the metrics list.
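For anyone who wants to check this expectation mechanically, here is a minimal Python sketch that parses Prometheus text output and reports which blocked backends are still present. It is not zstor tooling: the metric name `backend_entries`, the `address` and `backend_type` labels, and the sample text are hypothetical placeholders made up for illustration, so substitute whatever your instance actually exports.

```python
import re

# Matches one label pair inside {...}, e.g. address="[2001:db8::1]:9900".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def backends_in_metrics(text, label="address"):
    """Return the set of values of `label` seen across all samples."""
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        start, end = line.find("{"), line.rfind("}")
        if start == -1 or end < start:
            continue  # sample without labels
        labels = dict(LABEL_RE.findall(line[start + 1:end]))
        if label in labels:
            seen.add(labels[label])
    return seen


def still_reported(text, unreachable, label="address"):
    """Return unreachable backends that still appear in the metrics."""
    return sorted(set(unreachable) & backends_in_metrics(text, label))


# Hypothetical sample: the node at 2001:db8::1 is blocked. Its metadata zdb
# (port 9901) is gone from the output, but its data zdb (port 9900) is not.
SAMPLE = """\
# TYPE backend_entries gauge
backend_entries{address="[2001:db8::1]:9900",backend_type="data"} 12
backend_entries{address="[2001:db8::2]:9900",backend_type="data"} 12
backend_entries{address="[2001:db8::2]:9901",backend_type="meta"} 3
"""

BLOCKED = ["[2001:db8::1]:9900", "[2001:db8::1]:9901"]
print(still_reported(SAMPLE, BLOCKED))
```

An empty list would mean every blocked backend was dropped from the metrics, which is the behavior I expected; a non-empty list names the backends that were left behind.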
Indeed I see errors in the log file:
There are two things that seem off about these logs. First, the namespace 81-43002-node1backend0 was not a metadata backend, but the log shows "backend_type": "meta". Second, the same error appears for both unreachable backends, even though one has actually been successfully removed.
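The first mismatch could be spotted automatically by comparing each logged backend type against the type known from the configuration. The sketch below is only an illustration under assumptions: it assumes each log line is a JSON object, and the `namespace` field name is hypothetical (only `backend_type` appears in the logs above). The sample line is constructed for the example, not copied from a real log.

```python
import json


def mislabeled_backends(log_lines, expected_types):
    """Yield (namespace, logged_type, expected_type) for each mismatch."""
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured line; ignore it
        if not isinstance(record, dict):
            continue
        ns = record.get("namespace")  # hypothetical field name
        logged = record.get("backend_type")
        expected = expected_types.get(ns)
        # Only report namespaces we know about whose logged type disagrees.
        if expected is not None and logged is not None and logged != expected:
            yield ns, logged, expected


# Type taken from the setup described in this issue: this namespace is a
# data backend, not a metadata backend.
EXPECTED = {"81-43002-node1backend0": "data"}

# Constructed sample lines, including one unstructured line that is skipped.
LINES = [
    '{"namespace": "81-43002-node1backend0", "backend_type": "meta"}',
    "plain text line without JSON",
]

print(list(mislabeled_backends(LINES, EXPECTED)))
```

Any tuple in the result points at a log entry whose backend type disagrees with the configured one.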