Inconsistent Prometheus reporting when zdbs are unreachable #118

Closed
scottyeager opened this issue Jul 10, 2024 · 12 comments · Fixed by #130

scottyeager commented Jul 10, 2024

I created a zstor setup with four metadata nodes and four data backend nodes. These are distributed across four physical nodes, such that each physical node hosts one metadata zdb and one data zdb.

To simulate a failure, I blocked network traffic to one of the physical nodes. As expected, the zstor status command shows that both of the associated zdbs are unreachable:

# zstor -c /etc/zstor-default.toml status
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                             | reachable | objects | used space | free space | usage percentage |
+=====================================================================+===========+=========+============+============+==================+
| [44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900 - 81-42998-node1meta0 | No        |       - |          - |          - |                - |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900 - 81-42999-node3meta0 | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900 - 81-43000-node5meta0 | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900 - 81-43001-node7meta0  | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                                | reachable | objects | used space | free space | usage percentage |
+========================================================================+===========+=========+============+============+==================+
| [44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900 - 81-43002-node1backend0 | No        |       - |          - |          - |                - |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900 - 81-43003-node3backend0 | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900 - 81-43004-node5backend0 | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900 - 81-43005-node7backend0  | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+

On the other hand, if I fetch Prometheus metrics from the /metrics endpoint, I see that only the metadata zdb has been removed from the list of reported data points:

# HELP data_disk_freespace_bytes data_disk_freespace_bytes in namespace
# TYPE data_disk_freespace_bytes gauge
data_disk_freespace_bytes{address="[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",backend_type="data",namespace="81-43002-node1backend0"} 996966887424
data_disk_freespace_bytes{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="data",namespace="81-43003-node3backend0"} 2997354729472
data_disk_freespace_bytes{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="meta",namespace="81-42999-node3meta0"} 2997354729472
data_disk_freespace_bytes{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="data",namespace="81-43004-node5backend0"} 2997354999808
data_disk_freespace_bytes{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="meta",namespace="81-43000-node5meta0"} 2997354999808
data_disk_freespace_bytes{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="data",namespace="81-43005-node7backend0"} 2997354733568
data_disk_freespace_bytes{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="meta",namespace="81-43001-node7meta0"} 2997354733568
# HELP data_faults data_faults in namespace
# TYPE data_faults gauge
data_faults{address="[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",backend_type="data",namespace="81-43002-node1backend0"} 0
data_faults{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="data",namespace="81-43003-node3backend0"} 0
data_faults{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="meta",namespace="81-42999-node3meta0"} 0
data_faults{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="data",namespace="81-43004-node5backend0"} 0
data_faults{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="meta",namespace="81-43000-node5meta0"} 0
data_faults{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="data",namespace="81-43005-node7backend0"} 0
data_faults{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="meta",namespace="81-43001-node7meta0"} 0

Based on my reading of #72, it seems the expected behavior is for the metrics of any unreachable backend to be removed from the metrics list.

Indeed I see errors in the log file:

2024-07-10 22:37:41 +00:00: WARN Failed to delete removed metric by label: Error: missing labels {"backend_type": "meta", "address": "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900", "namespace": "81-42998-node1meta0"}
2024-07-10 22:37:41 +00:00: WARN Failed to delete removed metric by label: Error: missing labels {"namespace": "81-43002-node1backend0", "address": "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900", "backend_type": "meta"}

There are two things that seem off about these logs. First, the namespace 81-43002-node1backend0 is not a metadata backend, but the log shows "backend_type": "meta". Second, the same error appears for both unreachable backends, even though one was actually removed successfully.
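
As a side note on the mechanics: below is a minimal sketch, assuming zstor exports these gauges through the Rust prometheus crate (an assumption on my part, not verified against the source). In that crate a series can only be dropped with exactly the label values it was registered with, so trying to remove the data namespace under backend_type "meta" fails and the stale series keeps being exported:

// Minimal sketch, not zstor's actual code: shows how removal by label behaves
// in the Rust `prometheus` crate (assumed; metric name, labels and values are
// taken from the output above).
use std::collections::HashMap;

use prometheus::{opts, register_gauge_vec};

fn main() {
    let free_space = register_gauge_vec!(
        opts!(
            "data_disk_freespace_bytes",
            "data_disk_freespace_bytes in namespace"
        ),
        &["address", "backend_type", "namespace"]
    )
    .unwrap();

    // Series created while the data backend was still reachable.
    free_space
        .with_label_values(&[
            "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",
            "data",
            "81-43002-node1backend0",
        ])
        .set(996_966_887_424.0);

    // Removing with the wrong backend_type ("meta" instead of "data") fails,
    // so the stale series would keep showing up on /metrics. This is the kind
    // of mismatch the WARN log above points at.
    let mut wrong = HashMap::new();
    wrong.insert("address", "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900");
    wrong.insert("backend_type", "meta");
    wrong.insert("namespace", "81-43002-node1backend0");
    assert!(free_space.remove(&wrong).is_err());

    // Removing with the exact labels the series was registered with succeeds.
    assert!(free_space
        .remove_label_values(&[
            "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",
            "data",
            "81-43002-node1backend0",
        ])
        .is_ok());
}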

LeeSmet (Contributor) commented Sep 10, 2024

How did you block the traffic? I'm struggling to reproduce this locally.

scottyeager (Author) commented:

I used an ip6tables rule to DROP traffic, for example:

# ip6tables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all      anywhere             2a10:b600:1:0:9cc7:4bff:fe83:4fbd

I'm able to reproduce the behavior of only the metadata zdb being removed from the metrics pretty consistently. After I deployed this fresh instance, I only saw the expected behavior of both zdbs being removed the first time I added the iptables rule; on every subsequent attempt only the metadata zdb was removed.

My test cycle is like this:

  1. iptables -A ...
  2. zstor -c /etc/zstor_config.toml status
  3. wget localhost:9100/metrics
  4. iptables -D ...
  5. zstor -c /etc/zstor_config.toml status
  6. wget localhost:9100/metrics


Mik-TF commented Nov 7, 2024

@LeeSmet @iwanbk Can you check this? Thanks!

iwanbk (Member) commented Nov 13, 2024

> There are two things that seem off about these logs. First, the namespace 81-43002-node1backend0 is not a metadata backend, but the log shows "backend_type": "meta". Second, the same error appears for both unreachable backends, even though one was actually removed successfully.

This error should be fixed by #126 as well (assuming that the root cause is the same).

I'll try to reproduce it and see whether #126 fixes the issue.

iwanbk (Member) commented Nov 13, 2024

> This error should be fixed by #126 as well (assuming that the root cause is the same).
>
> I'll try to reproduce it and see whether #126 fixes the issue.

Confirmed.

iwanbk self-assigned this Nov 13, 2024

Mik-TF commented Nov 13, 2024

Amazing. Thanks @iwanbk, well done.

@scottyeager please check on your end if it helps.

scottyeager (Author) commented:

I'll test ASAP.

iwanbk assigned scottyeager and unassigned iwanbk Nov 15, 2024
scottyeager (Author) commented:

I have verified that the core of this issue is solved. The backends are now consistently removed from Prometheus metrics when traffic is interrupted.

One last question for you though, @iwanbk. I'm still seeing a lot of logs with WARN Failed to delete removed metric by label: Error: missing labels. Is it normal for zstor to keep trying to remove them and issuing these warnings, or is there something further to investigate?
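
For illustration only, here is a purely hypothetical sketch (made-up types, not zstor's code) of how such repeated warnings could arise: if a removal that keeps failing (for example because of mismatched labels, or because the series is already gone) is retried on every metrics refresh, the same warning gets logged each time.

// Hypothetical sketch with made-up names, not zstor's actual code: a failing
// removal that is kept around and retried on every refresh re-logs the same
// "Failed to delete removed metric by label" warning each time.
use std::collections::HashSet;

#[derive(Clone, Hash, PartialEq, Eq)]
struct SeriesLabels {
    address: String,
    backend_type: String,
    namespace: String,
}

// Stand-in for the real call that drops a series from the Prometheus registry.
fn try_remove_series(_labels: &SeriesLabels) -> Result<(), String> {
    // If the stored labels never match a registered series (or it is already
    // gone), the removal keeps failing forever.
    Err("missing labels".to_string())
}

fn refresh_metrics(pending_removals: &mut HashSet<SeriesLabels>) {
    // Keep only the removals that still fail, so they are retried (and
    // re-logged) on the next refresh.
    pending_removals.retain(|labels| match try_remove_series(labels) {
        Ok(()) => false, // removed, stop tracking it
        Err(e) => {
            eprintln!("WARN Failed to delete removed metric by label: Error: {e}");
            true // retry on the next refresh -> repeated warnings
        }
    });
}

fn main() {
    let mut pending: HashSet<SeriesLabels> = HashSet::new();
    pending.insert(SeriesLabels {
        address: "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900".into(),
        backend_type: "meta".into(),
        namespace: "81-42998-node1meta0".into(),
    });

    // Each simulated refresh logs the warning again, mirroring the behavior
    // described above.
    for _ in 0..3 {
        refresh_metrics(&mut pending);
    }
}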

iwanbk (Member) commented Nov 18, 2024

> Is it normal for zstor to keep trying to remove them and issuing these warnings, or is there something further to investigate?

It is not normal and I didn't see it. I'll check it again later.

iwanbk self-assigned this Nov 18, 2024
iwanbk (Member) commented Nov 18, 2024

> It is not normal and I didn't see it. I'll check it again later.

OK, I can reproduce it; I have to refresh my browser to see it.

I've fixed it in #130.

Thanks for the amazing test 👍


Mik-TF commented Nov 18, 2024

@iwanbk nicely done. Thanks to both of you and @scottyeager for the work here.

We are getting close to having qsfs set up.

scottyeager (Author) commented:

> OK, I can reproduce it; I have to refresh my browser to see it.
>
> I've fixed it in #130.

Great, thanks @iwanbk!
