Inconsistent Prometheus reporting when zdbs are unreachable #118

Closed
scottyeager opened this issue Jul 10, 2024 · 12 comments · Fixed by #130

scottyeager commented Jul 10, 2024

I created a zstor setup with four metadata nodes and four data backend nodes. These are distributed across four physical nodes, such that each physical node hosts one metadata zdb and one data zdb.

To simulate a failure, I blocked network traffic to one of the physical nodes. As expected, the zstor status command shows that both of the associated zdbs are unreachable:

# zstor -c /etc/zstor-default.toml status
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                             | reachable | objects | used space | free space | usage percentage |
+=====================================================================+===========+=========+============+============+==================+
| [44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900 - 81-42998-node1meta0 | No        |       - |          - |          - |                - |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900 - 81-42999-node3meta0 | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900 - 81-43000-node5meta0 | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900 - 81-43001-node7meta0  | Yes       |      93 |      44208 | 1073741824 |                0 |
+---------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| backend                                                                | reachable | objects | used space | free space | usage percentage |
+========================================================================+===========+=========+============+============+==================+
| [44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900 - 81-43002-node1backend0 | No        |       - |          - |          - |                - |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900 - 81-43003-node3backend0 | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900 - 81-43004-node5backend0 | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+
| [564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900 - 81-43005-node7backend0  | Yes       |     620 | 1071916880 | 1073741824 |               99 |
+------------------------------------------------------------------------+-----------+---------+------------+------------+------------------+

On the other hand, if I fetch Prometheus metrics from the /metrics endpoint, I see that only the metadata zdb has been removed from the list of reported data points:

# HELP data_disk_freespace_bytes data_disk_freespace_bytes in namespace
# TYPE data_disk_freespace_bytes gauge
data_disk_freespace_bytes{address="[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",backend_type="data",namespace="81-43002-node1backend0"} 996966887424
data_disk_freespace_bytes{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="data",namespace="81-43003-node3backend0"} 2997354729472
data_disk_freespace_bytes{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="meta",namespace="81-42999-node3meta0"} 2997354729472
data_disk_freespace_bytes{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="data",namespace="81-43004-node5backend0"} 2997354999808
data_disk_freespace_bytes{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="meta",namespace="81-43000-node5meta0"} 2997354999808
data_disk_freespace_bytes{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="data",namespace="81-43005-node7backend0"} 2997354733568
data_disk_freespace_bytes{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="meta",namespace="81-43001-node7meta0"} 2997354733568
# HELP data_faults data_faults in namespace
# TYPE data_faults gauge
data_faults{address="[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",backend_type="data",namespace="81-43002-node1backend0"} 0
data_faults{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="data",namespace="81-43003-node3backend0"} 0
data_faults{address="[4b8:280d:e777:5252:f985:19cc:dcfd:7624]:9900",backend_type="meta",namespace="81-42999-node3meta0"} 0
data_faults{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="data",namespace="81-43004-node5backend0"} 0
data_faults{address="[4d0:6235:9b0e:328b:b0f8:4c1c:5ed8:b673]:9900",backend_type="meta",namespace="81-43000-node5meta0"} 0
data_faults{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="data",namespace="81-43005-node7backend0"} 0
data_faults{address="[564:7d94:d5cc:f37b:4783:35d6:6b37:a40]:9900",backend_type="meta",namespace="81-43001-node7meta0"} 0

Based on my reading of #72, it seems the expected behavior is for the metrics of any unreachable backend to be removed from the metrics list.

Indeed I see errors in the log file:

2024-07-10 22:37:41 +00:00: WARN Failed to delete removed metric by label: Error: missing labels {"backend_type": "meta", "address": "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900", "namespace": "81-42998-node1meta0"}
2024-07-10 22:37:41 +00:00: WARN Failed to delete removed metric by label: Error: missing labels {"namespace": "81-43002-node1backend0", "address": "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900", "backend_type": "meta"}

There are two things that seem off about these logs. First, the namespace 81-43002-node1backend0 is not a metadata backend, but the log shows "backend_type": "meta". Second, the same error appears for both unreachable backends, even though one was actually removed successfully.
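
As a side note on the mechanics: below is a minimal sketch, assuming zstor exports these gauges through the Rust prometheus crate (an assumption on my part, not verified against the source). In that crate a series can only be dropped with exactly the label values it was registered with, so trying to remove the data namespace under backend_type "meta" fails and the stale series keeps being exported:

// Minimal sketch, not zstor's actual code: shows how removal by label behaves
// in the Rust `prometheus` crate (assumed; metric name, labels and values are
// taken from the output above).
use std::collections::HashMap;

use prometheus::{opts, register_gauge_vec};

fn main() {
    let free_space = register_gauge_vec!(
        opts!(
            "data_disk_freespace_bytes",
            "data_disk_freespace_bytes in namespace"
        ),
        &["address", "backend_type", "namespace"]
    )
    .unwrap();

    // Series created while the data backend was still reachable.
    free_space
        .with_label_values(&[
            "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",
            "data",
            "81-43002-node1backend0",
        ])
        .set(996_966_887_424.0);

    // Removing with the wrong backend_type ("meta" instead of "data") fails,
    // so the stale series would keep showing up on /metrics. This is the kind
    // of mismatch the WARN log above points at.
    let mut wrong = HashMap::new();
    wrong.insert("address", "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900");
    wrong.insert("backend_type", "meta");
    wrong.insert("namespace", "81-43002-node1backend0");
    assert!(free_space.remove(&wrong).is_err());

    // Removing with the exact labels the series was registered with succeeds.
    assert!(free_space
        .remove_label_values(&[
            "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900",
            "data",
            "81-43002-node1backend0",
        ])
        .is_ok());
}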

LeeSmet (Contributor) commented Sep 10, 2024

How did you block the traffic? I'm struggling to reproduce this locally.

scottyeager (Author) commented:

I used an ip6tables rule to DROP traffic, for example:

# ip6tables -L

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all      anywhere             2a10:b600:1:0:9cc7:4bff:fe83:4fbd

I'm able to reproduce the behavior of only the metadata zdb being removed from the metrics pretty consistently. After I deployed this fresh instance, I only saw the expected behavior of both zdbs being removed the first time I added the iptables rule; on every subsequent attempt only the metadata zdb was removed.

My test cycle is like this:

  1. iptables -A ...
  2. zstor -c /etc/zstor_config.toml status
  3. wget localhost:9100/metrics
  4. iptables -D ...
  5. zstor -c /etc/zstor_config.toml status
  6. wget localhost:9100/metrics


Mik-TF commented Nov 7, 2024

@LeeSmet @iwanbk Can you check this? Thanks!

iwanbk (Member) commented Nov 13, 2024

> There are two things that seem off about these logs. First, the namespace 81-43002-node1backend0 is not a metadata backend, but the log shows "backend_type": "meta". Second, the same error appears for both unreachable backends, even though one was actually removed successfully.

This error should be fixed by #126 as well (assuming that the root cause is the same).

I'll try to reproduce it and see whether #126 fixes the issue.

iwanbk (Member) commented Nov 13, 2024

> This error should be fixed by #126 as well (assuming that the root cause is the same).
>
> I'll try to reproduce it and see whether #126 fixes the issue.

Confirmed.

iwanbk self-assigned this Nov 13, 2024

Mik-TF commented Nov 13, 2024

Amazing. Thanks @iwanbk, well done.

@scottyeager please check on your end if it helps.

scottyeager (Author) commented:

I'll test ASAP.

iwanbk assigned scottyeager and unassigned iwanbk Nov 15, 2024
scottyeager (Author) commented:

I have verified that the core of this issue is solved. The backends are now consistently removed from Prometheus metrics when traffic is interrupted.

One last question for you though, @iwanbk. I'm still seeing a lot of logs with WARN Failed to delete removed metric by label: Error: missing labels. Is it normal for zstor to keep trying to remove them and issuing these warnings, or is there something further to investigate?
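
For illustration only, here is a purely hypothetical sketch (made-up types, not zstor's code) of how such repeated warnings could arise: if a removal that keeps failing (for example because of mismatched labels, or because the series is already gone) is retried on every metrics refresh, the same warning gets logged each time.

// Hypothetical sketch with made-up names, not zstor's actual code: a failing
// removal that is kept around and retried on every refresh re-logs the same
// "Failed to delete removed metric by label" warning each time.
use std::collections::HashSet;

#[derive(Clone, Hash, PartialEq, Eq)]
struct SeriesLabels {
    address: String,
    backend_type: String,
    namespace: String,
}

// Stand-in for the real call that drops a series from the Prometheus registry.
fn try_remove_series(_labels: &SeriesLabels) -> Result<(), String> {
    // If the stored labels never match a registered series (or it is already
    // gone), the removal keeps failing forever.
    Err("missing labels".to_string())
}

fn refresh_metrics(pending_removals: &mut HashSet<SeriesLabels>) {
    // Keep only the removals that still fail, so they are retried (and
    // re-logged) on the next refresh.
    pending_removals.retain(|labels| match try_remove_series(labels) {
        Ok(()) => false, // removed, stop tracking it
        Err(e) => {
            eprintln!("WARN Failed to delete removed metric by label: Error: {e}");
            true // retry on the next refresh -> repeated warnings
        }
    });
}

fn main() {
    let mut pending: HashSet<SeriesLabels> = HashSet::new();
    pending.insert(SeriesLabels {
        address: "[44a:5fd6:b716:d147:f5ae:7088:a381:4976]:9900".into(),
        backend_type: "meta".into(),
        namespace: "81-42998-node1meta0".into(),
    });

    // Each simulated refresh logs the warning again, mirroring the behavior
    // described above.
    for _ in 0..3 {
        refresh_metrics(&mut pending);
    }
}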

iwanbk (Member) commented Nov 18, 2024

> Is it normal for zstor to keep trying to remove them and issuing these warnings, or is there something further to investigate?

It is not normal and I didn't see it. I'll check it again later.

iwanbk self-assigned this Nov 18, 2024
iwanbk (Member) commented Nov 18, 2024

> It is not normal and I didn't see it. I'll check it again later.

OK, I can reproduce it; I have to refresh my browser to see it.

I've fixed it in #130.

Thanks for the amazing test 👍


Mik-TF commented Nov 18, 2024

@iwanbk nicely done. Thanks to both of you and @scottyeager for the work here.

We are getting close to having qsfs set up.

scottyeager (Author) commented:

> OK, I can reproduce it; I have to refresh my browser to see it.
>
> I've fixed it in #130.

Great, thanks @iwanbk!
