wireguard: bridge fdb entry sometimes missing? #147

Open

lemoer opened this issue Feb 13, 2021 · 22 comments

@lemoer
Contributor

lemoer commented Feb 13, 2021

I just noticed that for my router the bridge fdb entry for 00:00:00:00:00:00 was missing when I ran bridge fdb. Only 72:4c:e2:db:6f:37 dev vx-99 dst fe80::247:34ff:fef4:26cc via wg-99 self is visible.

Details:

  • WireGuard handshakes are established.
  • The router received nothing on the vpn interface.
  • On the router, I didn't see a batctl neighbour on vpn.
  • On sn10, I saw the node as a neighbour.
  • systemctl restart wg_netlink.service didn't help.
  • Rebooting solved the issue.
    • But the entry for 00:00:00:00:00:00 still does not exist.
    • But it seems to work anyway.

We should keep an eye on this.
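
For reference, a rough checklist of the commands behind the symptoms above (just a sketch; it assumes the supernode-side names wg-99, vx-99 and bat99 matching the entry quoted above):

# 1. Does the peer have a recent handshake?
wg show wg-99 latest-handshakes
# 2. Are the per-client and 00:00:00:00:00:00 flood entries present on the vxlan interface?
bridge fdb | grep 'dev vx-99'
# 3. Does batman see the node as a neighbour?
batctl meshif bat99 n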

@lemoer lemoer changed the title bridge fdb entry sometimes missing bridge fdb entry sometimes missing? Feb 13, 2021
@AiyionPrime
Member

Did you keep in mind that some of our tools filter out the 00:00:00:00:00:00 entry?

@lemoer
Contributor Author

lemoer commented Feb 14, 2021

No, I didn't. Which tools filter it?

@AiyionPrime
Member

I thought we did in wg_established; but that one is implicit: the dummy entry simply never completes a handshake and is therefore dropped by the awk filter.
Will look into this again.
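
For the record, a minimal sketch of the kind of filtering meant here (not the actual wg_established code): wg show <iface> latest-handshakes prints one public-key/timestamp pair per peer, with timestamp 0 for peers that never completed a handshake, so a filter on the second column implicitly drops the dummy peer as well.

wg show wg-99 latest-handshakes | awk '$2 != 0 { print $1 }'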

@CodeFetch
Contributor

E.g. the statistics export 5d22e24

@AiyionPrime
Member

True, but not what I had in mind.
Maybe there was a shell script before that filtered it, or something in our netlink.py.

if public_key == (43 * "0" + "="):

@CodeFetch
Contributor

@lemoer Why should there be an fdb entry for 00:00:00:00:00:00 at all? Does it have anything to do with the dummy peer?

@AiyionPrime
Member

Interesting:
for i in 01 07 08 09 10; do echo sn$i; ssh zone.ffh.s.sn$i -C bridge fdb | grep 00:00:00:00:00:00; done

sn01
00:00:00:00:00:00 dev vx-14 dst fe80::2ce:7ff:fe40:5a6c self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::2fa:fcff:feb9:c861 self permanent
00:00:00:00:00:00 dev vx-21 dst fe80::28b:f8ff:fe51:88ee self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2be:e1ff:feac:7147 self permanent
sn07
sn08
sn09
00:00:00:00:00:00 dev vx-15 dst fe80::231:b7ff:fea4:a410 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::209:95ff:fe01:9ea9 self permanent
00:00:00:00:00:00 dev vx-16 dst fe80::2a5:a4ff:feec:563a self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::213:18ff:fe6e:f314 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::252:acff:fee3:caeb self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::21b:e3ff:fe04:4409 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::256:3cff:fe07:1fce self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::292:3cff:fe3e:21d8 self permanent
sn10
00:00:00:00:00:00 dev vx-19 dst fe80::2d7:55ff:fe3e:dbbc self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::255:cdff:fe56:6f7d self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2e3:6cff:fe5d:d07c self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::247:34ff:fef4:26cc self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2b6:3dff:fe32:5577 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2ff:96ff:fe41:1b70 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::25a:5dff:fed9:de19 self permanent
00:00:00:00:00:00 dev vx-13 dst fe80::2af:aeff:fe58:3cdb self permanent
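
For context: on a VXLAN interface, 00:00:00:00:00:00 is the flood/default fdb entry; broadcast, multicast and unknown-unicast frames are replicated to every remote listed under that MAC. If the flood entry for a remote is missing, batman's broadcast OGMs never reach that node, which would fit the "no batctl neighbour, but handshakes fine" symptom above. Such an entry is normally appended like this (placeholder addresses, not a literal command from our setup):

bridge fdb append 00:00:00:00:00:00 dev vx-99 dst fe80::EXAMPLE via wg-99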

@AiyionPrime
Member

Maybe we should add a script that allows reproducing this issue with less effort, in order to get more eyes on it?
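
Something like the following could be a starting point (only a sketch; it assumes that on a healthy supernode every vx-* interface has at least one 00:00:00:00:00:00 flood entry per connected remote):

for vx in /sys/class/net/vx-*; do
    vx=$(basename $vx)
    flood=$(bridge fdb | grep -c "^00:00:00:00:00:00 dev $vx ")
    echo "$vx: $flood flood entries"
done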

@lemoer lemoer changed the title bridge fdb entry sometimes missing? wireguard: bridge fdb entry sometimes missing? Feb 20, 2021
@lemoer
Contributor Author

lemoer commented Feb 20, 2021

Maybe it's not a bridge fdb problem. I just observed on sn01 that the vx-... interfaces are not added to batman:

sn01:

[root@sn01]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
[root@sn01]:~ # 

sn09:

[root@sn09]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
vx-10: active
vx-11: active
vx-12: active
vx-13: active
vx-14: active
vx-15: active
vx-16: active
vx-17: active
vx-18: active
vx-19: active
vx-20: active
vx-21: active
vx-22: active
vx-23: active
vx-99: active
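
A quick way to spot this state would be something like the sketch below (it assumes the vx-XX interfaces are supposed to be attached to a matching batXX meshif, as the listings above suggest):

for vx in /sys/class/net/vx-*; do
    vx=$(basename $vx)
    bat=bat${vx#vx-}
    batctl meshif $bat if | grep -q "^$vx:" || echo "$vx is missing from $bat"
done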

@lemoer
Contributor Author

lemoer commented Feb 20, 2021

As a quick fix I ran:

ls -d /sys/class/net/bat* | cut -d '/' -f 5 | grep -v bat0 | sed 's_bat__g' | xargs -n 1 -I XX systemctl start [email protected]

My node is now properly connected. But I am not sure whether all problems described in this issue are solved.

@AiyionPrime
Member

I'll look into it tomorrow in the afternoon.

@lemoer lemoer added this to the Beginn der stabilen Phase milestone Feb 22, 2021
@lemoer lemoer added the bug label Feb 22, 2021
@lemoer
Contributor Author

lemoer commented Feb 22, 2021

I added the milestone "Beginn der stabilen Phase", as this is likely a bug. But since it happens sporadically, I am not sure whether we will resolve this issue before the "stabile Phase".

lemoer added a commit that referenced this issue Feb 23, 2021
On some supernodes, this did not work on system boot. The vx-*
interfaces were added to batman but immediately removed again, because
the interfaces were not up. Now we bring them up before the vx-*
interface is added to batman.

Discussed in #147.

#147 (comment)
@lemoer
Contributor Author

lemoer commented Feb 23, 2021

I implemented a fix for the mentioned issue in 5fc0673.

But I am not sure whether all problems described in this issue are solved.
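
In other words, the ordering now is roughly (a sketch with XX as a placeholder for the domain number; 5fc0673 has the authoritative version):

ip link set vx-XX up
batctl meshif batXX interface add vx-XX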

@AiyionPrime
Member

If what you did in 5fc0673 is indeed a fix,
we need to rewrite wait_for_iface.sh, as it would then be broken, right?

@lemoer
Contributor Author

lemoer commented Feb 28, 2021

I think the problem discussed here is the same as #175.

@lemoer
Contributor Author

lemoer commented Feb 28, 2021

It does not make sense to have either #175 or #147 (this issue) as a blocker for the infrastructure freeze week, so I'll remove the milestone here.

@lemoer lemoer removed this from the wireguard infrastructure freeze milestone Feb 28, 2021
@lemoer
Contributor Author

lemoer commented Apr 27, 2021

Today a similar issue appeared, but this time only the route is missing while the fdb entry is there. Maybe it's related, maybe not...

(Originally reported by @bschelm via Mail.)


I collected some data:

WG is established (on the router):

root@NDS-PoE-Test1:~# ubus call wgpeerselector.vpn status
{
	"peers": {
		"sn07": false,
		"sn01": false,
		"sn09": false,
		"sn10": {
			"established": 12262
		},
		"sn05": false
	}
}

WG is established (on sn10):

[root@sn10]:~ # ffh_wg_established.sh | grep dom14
95	dom14	/etc/wireguard/peers-wg/aiyion-JT-OR750i
1819344	dom14	/etc/wireguard/peers-wg/charon
595543	dom14	/etc/wireguard/peers-wg/nds-esperanto
11446	dom14	/etc/wireguard/peers-wg/nds-fwh-tresckowstr-technik-vorne
3	dom14	/etc/wireguard/peers-wg/nds-poe-test1
2268739	dom14	/etc/wireguard/peers-wg/nds-schwule-sau
1684077	dom14	/etc/wireguard/peers-wg/nds-the-dalek-mothership
1643047	dom14	/etc/wireguard/peers-wg/nds-the-tardis
683281	dom14	/etc/wireguard/peers-wg/wgtest-1043-lemoer

IPv6 link-local addresses on the router:

root@NDS-PoE-Test1:~# ip a s vpn
12: vpn: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet6 fe80::2dc:dfff:fecc:981d/128 scope link 
       valid_lft forever preferred_lft forever
root@NDS-PoE-Test1:~# ip a s vx_vpn_wired
15: vx_vpn_wired: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1330 qdisc noqueue master bat0 state UNKNOWN group default qlen 1000
    link/ether 02:29:04:5d:75:e7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::29:4ff:fe5d:75e7/64 scope link 
       valid_lft forever preferred_lft forever

But no appropriate route for the router's fe80::2dc:dfff:fecc:981d is installed on sn10:

[root@sn10]:~ # ip -6 route | grep -i wg-14
fe80::213:18ff:fe6e:f314 dev wg-14 proto static metric 1024 pref medium
fe80::/64 dev wg-14 proto kernel metric 256 pref medium

Bridge fdb entry is ok:

[root@sn10]:~ # bridge fdb list | grep wg-14
1e:bd:8f:52:15:d7 dev vx-14 dst fe80::213:18ff:fe6e:f314 via wg-14 self 
02:29:04:5d:75:e7 dev vx-14 dst fe80::2dc:dfff:fecc:981d via wg-14 self 
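
For completeness: the missing piece seems to be a host route like the existing one for fe80::213:18ff:fe6e:f314, i.e. a /128 for the router's wg link-local address. As a manual workaround one could probably install it by hand (untested sketch; normally the glue code should create this itself):

ip -6 route add fe80::2dc:dfff:fecc:981d/128 dev wg-14 proto static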

@lemoer
Contributor Author

lemoer commented Apr 28, 2021

Even if we restart the service, the route is not created...

Some analysis follows:

Here we see that we have 91 peers per interface:

[root@sn10]:~ # wg | grep -e '^[^ ]' | cut -d ' ' -f 1 | uniq -c
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:

A small patch applied to netlink.py:

diff --git a/netlink.py b/netlink.py
index 31a1e76..743dfdb 100644
--- a/netlink.py
+++ b/netlink.py
@@ -97,10 +97,13 @@ class ConfigManager:
         with WireGuard() as wg:
             clients = wg.info(self.wg_interface)[0].WGDEVICE_A_PEERS.value
 
+            print(f"LEN: {len(clients)}, iface={self.wg_interface}")
             for client in clients:
                 latest_handshake = client.WGPEER_A_LAST_HANDSHAKE_TIME["tv_sec"]
                 public_key = client.WGPEER_A_PUBLIC_KEY["value"].decode("utf-8")
 
+                print(f"A: {public_key}")
+
                 peer = self.find_by_public_key(public_key)
                 if len(peer) < 1:
                     peer = WireGuardPeer(public_key)

It shows only 89 or 90 peers per interface:

[root@sn10]:~ # /usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | grep LEN
LEN: 90, iface=wg-10
LEN: 90, iface=wg-11
LEN: 90, iface=wg-12
LEN: 90, iface=wg-13
LEN: 89, iface=wg-14
LEN: 89, iface=wg-15
LEN: 89, iface=wg-16
LEN: 90, iface=wg-17
LEN: 89, iface=wg-18
LEN: 90, iface=wg-19
LEN: 90, iface=wg-20
LEN: 90, iface=wg-21
LEN: 90, iface=wg-22
LEN: 90, iface=wg-23
LEN: 90, iface=wg-99

(Even if 90 were only an off-by-one discrepancy, it would not be consistent, as a few interfaces show only 89.)
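
To see which peers actually get lost, one could cross-check the kernel's peer list against the keys printed by the patched netlink.py (a rough sketch; the A: lines above are not tagged per interface, so this only catches keys that are missing from every interface):

wg show wg-14 peers | sort > /tmp/kernel-peers
/usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | awk '/^A: /{print $2}' | sort -u > /tmp/netlink-peers
comm -23 /tmp/kernel-peers /tmp/netlink-peers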

@lemoer
Contributor Author

lemoer commented Apr 29, 2021

This week's finding has now been fixed in freifunkh/wireguard-vxlan-glue@7c876de.

@1977er
Member

1977er commented May 31, 2023

Can this be closed as of the last comment? Or is there anything we can / should regularly test (via monitoring)?

@AiyionPrime
Member

I think it reads as if only the finding of that week was resolved, not the whole issue.
But maybe it has been resolved regardless.

@1977er
Member

1977er commented Dec 20, 2024

Added further monitoring in Zabbix (FF Wireguard Template). Per supernode:

  • number of (wg) peers
  • number of vxlan-glue routes
  • number of fdb entries

Added a trigger if these numbers mismatch.
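
Roughly the same consistency check can be run by hand on a supernode (a sketch only, assuming the wg-XX / vx-XX naming; the exact relation between the three numbers depends on the setup, but they should stay in step):

for wg in /sys/class/net/wg-*; do
    wg=$(basename $wg)
    vx=vx-${wg#wg-}
    peers=$(wg show $wg peers | wc -l)
    routes=$(ip -6 route show dev $wg proto static | wc -l)
    fdb=$(bridge fdb | grep -c " dev $vx .* via $wg ")
    echo "$wg: peers=$peers routes=$routes fdb=$fdb"
done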
