wireguard: bridge fdb entry sometimes missing? #147

Open

lemoer opened this issue Feb 13, 2021 · 22 comments

@lemoer
Contributor

lemoer commented Feb 13, 2021

I just noticed that for my router the bridge fdb entry for 00:00:00:00:00:00 was missing when I ran bridge fdb. Only 72:4c:e2:db:6f:37 dev vx-99 dst fe80::247:34ff:fef4:26cc via wg-99 self is visible.

Details:

  • WireGuard handshakes are established.
  • The router received nothing on the vpn interface.
  • On the router, I didn't see a batctl neighbour on vpn.
  • On sn10, I saw the node as a neighbour.
  • systemctl restart wg_netlink.service didn't help.
  • Rebooting solved the issue.
    • But the entry for 00:00:00:00:00:00 still does not exist.
    • But it seems to work anyway.

We should keep an eye on this.
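
For reference, a rough checklist of the commands behind the symptoms above (just a sketch; it assumes the supernode-side names wg-99, vx-99 and bat99 matching the entry quoted above):

# 1. Does the peer have a recent handshake?
wg show wg-99 latest-handshakes
# 2. Are the per-client and 00:00:00:00:00:00 flood entries present on the vxlan interface?
bridge fdb | grep 'dev vx-99'
# 3. Does batman see the node as a neighbour?
batctl meshif bat99 n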

@lemoer lemoer changed the title bridge fdb entry sometimes missing bridge fdb entry sometimes missing? Feb 13, 2021
@AiyionPrime
Member

Did you keep in mind that some of our tools filter out the 00:00:00:00:00:00 entry?

@lemoer
Contributor Author

lemoer commented Feb 14, 2021

No, I didn't. Which tools filter it?

@AiyionPrime
Member

I thought we did in wg_established; but that one is implicit: the dummy entry simply never completes a handshake and is therefore dropped by the awk filter.
Will look into this again.
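
For the record, a minimal sketch of the kind of filtering meant here (not the actual wg_established code): wg show <iface> latest-handshakes prints one public-key/timestamp pair per peer, with timestamp 0 for peers that never completed a handshake, so a filter on the second column implicitly drops the dummy peer as well.

wg show wg-99 latest-handshakes | awk '$2 != 0 { print $1 }'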

@CodeFetch
Contributor

E.g. the statistics export 5d22e24

@AiyionPrime
Member

True, but not what I had in mind.
Maybe there was a shell script before that filtered it, or something in our netlink.py.

if public_key == (43 * "0" + "="):

@CodeFetch
Contributor

@lemoer Why should there be an fdb entry for 00:00:00:00:00:00 at all? Does it have anything to do with the dummy peer?

@AiyionPrime
Member

Interesting:
for i in 01 07 08 09 10; do echo sn$i; ssh zone.ffh.s.sn$i -C bridge fdb | grep 00:00:00:00:00:00; done

sn01
00:00:00:00:00:00 dev vx-14 dst fe80::2ce:7ff:fe40:5a6c self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::2fa:fcff:feb9:c861 self permanent
00:00:00:00:00:00 dev vx-21 dst fe80::28b:f8ff:fe51:88ee self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2be:e1ff:feac:7147 self permanent
sn07
sn08
sn09
00:00:00:00:00:00 dev vx-15 dst fe80::231:b7ff:fea4:a410 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::209:95ff:fe01:9ea9 self permanent
00:00:00:00:00:00 dev vx-16 dst fe80::2a5:a4ff:feec:563a self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::213:18ff:fe6e:f314 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::252:acff:fee3:caeb self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::21b:e3ff:fe04:4409 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::256:3cff:fe07:1fce self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::292:3cff:fe3e:21d8 self permanent
sn10
00:00:00:00:00:00 dev vx-19 dst fe80::2d7:55ff:fe3e:dbbc self permanent
00:00:00:00:00:00 dev vx-20 dst fe80::255:cdff:fe56:6f7d self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::2e3:6cff:fe5d:d07c self permanent
00:00:00:00:00:00 dev vx-99 dst fe80::247:34ff:fef4:26cc self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2b6:3dff:fe32:5577 self permanent
00:00:00:00:00:00 dev vx-14 dst fe80::2ff:96ff:fe41:1b70 self permanent
00:00:00:00:00:00 dev vx-15 dst fe80::25a:5dff:fed9:de19 self permanent
00:00:00:00:00:00 dev vx-13 dst fe80::2af:aeff:fe58:3cdb self permanent
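
For context: on a VXLAN interface, 00:00:00:00:00:00 is the flood/default fdb entry; broadcast, multicast and unknown-unicast frames are replicated to every remote listed under that MAC. If the flood entry for a remote is missing, batman's broadcast OGMs never reach that node, which would fit the "no batctl neighbour, but handshakes fine" symptom above. Such an entry is normally appended like this (placeholder addresses, not a literal command from our setup):

bridge fdb append 00:00:00:00:00:00 dev vx-99 dst fe80::EXAMPLE via wg-99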

@AiyionPrime
Member

Maybe we should add a script that allows reproducing this issue with less effort, in order to get more eyes on it?
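
Something like the following could be a starting point (only a sketch; it assumes that on a healthy supernode every vx-* interface has at least one 00:00:00:00:00:00 flood entry per connected remote):

for vx in /sys/class/net/vx-*; do
    vx=$(basename $vx)
    flood=$(bridge fdb | grep -c "^00:00:00:00:00:00 dev $vx ")
    echo "$vx: $flood flood entries"
done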

@lemoer lemoer changed the title bridge fdb entry sometimes missing? wireguard: bridge fdb entry sometimes missing? Feb 20, 2021
@lemoer
Contributor Author

lemoer commented Feb 20, 2021

Maybe it's not a bridge fdb problem. I just observed on sn01 that the vx-... interfaces are not added to batman:

sn01:

[root@sn01]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
[root@sn01]:~ # 

sn09:

[root@sn09]:~ # ls -d /sys/class/net/bat* | cut -d '/' -f 5 | xargs -n 1 -I X batctl meshif X if | grep vx
vx-10: active
vx-11: active
vx-12: active
vx-13: active
vx-14: active
vx-15: active
vx-16: active
vx-17: active
vx-18: active
vx-19: active
vx-20: active
vx-21: active
vx-22: active
vx-23: active
vx-99: active
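
A quick way to spot this state would be something like the sketch below (it assumes the vx-XX interfaces are supposed to be attached to a matching batXX meshif, as the listings above suggest):

for vx in /sys/class/net/vx-*; do
    vx=$(basename $vx)
    bat=bat${vx#vx-}
    batctl meshif $bat if | grep -q "^$vx:" || echo "$vx is missing from $bat"
done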

@lemoer
Contributor Author

lemoer commented Feb 20, 2021

As a quick fix I ran:

ls -d /sys/class/net/bat* | cut -d '/' -f 5 | grep -v bat0 | sed 's_bat__g' | xargs -n 1 -I XX systemctl start [email protected]

My node is now properly connected. But I am not sure whether all problems described in this issue are solved.

@AiyionPrime
Member

I'll look into it tomorrow in the afternoon.

@lemoer lemoer added this to the Beginn der stabilen Phase milestone Feb 22, 2021
@lemoer lemoer added the bug label Feb 22, 2021
@lemoer
Contributor Author

lemoer commented Feb 22, 2021

I added the milestone "Beginn der stabilen Phase", as this is likely a bug. But since it happens sporadically, I am not sure whether we will resolve this issue before the "stabile Phase".

lemoer added a commit that referenced this issue Feb 23, 2021
On some supernodes, this did not work on system boot. The vx-*
interfaces were added to batman but immediately removed again, because
the interfaces were not up. Now we bring them up before the vx-*
interface is added to batman.

Discussed in #147.

#147 (comment)
@lemoer
Contributor Author

lemoer commented Feb 23, 2021

I implemented a fix for the mentioned issue in 5fc0673.

But I am not sure whether all problems described in this issue are solved.
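
In other words, the ordering now is roughly (a sketch with XX as a placeholder for the domain number; 5fc0673 has the authoritative version):

ip link set vx-XX up
batctl meshif batXX interface add vx-XX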

@AiyionPrime
Member

If what you did in 5fc0673 is indeed a fix,
we need to rewrite wait_for_iface.sh, as it would then be broken, right?

@lemoer
Contributor Author

lemoer commented Feb 28, 2021

I think the problem discussed here is the same as #175.

@lemoer
Contributor Author

lemoer commented Feb 28, 2021

It does not make sense to have either #175 or #147 (this issue) as a blocker for the infrastructure freeze week, so I'll remove the milestone here.

@lemoer lemoer removed this from the wireguard infrastructure freeze milestone Feb 28, 2021
@lemoer
Contributor Author

lemoer commented Apr 27, 2021

Today a similar issue appeared, but this time only the route is missing while the fdb entry is there. Maybe it's related, maybe not...

(Originally reported by @bschelm via Mail.)


I collected some data:

WG is established (on the router):

root@NDS-PoE-Test1:~# ubus call wgpeerselector.vpn status
{
	"peers": {
		"sn07": false,
		"sn01": false,
		"sn09": false,
		"sn10": {
			"established": 12262
		},
		"sn05": false
	}
}

WG is established (on sn10):

[root@sn10]:~ # ffh_wg_established.sh | grep dom14
95	dom14	/etc/wireguard/peers-wg/aiyion-JT-OR750i
1819344	dom14	/etc/wireguard/peers-wg/charon
595543	dom14	/etc/wireguard/peers-wg/nds-esperanto
11446	dom14	/etc/wireguard/peers-wg/nds-fwh-tresckowstr-technik-vorne
3	dom14	/etc/wireguard/peers-wg/nds-poe-test1
2268739	dom14	/etc/wireguard/peers-wg/nds-schwule-sau
1684077	dom14	/etc/wireguard/peers-wg/nds-the-dalek-mothership
1643047	dom14	/etc/wireguard/peers-wg/nds-the-tardis
683281	dom14	/etc/wireguard/peers-wg/wgtest-1043-lemoer

IPv6 link-local addresses on the router:

root@NDS-PoE-Test1:~# ip a s vpn
12: vpn: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet6 fe80::2dc:dfff:fecc:981d/128 scope link 
       valid_lft forever preferred_lft forever
root@NDS-PoE-Test1:~# ip a s vx_vpn_wired
15: vx_vpn_wired: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1330 qdisc noqueue master bat0 state UNKNOWN group default qlen 1000
    link/ether 02:29:04:5d:75:e7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::29:4ff:fe5d:75e7/64 scope link 
       valid_lft forever preferred_lft forever

But no appropriate route for the router's fe80::2dc:dfff:fecc:981d is installed on sn10:

[root@sn10]:~ # ip -6 route | grep -i wg-14
fe80::213:18ff:fe6e:f314 dev wg-14 proto static metric 1024 pref medium
fe80::/64 dev wg-14 proto kernel metric 256 pref medium

Bridge fdb entry is ok:

[root@sn10]:~ # bridge fdb list | grep wg-14
1e:bd:8f:52:15:d7 dev vx-14 dst fe80::213:18ff:fe6e:f314 via wg-14 self 
02:29:04:5d:75:e7 dev vx-14 dst fe80::2dc:dfff:fecc:981d via wg-14 self 
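
For completeness: the missing piece seems to be a host route like the existing one for fe80::213:18ff:fe6e:f314, i.e. a /128 for the router's wg link-local address. As a manual workaround one could probably install it by hand (untested sketch; normally the glue code should create this itself):

ip -6 route add fe80::2dc:dfff:fecc:981d/128 dev wg-14 proto static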

@lemoer
Contributor Author

lemoer commented Apr 28, 2021

Even if we restart the service, the route is not created...

Some analysis follows:

Here we see that we have 91 peers per interface:

[root@sn10]:~ # wg | grep -e '^[^ ]' | cut -d ' ' -f 1 | uniq -c
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:
      1 interface:
     91 peer:

A small patch applied to netlink.py:

diff --git a/netlink.py b/netlink.py
index 31a1e76..743dfdb 100644
--- a/netlink.py
+++ b/netlink.py
@@ -97,10 +97,13 @@ class ConfigManager:
         with WireGuard() as wg:
             clients = wg.info(self.wg_interface)[0].WGDEVICE_A_PEERS.value
 
+            print(f"LEN: {len(clients)}, iface={self.wg_interface}")
             for client in clients:
                 latest_handshake = client.WGPEER_A_LAST_HANDSHAKE_TIME["tv_sec"]
                 public_key = client.WGPEER_A_PUBLIC_KEY["value"].decode("utf-8")
 
+                print(f"A: {public_key}")
+
                 peer = self.find_by_public_key(public_key)
                 if len(peer) < 1:
                     peer = WireGuardPeer(public_key)

It shows only 89 or 90 peers per interface:

[root@sn10]:~ # /usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | grep LEN
LEN: 90, iface=wg-10
LEN: 90, iface=wg-11
LEN: 90, iface=wg-12
LEN: 90, iface=wg-13
LEN: 89, iface=wg-14
LEN: 89, iface=wg-15
LEN: 89, iface=wg-16
LEN: 90, iface=wg-17
LEN: 89, iface=wg-18
LEN: 90, iface=wg-19
LEN: 90, iface=wg-20
LEN: 90, iface=wg-21
LEN: 90, iface=wg-22
LEN: 90, iface=wg-23
LEN: 90, iface=wg-99

(Even if 90 were only an off-by-one discrepancy, it would not be consistent, as a few interfaces show only 89.)
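
To see which peers actually get lost, one could cross-check the kernel's peer list against the keys printed by the patched netlink.py (a rough sketch; the A: lines above are not tagged per interface, so this only catches keys that are missing from every interface):

wg show wg-14 peers | sort > /tmp/kernel-peers
/usr/bin/python3 /srv/wireguard/vxlan-glue/netlink.py -c /etc/wireguard/netlink_cfg.json | awk '/^A: /{print $2}' | sort -u > /tmp/netlink-peers
comm -23 /tmp/kernel-peers /tmp/netlink-peers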

@lemoer
Contributor Author

lemoer commented Apr 29, 2021

This week's finding has now been fixed in freifunkh/wireguard-vxlan-glue@7c876de.

@1977er
Member

1977er commented May 31, 2023

Can this be closed as of the last comment? Or is there anything we can / should regularly test (via monitoring)?

@AiyionPrime
Member

I think it reads as if only the finding of that week was resolved, not the whole issue.
But maybe it has been resolved regardless.

@1977er
Member

1977er commented Dec 20, 2024

Added further monitoring in Zabbix (FF Wireguard Template). Per supernode:

  • number of (wg) peers
  • number of vxlan-glue routes
  • number of fdb entries

Added a trigger if these numbers mismatch.
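
Roughly the same consistency check can be run by hand on a supernode (a sketch only, assuming the wg-XX / vx-XX naming; the exact relation between the three numbers depends on the setup, but they should stay in step):

for wg in /sys/class/net/wg-*; do
    wg=$(basename $wg)
    vx=vx-${wg#wg-}
    peers=$(wg show $wg peers | wc -l)
    routes=$(ip -6 route show dev $wg proto static | wc -l)
    fdb=$(bridge fdb | grep -c " dev $vx .* via $wg ")
    echo "$wg: peers=$peers routes=$routes fdb=$fdb"
done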
