routers apparently prefer fastd #175

Open
AiyionPrime opened this issue Feb 25, 2021 · 14 comments

Comments

@AiyionPrime
Member

@CodeFetch and @bschelm observed that routers tend to prefer connections via fastd rather than WireGuard.

@CodeFetch further found this to be connected to packet loss in WireGuard.

We need statistics to back up these hypotheses.

@bschelm

bschelm commented Feb 25, 2021

A router that has a WG connection and several wifi mesh partners seemed to have lost its WG connection, although the router's status page still showed it as connected to the WG supernode. The router did not, or could not, use that WG connection and routed via the wifi mesh instead.

What I tried: disabling wifi for 5 minutes via "wifi down ; sleep 300 ; wifi" in order to force the router to use the WG connection instead of the wifi mesh path. That didn't work; the router was offline for 5 minutes.

What helped was a restart of WG with "ifdown vpn ; sleep 5 ; ifup vpn".
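
One way to check whether a tunnel that the status page reports as connected is actually alive is to look at the age of the last WireGuard handshake. The following is only a minimal sketch, not something used in this thread: the function name `stale_peers` and the 180-second threshold are assumptions for illustration. It reads the tab-separated output format of `wg show <interface> latest-handshakes` on stdin.

```shell
#!/bin/sh
# Sketch: list WireGuard peers whose latest handshake is stale.
# stdin: lines of "<public-key><TAB><unix-timestamp>", as printed by
#        `wg show <interface> latest-handshakes`.
# $1: current time in seconds since the epoch
# $2: maximum accepted handshake age in seconds (assumed threshold)
stale_peers() {
    now="$1"
    max_age="$2"
    while read -r key ts; do
        [ -n "$key" ] || continue
        # A timestamp of 0 means that no handshake has completed yet.
        if [ "$ts" -eq 0 ] || [ $((now - ts)) -gt "$max_age" ]; then
            echo "$key"
        fi
    done
}
```

On a router this could be called as `wg show vpn latest-handshakes | stale_peers "$(date +%s)" 180`, assuming the WireGuard device is named `vpn` as in the commands above. Any key it prints belongs to a peer that has not completed a handshake within the threshold.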

@lemoer
Contributor

lemoer commented Feb 26, 2021 via email

@bschelm

bschelm commented Feb 26, 2021

I would have to wait for another occasion.
It happened twice already.
I can't tell when it happened because the router, in that case, is still online via mesh.
You see it only when you click on the router.
After restarting WG, it connected to a different SN.

@lemoer
Contributor

lemoer commented Feb 26, 2021 via email

@bschelm

bschelm commented Feb 26, 2021

Nope.
VPN-Neighbours is always zero.
Same on my router.

@lemoer
Contributor

lemoer commented Feb 27, 2021

Screenshot from 2021-02-27 14-14-54

@bschelm I added another graph to the dashboard. It's quite messy, so I selected some traces and posted a screenshot above. The selected traces contain rx TQ from and tx TQ to the supernodes. Are your outages correlated with the gaps in the graph?

@lemoer
Contributor

lemoer commented Feb 27, 2021

Well, the time range is kinda long. Here is a more detailed screenshot of the recent history:

Screenshot from 2021-02-27 14-24-31

@lemoer
Contributor

lemoer commented Feb 28, 2021

From all I have heard, this doesn't happen very often. So let's start with our Infrastructure Freeze Week and see whether it occurs again in that week. If it happens again, please do not "fix" it directly, but collect as much data as possible:

  • output of batctl n from the router
  • output of batctl meshif bat14 n from the connected supernode
  • output of wg show from the router
  • output of wg show from the connected supernode
  • screenshot of the status page of the router
  • ip -6 route from the router
  • ip -6 route from the supernode
  • 20 seconds of tcpdump -n -i vpn inbound -w /tmp/test1.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i vpn outbound -w /tmp/test2.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i vx_vpn_wired inbound -w /tmp/test3.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i vx_vpn_wired outbound -w /tmp/test4.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i br-wan inbound -w /tmp/test5.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i br-wan outbound -w /tmp/test6.pcap from the router (collect it via scp)
  • 20 seconds of tcpdump -n -i vx-14 inbound -w /root/test7.pcap from the supernode (collect it via scp)
  • 20 seconds of tcpdump -n -i vx-14 outbound -w /root/test8.pcap from the supernode (collect it via scp)
  • 20 seconds of tcpdump -n -i wg-14 inbound -w /root/test9.pcap from the supernode (collect it via scp)
  • 20 seconds of tcpdump -n -i wg-14 outbound -w /root/test10.pcap from the supernode (collect it via scp)
  • output of bridge fdb show | grep vx from the connected supernode
  • output of logread from the router
  • output of uci export from the router
  • output of ip addr show from the router
  • Find the exact time when the problem started.

Hopefully this data will be enough to find the issue.
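
The router-side items of the list above could be gathered with a small helper script. This is a rough sketch under assumptions, not a tested tool: the names `collect`, `collect_router`, `timed_capture` and the `OUT_DIR` path are made up for this example. The `tcpdump` call puts the `inbound`/`outbound` filter expression after the options, which is the conventional argument order.

```shell
#!/bin/sh
# Sketch of a router-side collector; OUT_DIR and file names are assumptions.
OUT_DIR="${OUT_DIR:-/tmp/wg-debug}"

# Run a command and store its stdout and stderr in $OUT_DIR/<name>.txt.
# A failing or missing command is noted in the file instead of aborting.
collect() {
    name="$1"
    shift
    mkdir -p "$OUT_DIR" || return 1
    {
        echo "# $(date -u '+%Y-%m-%dT%H:%M:%SZ') $*"
        "$@" 2>&1 || echo "# command failed with status $?"
    } > "$OUT_DIR/$name.txt"
}

# Text outputs requested from the router in the list above.
collect_router() {
    collect batctl-n   batctl n
    collect wg-show    wg show
    collect ip6-route  ip -6 route
    collect logread    logread
    collect uci-export uci export
    collect ip-addr    ip addr show
}

# Capture one direction on one interface for a fixed time (default 20 s).
# $1: interface, $2: "inbound" or "outbound", $3: pcap file, $4: seconds
timed_capture() {
    iface="$1"
    direction="$2"
    file="$3"
    seconds="${4:-20}"
    tcpdump -n -i "$iface" -w "$file" "$direction" &
    pid=$!
    sleep "$seconds"
    # Stop the capture; ignore the error if tcpdump has already exited.
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
    return 0
}
```

A run on the router might then look like `collect_router; timed_capture vpn inbound /tmp/test1.pcap; timed_capture vpn outbound /tmp/test2.pcap`, followed by copying `$OUT_DIR` and the pcap files off the device via scp. The supernode-side items would need the same treatment with `batctl meshif bat14 n` and the `vx-14` and `wg-14` interfaces.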

@lemoer
Contributor

lemoer commented Feb 28, 2021

I think this is the same issue as #147.

@lemoer
Contributor

lemoer commented Feb 28, 2021

It does not make sense to have either #175 (this issue) or #147 as a blocker for the infrastructure freeze week, so I'll remove the milestone here.

lemoer removed this from the wireguard infrastructure freeze milestone on Feb 28, 2021
@AiyionPrime
Member Author

I think this is the same issue as #147.

I don't remember exactly why, but we came to the conclusion it wasn't;
maybe @1977er remembers this better,
but I think it was due to some fixes applied on sn09, which did not correlate to resolving this issue.

@lemoer
Contributor

lemoer commented Apr 16, 2023

Is this still an issue?

@AiyionPrime
Member Author

We still have both WireGuard and fastd nodes and have not yet resolved the issue.

@lemoer
Contributor

lemoer commented Apr 17, 2023 via email
