Enhancing iroh's Hole Punching Success Rate #2317
Comments
Thanks for the report! I love an opportunity to improve our holepunching from real-world situations.
I'm not sure I fully understand the network layout yet. Are C1 & C2 on the same local network or different ones? What works: What doesn't work: Is that correct?
This part is an unfortunate mistake in the current release. It should never try to connect to the mainline DHT and this is fixed for the next release. So this error is entirely harmless and does not affect any functionality.
Could you make the test code available? I would also really appreciate it if you could provide us with the full DEBUG-level logs of both nodes. If you prefer to share those more privately, you could email them or use whatever else you're comfortable with. It would be great if we could work together to improve the holepunching for your situation!
@flub Your grasp of the network topology is spot on; C1 and C2 are on separate networks, each with its own dynamic public IP. I have not conducted tests for the scenario I have attached the code and logs to this message, where I have replaced some real IPs and domain names for confidentiality. Additionally, I've included an article on NAT hole punching that I recently encountered, and I'm uncertain whether it will aid in enhancing the efficiency of hole punching. code and log.zip |
For the C1 -> P1 -> NAT -> C2 scenario, I've looked into Tailscale's direct connections, and indeed, they use UDP. My case is unique because the C2 router has UPnP enabled and a public IP. I think the simplest method might be for C2 to open a TCP port through UPnP, allowing C1 to initiate a direct TCP connection. Regarding the inability of two mobile phones, P1 and P2, to connect directly: in addition to using a server relay, openp2p-cn can use a client with a public IP and a router that supports automatic port mapping to act as a relay (with the client user's consent). This approach can help overcome the high cost of central server resources and the significant latency caused by being far from the central server (I'm in China, far away from your servers, haha. For self-hosting, using a home broadband-connected client as a relay server can also save a significant amount of cost). However, this project is in a semi-open state; the client is continuously being updated, but the source code has not been updated further. I'm not promoting this project; I just think this approach is quite good for self-hosting.
I spent 4 hours retesting because I found that I have two phones, a Xiaomi and an iPhone. When I use the iPhone to share Wi-Fi, iroh doctor can establish a direct connection every time. However, using the Xiaomi phone depends on luck; sometimes it works, sometimes it doesn't. During each test, I might switch networks, connect to Tailscale for remote access, and then restart the server (if Tailscale is turned on, iroh will use Tailscale's direct connection, so I turn it off during testing). I'm not sure what specific operation suddenly enables a direct connection, but once a direct connection is established, it can be repeated every time (I continue to test several times after a successful direct connection). I switched phones many times and the behaviour remained tied to the phone. The test code I previously uploaded often crashed on its own. So today, I used iroh-net/examples/listen.rs and connect.rs for testing, and the results matched those of the
There has been new progress in the testing, and the situation is somewhat complex. The test involves 2 routers, 3 computers, and 2 phones. In the previous tests, UPnP was always turned on, but I found that there seem to be some issues with the UPnP port mapping on the router I have been consistently testing with. I'm not sure if it's a router issue or a problem with The conclusion is: UPnP turned on (mapping works as expected): With UPnP turned off (both the Router and Phone have public IPv6, and both C1 and C2 have IPv6 enabled):
The retesting has once again produced different results. In the morning, with UPnP turned off and no IPv6, several attempts to establish a connection through mobile hotspot sharing failed. In the afternoon, even with UPnP still disabled and no IPv6, the results mirrored those from a few days prior; the iPhone hotspot was capable of establishing a connection, while the Xiaomi phone was not. I'm not sure what the issue is; the testing has been paused for now without any clear understanding or identifiable pattern. |
Apologies, I still haven't been able to find the time to dig into your reports here. With some luck I might next week. In theory iroh should be fine with or without UPnP working, but it is true that most places don't have UPnP working, so our UPnP handling probably still has more bugs.
Although the test results have shown some inconsistencies, sharing them might still provide some reference (though it could also lead to confusion). If subsequent optimization of this issue requires testing, I can help.
Indeed, it should be so. |
Our approach has been consciously to not do this automatically on all clients, or integrate it otherwise into a normal client. Instead we decided to let users who have the ability and want to run a relay do this explicitly by running their own relay server. Everyone can run a relay server, and use it together with other relay servers. You need to add it to your Relay Map in the client configuration. The relay servers do not need to be aware of each other. We also publish the Maybe this creates slightly more friction to running a relay server, but it is an important component to the connectivity, and letting any client participate as a relay would not result in the desired reliability and uptime for our goals. So we feel this extra friction is worth it. |
The holepunching system we use, based on DERP (and I believe also ICE), employs a coordination server so that both clients send traffic at the same time, with the help of some information gained by STUN. This takes away the need to guess the ports and addresses used by the NATs. It deals with symmetric NATs reliably. We should write down how we do this ourselves sometime, but it is still getting tweaks so it might be a bit early. In the meantime https://tailscale.com/blog/how-nat-traversal-works is probably the best description and includes a good overview of NATs in today's world.
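To make the "both sides send at the same time" idea concrete, here is a minimal sketch using plain UDP sockets from the Rust standard library. It is not iroh's implementation: the function name `punch` and its parameters are hypothetical, and it assumes the peer's public address has already been learned through some coordination channel (which is the job of the relay and STUN in the description above).

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::time::Duration;

/// Hypothetical helper: send `probes` small packets to `peer`, then wait
/// briefly for one packet back. Returns Ok(true) if a packet arrived from
/// exactly that peer address.
///
/// The outgoing probes are what open a mapping in the local NAT; if the
/// peer does the same at roughly the same moment, each side's probes can
/// pass through the mapping the other side just created.
fn punch(socket: &UdpSocket, peer: SocketAddr, probes: usize) -> io::Result<bool> {
    socket.set_read_timeout(Some(Duration::from_millis(500)))?;
    for _ in 0..probes {
        socket.send_to(b"ping", peer)?;
    }
    let mut buf = [0u8; 16];
    match socket.recv_from(&mut buf) {
        Ok((_, from)) => Ok(from == peer),
        // A timeout is reported as WouldBlock or TimedOut depending on the platform.
        Err(e)
            if e.kind() == io::ErrorKind::WouldBlock
                || e.kind() == io::ErrorKind::TimedOut =>
        {
            Ok(false)
        }
        Err(e) => Err(e),
    }
}
```

Each side would call `punch` with the other side's address as soon as the coordination channel tells it to start. Because UDP gives no delivery guarantee, a single lost probe in either direction can make an attempt fail, which is why sending several probes per attempt is a common choice.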
I've finally looked through your logs properly. Apologies for taking so long to get back to this. I found one bug because of some weird thing in your logs, but it won't fix your issue. Otherwise the logs look fine: there are coordinated holepunching attempts, but nothing makes it through. One explanation could be that something is filtering your network. However, another option could be that your network is a bit lossy. You do mention mobile phones and that it changes rather randomly, maybe even with the time of day. If it starts being less reliable at busy times, it might be because some packets are dropped. So this issue made me realise (again, probably) that our holepunching is rather vulnerable to packet loss. This is certainly something we should figure out how to improve on.
I created #2481 to track the packet-loss problem during holepunching. |
Perhaps your guess is correct; it might indeed be a network issue on my end. I have been consistently able to successfully establish a connection when one end of the network is a mobile hotspot for the past week. |
A client functioning as a relay differs from a standard relay service; it involves more effort and may not guarantee stability (e.g., when the client is shut down). This might be best implemented by users themselves according to their specific requirements. |
To be clear, packet loss doesn't mean there's an issue with your network. It is entirely normal for networks to lose packets, especially when there's congestion. Since iroh continues to try holepunching every 5 seconds it would be interesting to see if it eventually succeeds, maybe after a long time. But even so, I don't expect this to be guaranteed to work eventually. |
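As a rough illustration of why repeated attempts help but do not guarantee success, here is a back-of-the-envelope model. It assumes every packet is lost independently with the same probability and that one attempt needs a fixed number of packets to arrive. Both are simplifying assumptions, not a description of iroh's protocol, and the function names are hypothetical.

```rust
/// Probability that one holepunching attempt succeeds, assuming it needs
/// `packets_needed` packets to arrive and each packet is lost independently
/// with probability `loss` (0.0..=1.0). This is a simplified model.
fn attempt_success(loss: f64, packets_needed: u32) -> f64 {
    (1.0 - loss).powi(packets_needed as i32)
}

/// Probability that at least one of `attempts` independent attempts succeeds.
fn success_within(loss: f64, packets_needed: u32, attempts: u32) -> f64 {
    let p = attempt_success(loss, packets_needed);
    1.0 - (1.0 - p).powi(attempts as i32)
}
```

With one attempt every 5 seconds, `attempts` is about 12 per minute, so `success_within(loss, n, 12)` approximates the chance of a direct connection within the first minute under this model. Real loss is often bursty and correlated with congestion, and filtering is not loss at all, so actual behaviour can be considerably worse than the model suggests.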
In #2480, I mentioned that the server has multiple IP addresses, and I'm not entirely sure if the slow hole punching is related to address selection. My code prints the current ConnectionType after each data transmission. From the observations, for the slow hole punching scenarios, the connection type goes through |
I conducted a small experiment where I modified the following code in
let typ = match (best_addr, relay_url.clone()) {
(Some(best_addr), Some(relay_url)) => ConnectionType::Mixed(best_addr, relay_url),
(Some(best_addr), None) => ConnectionType::Direct(best_addr),
(None, Some(relay_url)) => ConnectionType::Relay(relay_url),
(None, None) => ConnectionType::None,
};
let typ = match typ {
ConnectionType::Mixed(addr, relay_url) => ConnectionType::Mixed(best_addr.unwrap(), relay_url),
_ => typ,
};
if self.conn_type.update(typ).is_ok() {
let typ = self.conn_type.get();
info!(%typ, "new connection type");
}
(best_addr, relay_url)
to:
let typ = match (best_addr, relay_url.clone()) {
(Some(best_addr), Some(relay_url)) => ConnectionType::Mixed(best_addr, relay_url),
(Some(best_addr), None) => ConnectionType::Direct(best_addr),
(None, Some(relay_url)) => ConnectionType::Relay(relay_url),
(None, None) => ConnectionType::None,
};
let best_addr: SocketAddr = "[2408:843f:1800:880f:8367:751d:96b6:fb3e]:35298".parse().unwrap();
let best_addr = Some(best_addr);
let typ = match typ {
ConnectionType::Mixed(addr, relay_url) => ConnectionType::Mixed(best_addr.unwrap(), relay_url),
_ => typ,
};
if self.conn_type.update(typ).is_ok() {
let typ = self.conn_type.get();
info!(%typ, "new connection type");
}
(best_addr, relay_url)

By forcing the specified hole punching IP, the hole punching succeeded quickly. I tested this 10 times: without forcing the IP, only one attempt was fast; with the forced IP, all hole punching attempts were quick. I speculate that if the correct IP can be selected quickly here, the hole punching process will be faster.
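The experiment can be restated as a small, self-contained sketch for readers who want to try the selection logic outside the iroh code base. The `ConnectionType` enum below is a simplified stand-in (the relay URL is a plain `String`), and `select_conn_type` with its `forced_addr` parameter is a hypothetical helper, not an iroh API.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the connection type shown in the snippets above.
#[derive(Debug, Clone, PartialEq)]
enum ConnectionType {
    Direct(SocketAddr),
    Relay(String),
    Mixed(SocketAddr, String),
    None,
}

/// Hypothetical helper: derive the connection type from the currently known
/// best address and relay URL. If `forced_addr` is set and the result would
/// be `Mixed`, the direct address is replaced by the forced one, mirroring
/// the hard-coded address in the experiment.
fn select_conn_type(
    best_addr: Option<SocketAddr>,
    relay_url: Option<String>,
    forced_addr: Option<SocketAddr>,
) -> ConnectionType {
    let typ = match (best_addr, relay_url) {
        (Some(addr), Some(url)) => ConnectionType::Mixed(addr, url),
        (Some(addr), None) => ConnectionType::Direct(addr),
        (None, Some(url)) => ConnectionType::Relay(url),
        (None, None) => ConnectionType::None,
    };
    match (typ, forced_addr) {
        (ConnectionType::Mixed(_, url), Some(forced)) => ConnectionType::Mixed(forced, url),
        (typ, _) => typ,
    }
}
```

Passing `None` as `forced_addr` reproduces the unmodified behaviour, so the same function can be used to compare both variants in a test harness.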
The only useful part is the two lines of code above, which provide the correct IP and port from the beginning. By the way, my two computers haven't been able to successfully punch through all day today; they are behind routers with different public IPs. |
Checking this log file (from #2480) I don't see anything wrong again and mostly suspect this is due to packet loss. It's good to know you have so much trouble with this, would be good to have an idea of how widespread this is. |
Apologies, it seems that the issue with the computers behind the two routers failing to punch through consistently yesterday was due to my relay configuration. #2490 (comment) Today, I switched to the default relay and was able to successfully punch through. However, this does not conflict with the logs I uploaded. When I uploaded the logs, I was using the correct relay. |
Hi, I have forgotten completely about tracking this issue by now. Do we need to still fix things here or can it be closed by now? |
I haven't done much testing recently, but I remember the current punching success rate is quite high. I'll close it for now, and reopen it if I notice any issues later. |
My test environment includes two computers, C1 and C2, both equipped with dynamic public IP addresses, situated behind a home router that has UPnP enabled. Additionally, there are two mobile phones, P1 and P2, using mobile networks.
When testing with Tailscale, except for the requirement for mobile phones P1 and P2 to connect through a relay, all other device-to-device connections are made directly.
When I conduct tests using iroh, direct connections are only achievable when both computers C1 and C2 are using networks with dynamic public IP addresses. If I connect either C1 or C2 to the network via a shared connection from a mobile phone, the connection must be established through a relay.
I performed the same test with Tailscale, and even if one end is connected via a shared mobile network, as long as one end is on a network with a dynamic public IP, there is only a short initial period where the connection might go through a relay before quickly switching to a direct connection.
My test code uses DnsDiscovery and connects via endpoint.connect_by_node_id. I noticed that the client reports an error every time it starts: ERROR mainline::rpc: Could not bootstrap the routing table. I implemented a loop that sends messages from the client to the server and back every second, while also printing the current connection type (conn_type). The printed information is as follows: connect type: Mixed(10.0.0.222:56394, RelayUrl("https://xxxxx.:12346/")), where the IP part alternates between a local network IP and a public IP.