Improve DNS lookup behaviour #2277

Closed
flub opened this issue May 10, 2024 · 1 comment · Fixed by #2313

@flub
Contributor

flub commented May 10, 2024

IIUC the hickory DNS configuration we use ends up doing DNS requests over UDP with a 5-second timeout and two attempts, i.e. one retry. These attempts are sequential, so the full timeout is 10 seconds.

If the first DNS request is lost, this is essentially fatal for us: netcheck as a whole has a timeout of 5s, so it will not be able to resolve a relay URL. The relay client has no specific timeout, I think, so it might be more successful in connecting (e.g. when connecting to a relay extracted from a NodeAddr).
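
To make the arithmetic above concrete, here is a minimal sketch of the timing for the current sequential behaviour. The constants come from the description above; the function name and the round-trip time parameter are hypothetical and only serve the illustration.

```python
# Values taken from the issue description; names are illustrative only.
QUERY_TIMEOUT_S = 5.0    # per-query UDP timeout
ATTEMPTS = 2             # one initial query plus one retry
NETCHECK_BUDGET_S = 5.0  # overall netcheck timeout


def sequential_answer_time(lost_queries, rtt_s):
    """Seconds until an answer arrives when the first `lost_queries`
    queries get no reply and attempts run one after another.

    Returns None when every attempt is lost.
    """
    if lost_queries >= ATTEMPTS:
        return None
    # Each lost query burns a full timeout before the next attempt starts.
    return lost_queries * QUERY_TIMEOUT_S + rtt_s


def fits_netcheck(lost_queries, rtt_s):
    """True if the answer would arrive within the netcheck budget."""
    answer = sequential_answer_time(lost_queries, rtt_s)
    return answer is not None and answer <= NETCHECK_BUDGET_S
```

With a single lost query the retry only starts after the full 5s timeout, so any answer arrives after netcheck has already given up, regardless of how fast the server is.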

Given the unreliability of UDP, we should probably adopt for DNS a strategy similar to how netcheck does its probes: perform multiple requests in parallel, but stagger their start times. The aim is that most lookups never send the second query, but if things are slow or a request is lost, backup requests are already in flight well before the 5 seconds are up.

Some rather ad-hoc testing gave lookup times in the range of 50ms to 200ms on various public DNS servers, which leads me to suggest the following strategy:

  • 1st request at T+0ms
  • 2nd request at T+200ms
  • 3rd request at T+300ms

Each request gets a 3s timeout.
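
The schedule above could be sketched roughly as follows. This is a language-agnostic illustration written with Python's asyncio, not the iroh or hickory implementation: `staggered_lookup` and `simulated_lookup` are hypothetical names, and `lookup` stands in for whatever resolver call is actually used. Only the delays (0ms, 200ms, 300ms) and the 3s timeout are taken from the proposal.

```python
import asyncio

# Start offsets (seconds) and per-attempt timeout from the proposal above.
STAGGER_DELAYS = (0.0, 0.2, 0.3)
ATTEMPT_TIMEOUT = 3.0


async def staggered_lookup(lookup, host, delays=STAGGER_DELAYS,
                           timeout=ATTEMPT_TIMEOUT):
    """Race several attempts of `lookup(host)`, each started after its delay.

    Returns the first successful result and cancels the remaining attempts.
    If every attempt fails or times out, re-raises the last error seen.
    """
    if not delays:
        raise ValueError("at least one attempt is required")

    async def attempt(delay):
        await asyncio.sleep(delay)  # stagger the start of this attempt
        return await asyncio.wait_for(lookup(host), timeout)

    tasks = [asyncio.ensure_future(attempt(d)) for d in delays]
    last_error = None
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished  # first success wins
            except Exception as err:   # timeout or resolver error
                last_error = err       # keep waiting for the other attempts
        raise last_error
    finally:
        for task in tasks:
            task.cancel()  # no-op for attempts that already finished


async def _demo():
    attempts = 0

    async def simulated_lookup(host):
        # Hypothetical resolver: the first query is "lost" and never answers.
        nonlocal attempts
        attempts += 1
        if attempts == 1:
            await asyncio.sleep(60)
        await asyncio.sleep(0.05)
        return "192.0.2.1"

    print(await staggered_lookup(simulated_lookup, "relay.example"))


if __name__ == "__main__":
    asyncio.run(_demo())
```

In the demo the first query never answers, so the result comes from the attempt started at T+200ms, and the third attempt is cancelled before it sends anything. When the first query answers within 200ms, which the ad-hoc measurements suggest is the common case, no extra query is sent at all.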

This may be a solution to #2086 as well.

@divagant-martian
Contributor

Just a heads up: as far as I can tell, hickory does not allow us to specify individual name servers per query. This means that the most direct way to stagger concurrent requests to the same server is to stagger calls to hickory's lookup (or similar) functions.

What does this mean?
This means that if we have multiple servers, then after shuffling, staggering the calls won't necessarily mean that the UDP queries themselves follow the staggering strategy to the letter. I looked into hickory-proto to use a more direct approach, but not only would this be a far bigger effort, we would also lose certain aspects hickory already handles for us, like backed-off retries, caching, etc.

If going with the imperfect, least complex solution seems agreeable, I think it should be simple to get done quickly.

@divagant-martian divagant-martian linked a pull request May 20, 2024 that will close this issue
4 tasks
@ramfox ramfox added this to the v0.17.0 milestone May 22, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in iroh May 22, 2024