Improve DNS #129

Open
jabl opened this issue Jun 10, 2016 · 4 comments

Comments

@jabl
Contributor

jabl commented Jun 10, 2016

Occasionally our users hit Slurm errors like:

sbatch: error: Unable to resolve "slurmctld-host.example.org": Unknown host
sbatch: error: Unable to establish control machine address
sbatch: error: Batch job submission failed: No error

We're not 100% sure why this happens. My best guess at the moment is that the DNS server for the cluster-internal net (dnsmasq on the install host) doesn't answer fast enough, so the resolver tries the next entry in /etc/resolv.conf. That entry is an external DNS server which doesn't know anything about the cluster-internal net, and thus we get the failure.
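The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis above, not the real glibc resolver: the `Nameserver` class, the `resolve` function, and the IP addresses in the usage note are all made up for illustration. It assumes that a timeout makes the resolver move on to the next `nameserver` entry, while a "no such name" answer is accepted as final.

```python
from dataclasses import dataclass


@dataclass
class Nameserver:
    """Toy stand-in for one `nameserver` entry in /etc/resolv.conf."""
    address: str           # hypothetical address, for illustration only
    records: dict          # name -> IP address this server can answer for
    responds: bool = True  # False models a server that answers too slowly


def resolve(name, nameservers):
    """Walk the nameserver list in order, as a stub resolver would.

    A timeout moves on to the next entry. Any answer, including
    "no such name", is treated as final (assumption stated above).
    """
    for ns in nameservers:
        if not ns.responds:
            continue  # timed out: fall through to the next entry
        # None models "Unknown host" from a server that lacks the record.
        return ns.records.get(name)
    return None  # every server timed out
```

With an internal server that knows `slurmctld-host.example.org` listed first and an external server with no such record listed second, `resolve` returns the internal address as long as the first server responds. Once the first server is marked as not responding, the lookup falls through to the external server and comes back empty, which matches the "Unknown host" error.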

One thing that might make us especially susceptible to this is that, since we use sssd, we have disabled nscd, which would normally cache DNS lookups.

I think a better DNS setup would be something like

  • We should run a backup DNS server for the internal net, in case the install node is down or doesn't answer fast enough. I'd guess the admin node could be a good choice for this, except that at least on our system /etc/hosts on the admin node also contains the IPMI addresses, so we'd need another file for dnsmasq to read the hosts from.
  • The DNS servers should be configured to forward queries to the external DNS servers for any records they are not authoritative for.
  • All other nodes, which aren't DNS servers, should run a local caching DNS resolver. dnsmasq or unbound seem to be the best choices here; consensus on the Internet (TM) seems to be that nscd DNS caching is crap and should not be used. So on these nodes /etc/resolv.conf should contain 127.0.0.1 as the only nameserver.
  • The local DNS caches should forward queries to the authoritative DNS servers for the internal net, and never directly to the outside DNS servers.
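
To make the proposed layout concrete, here is a minimal sketch that renders the config files each role might get, assuming dnsmasq is used both on the DNS servers and as the local cache. The function names, the hosts-file path, and all IP addresses are hypothetical. The dnsmasq options used (`listen-address`, `bind-interfaces`, `no-resolv`, `no-hosts`, `addn-hosts`, `server`) are real, but this is not a tested production config.

```python
def render_resolv_conf(nameservers):
    """Render /etc/resolv.conf content, one `nameserver` line per address."""
    return "".join(f"nameserver {ip}\n" for ip in nameservers)


def render_cache_dnsmasq_conf(internal_dns):
    """dnsmasq config for an ordinary node: a local cache on 127.0.0.1
    that forwards only to the internal DNS servers."""
    lines = [
        "listen-address=127.0.0.1",
        "bind-interfaces",
        "no-resolv",  # ignore /etc/resolv.conf; use only the server= lines
    ]
    lines += [f"server={ip}" for ip in internal_dns]
    return "\n".join(lines) + "\n"


def render_server_dnsmasq_conf(hosts_file, external_dns):
    """dnsmasq config for an internal DNS server: serve cluster names from
    a dedicated hosts file and forward everything else to external DNS."""
    lines = [
        "no-hosts",                  # skip /etc/hosts (may list IPMI addresses)
        f"addn-hosts={hosts_file}",  # separate file with cluster names only
        "no-resolv",
    ]
    lines += [f"server={ip}" for ip in external_dns]
    return "\n".join(lines) + "\n"
```

For example, an ordinary node would get `render_resolv_conf(["127.0.0.1"])` plus `render_cache_dnsmasq_conf(["10.0.0.1", "10.0.0.2"])`, where the two addresses stand for the install node and the backup server. The separate `addn-hosts` file is one way to handle the admin-node problem from the first bullet.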
@A1ve5
Contributor

A1ve5 commented Jun 10, 2016

Yes, DNS has been a weak point since the beginning. I'll have to study the implications a bit more, but what you're suggesting makes sense. Let's see if we can get this properly done and tested before September.

@jabl
Contributor Author

jabl commented Aug 31, 2016

I started an effort to set up a caching stub resolver on all the nodes at https://github.com/jabl/ansible-role-systemd-resolved (using systemd-resolved). Unfortunately, it turns out that systemd-resolved v219 in EL 7.2 doesn't resolve short hostnames correctly. However, as far as I've been able to determine, newer versions should have at least some improvements here, and I guess EL 7.3 will rebase systemd to a newer version. So if nothing else, one could wait a few more months until EL 7.3 is out (the beta was recently released) and check again.

@jabl jabl mentioned this issue Sep 6, 2016
@martbhell martbhell added review and removed ready labels Sep 21, 2016
@martbhell
Contributor

Added for review - are these problems gone now?

@jabl
Contributor Author

jabl commented Sep 21, 2016

So far we haven't had any complaints, so I suppose the /etc/hosts thing fixed the cluster-internal name resolution woes. That being said, to robustly resolve external names, something like the original suggestion above is probably still needed. Of course, that's not nearly as critical as, say, a job failing to look up the slurm controller.
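
A rough sketch of how one could check that a node's hosts file still covers the critical names, assuming the fix amounts to listing cluster-internal names in /etc/hosts on every node (the thread does not spell out the details). The helper names and the sample entries in the usage note are hypothetical.

```python
def hosts_file_names(text):
    """Return every hostname and alias found in hosts-file formatted text."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        names.update(fields[1:])  # fields[0] is the IP address
    return names


def missing_names(text, required):
    """List the required names that the hosts-file text does not contain."""
    present = hosts_file_names(text)
    return [name for name in required if name not in present]
```

Calling `missing_names(open("/etc/hosts").read(), ["slurmctld-host.example.org"])` on a node would return an empty list if the controller name is present, and the list of absent names otherwise.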

@martbhell martbhell added ready and removed review labels Sep 21, 2016
3 participants