Improve DNS #129

jabl · 2016-06-10T09:40:48Z

Occasionally our users hit Slurm errors like:

sbatch: error: Unable to resolve "slurmctld-host.example.org": Unknown host
sbatch: error: Unable to establish control machine address
sbatch: error: Batch job submission failed: No error

We're not 100% sure why this happens. My best guess at the moment is that the DNS server for the cluster internal net (dnsmasq on the install host) doesn't answer fast enough, so the resolver tries the next entry in /etc/resolv.conf. That entry is an external DNS server that knows nothing about the cluster internal net, and thus the lookup fails.

One thing that might make us especially susceptible to this is that, because we use sssd, we have disabled nscd, which would normally cache DNS lookups.
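To test the guess above, one could repeatedly resolve the controller name from an affected node and record how often and how slowly lookups fail. The following is a minimal sketch, not something from our setup: the function names `probe_lookup` and `summarize` are made up for illustration, and the `resolve` parameter exists only so the logic can be exercised without a network.

```python
import socket
import time


def probe_lookup(hostname, attempts=5, resolve=socket.getaddrinfo):
    """Resolve hostname repeatedly; return a list of (ok, seconds, detail)."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            # Goes through the normal system resolver path (nsswitch, resolv.conf).
            resolve(hostname, None)
            ok, detail = True, ""
        except socket.gaierror as exc:
            # This is the class of error that sbatch reports as "Unknown host".
            ok, detail = False, str(exc)
        results.append((ok, time.monotonic() - start, detail))
    return results


def summarize(results):
    """Count failures and report the slowest single lookup in seconds."""
    failures = sum(1 for ok, _, _ in results if not ok)
    slowest = max((secs for _, secs, _ in results), default=0.0)
    return {"attempts": len(results), "failures": failures, "slowest_s": slowest}
```

Running `summarize(probe_lookup("slurmctld-host.example.org", attempts=100))` on a compute node during a busy period should show whether failures coincide with slow answers, which is what the fallback-to-external-server theory would predict.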

I think a better DNS setup would be something like this:

- We should run a backup DNS server for the internal net, in case the install node is down or doesn't answer fast enough. I'd guess the admin node could be a good choice for this, except that at least on our system /etc/hosts on the admin node also contains the IPMI addresses, so we'd need another file for dnsmasq to read the hosts from.
- The DNS servers should be configured to recurse to the external DNS servers for any records they are not authoritative for.
- All other nodes, which aren't DNS servers, should run a local DNS caching resolver. dnsmasq or unbound seem to be the best choices here. Consensus on the Internet (TM) seems to be that nscd DNS caching is crap and should not be used. So on these nodes /etc/resolv.conf should list 127.0.0.1 as the only nameserver.
- The local DNS caches should forward to the authoritative DNS servers for the internal net, and never directly to the outside DNS servers.
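For the separate hosts file mentioned in the first point, one option is to generate it from /etc/hosts with the IPMI names filtered out, and point dnsmasq at it with its `no-hosts` and `addn-hosts` options. Below is a rough sketch under one loud assumption: that IPMI names can be recognized by a `-ipmi` suffix. That naming convention and the helper names are hypothetical; substitute whatever distinguishes IPMI entries on your system.

```python
def looks_like_ipmi(name):
    """Hypothetical convention: IPMI hostnames end in '-ipmi'."""
    return name.endswith("-ipmi")


def filter_hosts(lines, is_excluded=looks_like_ipmi):
    """Return hosts-file lines with excluded names removed.

    Comments and blank lines are dropped; an address whose names are
    all excluded is dropped entirely.
    """
    kept = []
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank, comment-only, or malformed line
        address, names = fields[0], fields[1:]
        names = [n for n in names if not is_excluded(n)]
        if names:
            kept.append(address + " " + " ".join(names))
    return kept
```

The output could be written to a file of its own and regenerated whenever /etc/hosts changes, so the backup DNS server never hands out IPMI addresses.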

A1ve5 · 2016-06-10T12:01:18Z

Yes, DNS has been a weak point since the beginning. I'll have to study the implications a bit more, but what you're suggesting makes sense. Let's see if we can get this properly done and tested before September.

jabl · 2016-08-31T12:00:18Z

I started an effort to set up a caching stub resolver on all the nodes at https://github.com/jabl/ansible-role-systemd-resolved (using systemd-resolved). Unfortunately it turns out that systemd-resolved v219 in EL 7.2 doesn't resolve short hostnames correctly. However, as far as I've been able to determine, newer versions should have at least some improvements here, and I guess EL 7.3 will rebase systemd to a newer version. So if nothing else, one could wait a few more months until EL 7.3 is out (the beta was recently released) and check again.
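A quick way to re-check the short hostname behaviour after an upgrade is to compare what a short name and its fully qualified form resolve to. This is only a sketch with made-up helper names. Note that `socket.getaddrinfo` follows the node's nsswitch configuration, so an /etc/hosts entry can mask a resolver problem.

```python
import socket


def resolved_addresses(name, resolve=socket.getaddrinfo):
    """Return the set of addresses for name, or an empty set on failure."""
    try:
        # Each result is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string for both IPv4 and IPv6.
        return {info[4][0] for info in resolve(name, None)}
    except socket.gaierror:
        return set()


def short_name_matches_fqdn(short, domain, resolve=socket.getaddrinfo):
    """True if the short name resolves, and to the same addresses as the FQDN."""
    short_addrs = resolved_addresses(short, resolve)
    fqdn_addrs = resolved_addresses(f"{short}.{domain}", resolve)
    return bool(short_addrs) and short_addrs == fqdn_addrs
```

For example, `short_name_matches_fqdn("slurmctld-host", "example.org")` would be expected to return False on a node where short names fail to resolve.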

martbhell · 2016-09-21T06:20:54Z

Added for review - are these problems gone now?

jabl · 2016-09-21T06:36:33Z

So far we haven't got any complaints, so I suppose the /etc/hosts thing fixed the cluster internal name resolution woes. That being said, to robustly resolve external names something like the original suggestion above is probably still needed. Of course, that's not nearly as critical as, say, a job failing to look up the slurm controller.

A1ve5 added the ready label Jun 10, 2016

martbhell added the enhancement label Jun 27, 2016

jabl mentioned this issue Sep 6, 2016

Add hosts-int role #151

Merged

martbhell added review and removed ready labels Sep 21, 2016

martbhell added ready and removed review labels Sep 21, 2016
