Improve DNS #129
Comments
Yes, DNS has been a weak point since the beginning. I'll have to study the implications a bit more, but what you're suggesting makes sense. Let's see if we can get this properly done and tested before September.
I started an effort to set up a caching stub resolver on all the nodes (using systemd-resolved) at https://github.com/jabl/ansible-role-systemd-resolved. Unfortunately, it turns out that systemd-resolved v219 in EL 7.2 doesn't resolve short hostnames correctly. As far as I've been able to determine, newer versions should have at least some improvements here, and I guess EL 7.3 will rebase systemd to a newer version. So if nothing else, one could wait a few more months until EL 7.3 is out (the beta was recently released) and check again.
Added for review - are these problems gone now?
So far we haven't had any complaints, so I suppose the /etc/hosts change fixed the cluster-internal name resolution woes. That being said, to robustly resolve external names, something like the original suggestion above is probably still needed. Of course, that's not nearly as critical as, say, a job failing to look up the Slurm controller.
Occasionally our users hit Slurm problems like:
sbatch: error: Unable to resolve "slurmctld-host.example.org": Unknown host
sbatch: error: Unable to establish control machine address
sbatch: error: Batch job submission failed: No error
We're not 100% sure why this happens. My best guess at the moment is that the DNS server for the cluster-internal network (dnsmasq on the install host) doesn't answer fast enough, so the next entry in /etc/resolv.conf is tried. That entry is an external DNS server which doesn't know anything about the cluster-internal network, and thus we get the failure.
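To make the suspected failure mode concrete, here is a toy simulation in Python. It is only a sketch of the guess above, not a model of how glibc or Slurm actually behave: `FakeNameserver`, `resolve`, the latencies, and the address `10.0.0.1` are all made up for illustration, and no real DNS traffic is involved.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNameserver:
    """Toy stand-in for a DNS server (hypothetical, no network access)."""
    name: str
    records: dict = field(default_factory=dict)
    latency: float = 0.01  # seconds this fake server needs to answer

    def query(self, hostname, timeout):
        if self.latency > timeout:
            raise TimeoutError(f"{self.name} did not answer within {timeout}s")
        # None models an "unknown host" answer from a server that replied.
        return self.records.get(hostname)


def resolve(hostname, nameservers, timeout=5.0):
    """Try servers in resolv.conf order; move on only after a timeout."""
    for server in nameservers:
        try:
            answer = server.query(hostname, timeout)
        except TimeoutError:
            continue  # slow server: fall through to the next entry
        if answer is None:
            # A server replied but does not know the name: lookup fails here.
            raise LookupError(f'Unable to resolve "{hostname}": Unknown host')
        return answer
    raise LookupError(f'Unable to resolve "{hostname}": all servers timed out')


if __name__ == "__main__":
    host = "slurmctld-host.example.org"
    internal = FakeNameserver("dnsmasq", {host: "10.0.0.1"})  # assumed address
    external = FakeNameserver("external")  # knows nothing about the cluster

    print(resolve(host, [internal, external]))  # internal server answers

    internal.latency = 10.0  # internal server is now slower than the timeout
    try:
        resolve(host, [internal, external])
    except LookupError as err:
        print(err)
```

In this model the lookup only fails when the internal server is slow, which would match the "occasional" nature of the errors. A caching resolver on each node would change the picture because most lookups would never reach dnsmasq at all.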
One thing that might make us especially susceptible to this is that, because we use sssd, we have disabled nscd, which would normally cache DNS lookups.
I think a better DNS setup would be something like