Make sure the slurm user exists before starting slurm(d/ctld/dbd) #73

Closed
martbhell opened this issue Jan 17, 2017 · 4 comments

martbhell commented Jan 17, 2017

When rebooting a compute node many times in a row, slurmd quite often could not find the Unix user slurm at startup.

Possibly because ypbind was not completely up yet.
We could add extra systemd settings in a drop-in file for the unit, for example:

/etc/systemd/system/slurmd.service.d/myfile.conf

Not sure which settings should be used.


tiggi commented Jan 17, 2017 via email


martbhell commented Jan 17, 2017

Having slurmd require ypbind didn't help, maybe because systemd's startup parallelism is still too aggressive. Requires=remote-fs.target did not solve it either. Maybe systemd could restart slurmd once or twice, with some delay in between, if it fails to start?
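
For reference, the ordering attempt described above could look roughly like the following drop-in. This is only a sketch: the thread does not show the exact settings that were tried, the file name ypbind_order.conf is hypothetical, and the After= lines are an assumption (the comment only mentions Requires=).

```ini
# /etc/systemd/system/slurmd.service.d/ypbind_order.conf (hypothetical file name)
[Unit]
# Pull ypbind in as a dependency and order slurmd after it.
Requires=ypbind.service
After=ypbind.service
# The remote-fs.target variant mentioned above; After= is an assumption.
Requires=remote-fs.target
After=remote-fs.target
```

After= only delays slurmd until systemd considers ypbind.service started. That may be earlier than the point at which lookups for the slurm user actually succeed, which would be consistent with this not helping.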

For LDAP - created fgci-org/fgci-ansible#176

martbhell changed the title from "Make sure slurm(d/ctld/dbd) starts after ypbind" to "Make sure the slurm user exists before starting slurm(d/ctld/dbd)" on Jan 17, 2017
@martbhell

Adding Restart=on-failure and increasing the restart interval in /etc/systemd/system/slurmd.service.d/slurmd_extra.conf helps in some cases.

For services that have Restart= configured, the default is to allow up to five start attempts within 10 seconds, with only a 100 ms delay before each restart.

From /etc/systemd/system.conf on CentOS 7.3:

#DefaultRestartSec=100ms
#DefaultStartLimitInterval=10s
#DefaultStartLimitBurst=5

The drop-in in /etc/systemd/system/slurmd.service.d/slurmd_extra.conf:

[Service]
Restart=on-failure
RestartSec=20
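
To show how the restart delay interacts with the start limit, here is a commented variant of that drop-in. It is a sketch that assumes the systemd version shipped with CentOS 7, where StartLimitInterval= and StartLimitBurst= belong in the [Service] section. The two StartLimit values are illustrative assumptions, not settings from this issue.

```ini
# /etc/systemd/system/slurmd.service.d/slurmd_extra.conf
[Service]
# Restart slurmd when it exits with an error, is killed by a signal, or times out.
Restart=on-failure
# Wait 20 seconds before each restart instead of the default 100 ms.
RestartSec=20
# Assumed values: stop retrying after about 5 starts within 200 seconds.
StartLimitInterval=200
StartLimitBurst=5
```

With RestartSec=20 and the default limit of 5 starts per 10 seconds, starts are always at least 20 seconds apart, so the limit is never reached and systemd keeps retrying. Widening the interval as above is one way to cap the number of attempts. Run systemctl daemon-reload after changing the file so that systemd picks it up.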

martbhell added a commit that referenced this issue Jan 18, 2017
 - currently defaults to only do this on slurmd but allow to
   optionally enable it also for slurmdbd and slurmctld
 - #73
martbhell changed the title from "Make sure the slurm user exists before starting slurm(d/ctld/dbd)" to "Make sure the slurm user exists before starting slurm(d/ctld/dbd)" on Jan 18, 2017
@martbhell

Closing this now that #74 has been merged.
It is not a complete fix for "make sure the slurm user exists before starting the slurm daemons", but it should make systemd restart the daemons in some scenarios.
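
One possible way to get closer to the original goal, which was not tried in this issue, is to check for the user explicitly before the daemon starts. The sketch below is an assumption-laden example: the file name wait_for_user.conf is hypothetical, and it assumes getent is installed at /usr/bin/getent, as on CentOS 7.

```ini
# /etc/systemd/system/slurmd.service.d/wait_for_user.conf (hypothetical file name)
[Service]
# getent exits non-zero if the slurm user cannot be resolved,
# which makes this start attempt fail before slurmd itself is launched.
ExecStartPre=/usr/bin/getent passwd slurm
# Retry after a delay, as in the merged change.
Restart=on-failure
RestartSec=20
```

A failing ExecStartPre= command counts as a service failure, so Restart=on-failure applies to it as well and systemd tries again after RestartSec.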
