new check: check_reboot_slurm #6
base: dev
Conversation
Set a Slurm node to drained with reason=reboot, and NHC will (see the sketch below):
- reboot the node when it is drained
- set it to idle when it's back online
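A minimal sketch of how such a check could work, assuming it runs as the last check in nhc.conf and uses standard scontrol commands. The function body, the reason string, the 600-second uptime threshold, and the parsing are illustrative guesses, not the submitted patch:

```bash
# Hypothetical sketch of the idea (NOT the actual patch). Assumes the
# drain reason is exactly "reboot" and that a low uptime means the
# requested reboot has already happened.

function check_reboot_slurm() {
    local NODE_INFO STATE REASON UPTIME
    NODE_INFO=$(scontrol show node "$HOSTNAME") || return 0
    STATE=$(echo "$NODE_INFO" | grep -o 'State=[^ ]*' | cut -d= -f2)
    REASON=$(echo "$NODE_INFO" | grep -o 'Reason=[^ ]*' | cut -d= -f2)
    read -r UPTIME _ < /proc/uptime

    if [[ "$STATE" == *DRAIN* && "$REASON" == "reboot" ]]; then
        if (( ${UPTIME%.*} < 600 )); then
            # Freshly booted, and every earlier NHC check has already
            # passed: return the node to service.
            scontrol update NodeName="$HOSTNAME" State=RESUME
        else
            # Long uptime: the requested reboot hasn't happened yet.
            /sbin/shutdown -r now "NHC: rebooting drained node"
        fi
    fi
    return 0
}
```

Placed last in nhc.conf (e.g. `* || check_reboot_slurm`), the resume path is only reached after all other health checks succeed, which is the benefit discussed below.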
Sounds useful, but unless I'm missing something, which is very possible, you can do pretty much the same thing natively in Slurm, with:
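(The comment is cut off here. It most likely refers to Slurm's built-in reboot support, which pairs a RebootProgram setting in slurm.conf with scontrol reboot — that is my reading, not text from the original comment:)

```bash
# slurm.conf: the command slurmd runs when a reboot is requested
RebootProgram=/usr/sbin/reboot

# Reboot node01 as soon as it is idle, and return it to service
# automatically once it re-registers with the controller:
scontrol reboot ASAP nextstate=RESUME reason="maintenance" node01
```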
Nice @kcgthb, I didn't know about that. One benefit of letting NHC do this is that the node must pass all the health checks before it's put back online. On the flip side, using NHC adds a bit of delay while waiting for nhc to run.
As I mentioned yesterday during my talk at the HPCAC Stanford Conference, I have had an item for "rolling reboots" on my "TODO list" for some time now, having first discussed it with someone from Compute Canada at MoabCon back in 2012 or so. I don't want to limit it to use with SLURM, so clearly it follows that the name would have to change, but I definitely want to implement something like this. I'll provide some additional feedback once I have a chance to really dig into your patch! Thanks as always for the submission! :-)
Due to the delay caused by my changing jobs, I'm having to bump the merging of new checks to the 1.4.4 release to ensure that they have ample baking time without delaying the 1.4.3 release any further. Hope you understand! This is still very much planned for merging!
Check chrony
Replace check on /var/spool with /local for Slurm 23 with job_container/tmpfs
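A guess at what that change looks like in nhc.conf, using the stock check_fs_mount_rw built-in; the exact flags and paths in the actual change may differ:

```bash
# Before: job temporary space lived under /var/spool
#* || check_fs_mount_rw -f /var/spool

# After: with Slurm 23's job_container/tmpfs, per-job storage is
# backed by /local, so that is the mount that must be writable
* || check_fs_mount_rw -f /local
```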
Hi!
Here's another custom script we've been using for a while; it's been working quite nicely.
Tested with:
What it does:
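(The "Tested with" and "What it does" lists are cut off above.) Purely as an illustration of what a chrony check of this kind might look like — the function name and logic are guesses, not the submitted script — built on the real `chronyc tracking` command and NHC's `die` helper:

```bash
function check_chrony() {
    local TRACKING
    TRACKING=$(chronyc tracking 2>/dev/null) \
        || die 1 "check_chrony:  chronyc failed (is chronyd running?)"

    # "Leap status : Normal" means chrony considers the clock synced.
    if ! echo "$TRACKING" | grep -q 'Leap status *: *Normal'; then
        die 1 "check_chrony:  clock not synchronized"
    fi
    return 0
}
```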