Allow NHC to work with recent changes to Slurm reboot #84
Add `boot` node state to the node online and offline scripts. Properly handle `scontrol reboot asap` so Slurm doesn't erroneously online the node after the first NHC call after a reboot. Make sure the node is always offlined after reboot until NHC onlines it, regardless of whether `scontrol reboot` or `scontrol reboot asap` ran. Prevent onlining a node while there is a pending reboot. For context, see mej#81 and https://bugs.schedmd.com/show_bug.cgi?id=6391
I've pushed our internal version to my fork: https://github.com/mej/nhc/pull/65/files. It's a completely different helper, and it's activated by setting |
So I see two separate and distinct steps to address the Slurm
Since the current contents of the nhc/dev branch have been thoroughly tested and stable, I don't plan to make very many changes to that code prior to release. So if there's a way to address item 1 (make NHC behave sanely with respect to the I will freely admit that, at this point, I have not ever used the Your thoughts? |
@mej I am fine with the timeline you proposed. This patch was just to get things working with minimal changes to NHC. It's possible there is a better way to architect it. |
@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom Example:
We use I also noticed that if I set
I am on SLURM 20.11.7 and still testing this patch, but so far it is a very welcome addition to NHC: we use SLURM for rolling reboots and are having issues with jobs starting after a reboot before GPFS is mounted.
Good to hear! This patch was originally written to target 18.08 (I believe), and we've been continually tweaking the Slurm boot logic since, so I am pleasantly surprised that it is still useful. I'm guessing that it needs to be updated a bit to solve the problem you are mentioning (though it's possible it was always flawed). Unfortunately, I don't have the go-ahead to work on this further, so I'll have to leave further development to someone else. Note that 21.08 is also making some changes in node states and how they are printed out, which may affect this as well. |
# because $STATUS would show MIX@ or ALLOC@, not BOOT.
# See src/common/slurm_protocol_defs.c-->node_state_string()
SHOW_NODE_OUTPUT="$($SLURM_SCONTROL show node $HOSTNAME)"
if [[ $SHOW_NODE_OUTPUT == *"State=REBOOT "* ]]; then
I had to change this to "State=REBOOT" without the trailing space. While a node was booting, this is the state I saw: State=REBOOT*+DRAIN.
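To illustrate why the trailing space matters, here is a minimal sketch of the match described in this review comment. The function name `node_is_rebooting` is hypothetical and not part of NHC, and the sample strings other than `State=REBOOT*+DRAIN` are made up for illustration. It takes the text of `scontrol show node` as an argument so it can be tried without a Slurm installation.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of NHC): succeeds when the given
# `scontrol show node` output has a State field that starts with REBOOT.
# No trailing space is required after REBOOT, so flag suffixes such as
# "*+DRAIN" still match.
node_is_rebooting() {
    local show_node_output="$1"
    [[ $show_node_output == *"State=REBOOT"* ]]
}

# Example with the state string reported above; surrounding fields are assumed.
if node_is_rebooting "NodeName=n01 State=REBOOT*+DRAIN ThreadsPerCore=1"; then
    echo "node is rebooting"
fi
```

A pattern with the trailing space, `*"State=REBOOT "*`, would not match this sample because `*+DRAIN` follows `REBOOT` directly.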
Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:
The default of |
In case anyone else comes across this, another thing we had to change:
--- node-mark-offline.20210824 2021-07-30 10:45:41.518514000 -0400
+++ node-mark-offline 2021-08-24 16:17:17.380167000 -0400
@@ -73,6 +73,10 @@
echo "$0: Not offlining $HOSTNAME: Already offline with no note set."
exit 0
fi
+ if [[ "$OLD_NOTE_LEADER" == "Reboot" && "$OLD_NOTE" == "ASAP" ]]; then
+ echo "$0: Not offlining $HOSTNAME: Pending reboot."
+ exit 0
+ fi
;;
boot*)
# Offline node after reboot if vanilla `scontrol reboot` was
This avoids NHC clearing the reboot state of a node that is pending reboot if that node has issues while waiting for the reboot. I plan to open a new pull request based on this one once we've had a bit more time to thoroughly test things out.
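The guard added in the diff above can be tried in isolation with a small sketch like the following. The function name `has_pending_reboot_asap` is hypothetical, and splitting the reason string into a first word (the "leader") and the remainder is an assumption about how `OLD_NOTE_LEADER` and `OLD_NOTE` are derived, not a copy of the NHC code.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of NHC): succeeds when a node's reason
# string is exactly "Reboot ASAP", the condition the patch treats as a
# pending reboot that must not be overwritten.
has_pending_reboot_asap() {
    local reason="$1" note_leader note
    # Assumption: the leader is the first word and the note is the rest.
    read -r note_leader note <<< "$reason"
    [[ "$note_leader" == "Reboot" && "$note" == "ASAP" ]]
}

if has_pending_reboot_asap "Reboot ASAP"; then
    echo "Not offlining: pending reboot."
fi
```

Any other reason string, including an empty one, falls through to the normal offlining path.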