Allow NHC to work with recent changes to Slurm reboot #84

hintron · 2019-04-23T21:10:02Z

Add boot node state to node online and offline script.
Properly handle scontrol reboot asap so Slurm doesn't erroneously
online the node after the first NHC call after a reboot.
Make sure node is always offlined after reboot until NHC onlines it,
regardless whether scontrol reboot or scontrol reboot asap ran.
Prevent onlining a node while there is a pending reboot.

For context, see #81 and
https://bugs.schedmd.com/show_bug.cgi?id=6391

Add `boot` node state to node online and offline script. Properly handle `scontrol reboot asap` so Slurm doesn't erroneously online the node after the first NHC call after a reboot. Make sure node is always offlined after reboot until NHC onlines it, regardless whether `scontrol reboot` or `scontrol reboot asap` ran. Prevent onlining a node while there is a pending reboot. For context, see mej#81 and https://bugs.schedmd.com/show_bug.cgi?id=6391

hintron · 2019-04-23T21:12:34Z

This was a rework of pull request #83. The main differences from #83 are:

prevent NHC from onlining nodes that are pending reboot
make sure that a vanilla scontrol reboot (i.e. no asap modifier) still keeps things in drain after reboot until NHC sees fit to online the node.

martijnkruiten · 2020-05-22T14:13:59Z

I've pushed our internal version to my fork: https://github.com/mej/nhc/pull/65/files. It's a completely different helper, and it's activated by setting export OFFLINE_NODE="$HELPERDIR/node-mark-reboot" above the tests that should trigger a reboot. You might find something useful there. I will check your version as well.

mej · 2020-08-17T02:30:17Z

So I see two separate and distinct steps to address the Slurm reboot functionality handling being missing from NHC:

Add handling to ensure that the boot state is recognized and handled intelligently in terms of onlining/offlining the nodes, but without any "integration," per se.
Provide additional functionality that integrates with Slurm's scontrol reboot feature so that all pre-, post-, and during-reboot states are not only dealt with correctly but can also facilitate unique capabilities that don't yet exist.

Since the current contents of the nhc/dev branch have been thoroughly tested and stable, I don't plan to make very many changes to that code prior to release. So if there's a way to address item 1 (make NHC behave sanely with respect to the boot state(s)) in the short term for the 1.4.3 release now, given that we would then address item 2 for the subsequent 1.4.4 release, that would be ideal.

I will freely admit that, at this point, I have not ever used the scontrol reboot feature, so I'm open to input from both of you (@hintron and @martijnkruiten) regarding how each of these is best handled. At the moment, I'm leaning toward merging in @martijnkruiten's #65 changes for 1.4.3 and then possibly taking an approach closer to the changes here in #84 for the 1.4.4 release. I think the separate node-mark-reboot script, along with the minimal changes to existing code, is the best bet for getting 1.4.3 out the door in the safest way possible.

Your thoughts?

hintron · 2020-08-17T16:41:34Z

@mej I am fine with the timeline you proposed. This patch was just to get things working with minimal changes to NHC. It's possible there is a better way to architect it.

treydock · 2021-07-28T19:04:11Z

@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom reason= to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place rather than relying on the default Reboot ASAP message getting set by SLURM.

Example:

scontrol reboot ASAP nextstate=resume reason='KEY_INFO - reboot to test NHC changes' p0002

We use KEY_INFO as a way to tell our monitoring to ignore this drained node since we generate alerts in Prometheus when nodes are drained for reasons other than KEY_INFO or NHC:.

I also noticed that if I set nextstate=RESUME when I do scontrol reboot ASAP the node still comes up unhealthy without NHC marking the node offline , I am still testing what exactly is happening but in my test environment I have a systemd unit file override to run NHC as a ExecStartPost so that NHC runs right after slurmd starts. It seems this patch assumes nextstate=DRAIN which might be fine given NHC will then be able to resume the node.

# /etc/systemd/system/slurmd.service.d/nhc.conf
# File managed by Puppet
[Service]
ExecStartPost=-/usr/sbin/nhc

I am on SLURM 20.11.7, and still testing this patch, but so far is a very welcome feature to NHC as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.

hintron · 2021-07-28T19:43:16Z

@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom reason= to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place rather than relying on the default Reboot ASAP message getting set by SLURM.

I am on SLURM 20.11.7, and still testing this patch, but so far is a very welcome feature to NHC as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.

Good to hear!

This patch was originally written to target 18.08 (I believe), and we've been continually tweaking the Slurm boot logic since, so I am pleasantly surprised that it is still useful. I'm guessing that it needs to be updated a bit to solve the problem you are mentioning (though it's possible it was always flawed).

Unfortunately, I don't have the go-ahead to work on this further, so I'll have to leave further development to someone else. Note that 21.08 is also making some changes in node states and how they are printed out, which may affect this as well.

treydock · 2021-07-29T17:39:50Z

helpers/node-mark-offline

+                    # because $STATUS would show MIX@ or ALLOC@, not BOOT.
+                    # See src/common/slurm_protocol_defs.c-->node_state_string()
+                    SHOW_NODE_OUTPUT="$($SLURM_SCONTROL show node $HOSTNAME)"
+                    if [[ $SHOW_NODE_OUTPUT == *"State=REBOOT "* ]]; then


I had to change this to "State=REBOOT" without the trailing space. While a node was booting this is the state I saw: State=REBOOT*+DRAIN.

treydock · 2021-07-29T17:42:58Z

Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:

scontrol reboot ASAP nextstate=down <nodes>

The default of nextstate=resume prevents this extra reboot logic from ever getting used. Also as noted previously, can't pass a reason to the reboot command, have to let the default reason be used.

treydock · 2021-08-24T20:22:44Z

In case anyone else comes across this, another thing we had to change:

--- node-mark-offline.20210824  2021-07-30 10:45:41.518514000 -0400
+++ node-mark-offline   2021-08-24 16:17:17.380167000 -0400
@@ -73,6 +73,10 @@
                         echo "$0:  Not offlining $HOSTNAME:  Already offline with no note set."
                         exit 0
                     fi
+                    if [[ "$OLD_NOTE_LEADER" == "Reboot" && "$OLD_NOTE" == "ASAP" ]]; then
+                        echo "$0:  Not offlining $HOSTNAME:  Pending reboot."
+                        exit 0
+                    fi
                     ;;
                 boot*)
                     # Offline node after reboot if vanilla `scontrol reboot` was

This avoids NHC clearing the reboot state of a node while it's pending reboot if that node has issues while waiting for the reboot. I plan to open new pull request based off this one once we've had a bit more time to thoroughly test things out.

hintron mentioned this pull request Apr 23, 2019

Handle Changes to Slurm's REBOOT workflow #81

Closed

mej self-assigned this Aug 17, 2020

mej added bug enhancement labels Aug 17, 2020

mej added this to the 1.4.3 Release milestone Aug 17, 2020

mej modified the milestones: 1.4.3 Release, 1.5 Release Apr 18, 2021

mej mentioned this pull request Apr 18, 2021

Ignore node status booted by Slurm #36

Closed

treydock reviewed Jul 29, 2021

View reviewed changes

martijnkruiten mentioned this pull request Mar 22, 2022

WIP: Node mark reboot helper #65

Open

treydock mentioned this pull request Apr 13, 2022

Numerous improvements to Slurm reboot handling and planned node state #115

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow NHC to work with recent changes to Slurm reboot #84

Allow NHC to work with recent changes to Slurm reboot #84

hintron commented Apr 23, 2019

hintron commented Apr 23, 2019 •

edited

Loading

martijnkruiten commented May 22, 2020

mej commented Aug 17, 2020

hintron commented Aug 17, 2020

treydock commented Jul 28, 2021

hintron commented Jul 28, 2021 •

edited

Loading

treydock Jul 29, 2021

treydock commented Jul 29, 2021

treydock commented Aug 24, 2021

Allow NHC to work with recent changes to Slurm reboot #84

Are you sure you want to change the base?

Allow NHC to work with recent changes to Slurm reboot #84

Conversation

hintron commented Apr 23, 2019

hintron commented Apr 23, 2019 • edited Loading

martijnkruiten commented May 22, 2020

mej commented Aug 17, 2020

hintron commented Aug 17, 2020

treydock commented Jul 28, 2021

hintron commented Jul 28, 2021 • edited Loading

treydock Jul 29, 2021

Choose a reason for hiding this comment

treydock commented Jul 29, 2021

treydock commented Aug 24, 2021

hintron commented Apr 23, 2019 •

edited

Loading

hintron commented Jul 28, 2021 •

edited

Loading