Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow NHC to work with recent changes to Slurm reboot #84

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

hintron
Copy link

@hintron hintron commented Apr 23, 2019

Add boot node state to node online and offline script.
Properly handle scontrol reboot asap so Slurm doesn't erroneously
online the node after the first NHC call after a reboot.
Make sure node is always offlined after reboot until NHC onlines it,
regardless whether scontrol reboot or scontrol reboot asap ran.
Prevent onlining a node while there is a pending reboot.

For context, see #81 and
https://bugs.schedmd.com/show_bug.cgi?id=6391

Add `boot` node state to node online and offline script.
Properly handle `scontrol reboot asap` so Slurm doesn't erroneously
online the node after the first NHC call after a reboot.
Make sure node is always offlined after reboot until NHC onlines it,
regardless whether `scontrol reboot` or `scontrol reboot asap` ran.
Prevent onlining a node while there is a pending reboot.

For context, see mej#81 and
https://bugs.schedmd.com/show_bug.cgi?id=6391
@hintron
Copy link
Author

hintron commented Apr 23, 2019

This was a rework of pull request #83. The main differences from #83 are:

  1. prevent NHC from onlining nodes that are pending reboot
  2. make sure that a vanilla scontrol reboot (i.e. no asap modifier) still keeps things in drain after reboot until NHC sees fit to online the node.

@martijnkruiten
Copy link
Contributor

I've pushed our internal version to my fork: https://github.com/mej/nhc/pull/65/files. It's a completely different helper, and it's activated by setting export OFFLINE_NODE="$HELPERDIR/node-mark-reboot" above the tests that should trigger a reboot. You might find something useful there. I will check your version as well.

@mej mej self-assigned this Aug 17, 2020
@mej mej added this to the 1.4.3 Release milestone Aug 17, 2020
@mej
Copy link
Owner

mej commented Aug 17, 2020

So I see two separate and distinct steps to address the Slurm reboot functionality handling being missing from NHC:

  1. Add handling to ensure that the boot state is recognized and handled intelligently in terms of onlining/offlining the nodes, but without any "integration," per se.
  2. Provide additional functionality that integrates with Slurm's scontrol reboot feature so that all pre-, post-, and during-reboot states are not only dealt with correctly but can also facilitate unique capabilities that don't yet exist.

Since the current contents of the nhc/dev branch have been thoroughly tested and stable, I don't plan to make very many changes to that code prior to release. So if there's a way to address item 1 (make NHC behave sanely with respect to the boot state(s)) in the short term for the 1.4.3 release now, given that we would then address item 2 for the subsequent 1.4.4 release, that would be ideal.

I will freely admit that, at this point, I have not ever used the scontrol reboot feature, so I'm open to input from both of you (@hintron and @martijnkruiten) regarding how each of these is best handled. At the moment, I'm leaning toward merging in @martijnkruiten's #65 changes for 1.4.3 and then possibly taking an approach closer to the changes here in #84 for the 1.4.4 release. I think the separate node-mark-reboot script, along with the minimal changes to existing code, is the best bet for getting 1.4.3 out the door in the safest way possible.

Your thoughts?

@hintron
Copy link
Author

hintron commented Aug 17, 2020

@mej I am fine with the timeline you proposed. This patch was just to get things working with minimal changes to NHC. It's possible there is a better way to architect it.

@mej mej modified the milestones: 1.4.3 Release, 1.5 Release Apr 18, 2021
@treydock
Copy link
Contributor

@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom reason= to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place rather than relying on the default Reboot ASAP message getting set by SLURM.

Example:

scontrol reboot ASAP nextstate=resume reason='KEY_INFO - reboot to test NHC changes' p0002

We use KEY_INFO as a way to tell our monitoring to ignore this drained node since we generate alerts in Prometheus when nodes are drained for reasons other than KEY_INFO or NHC:.

I also noticed that if I set nextstate=RESUME when I do scontrol reboot ASAP the node still comes up unhealthy without NHC marking the node offline , I am still testing what exactly is happening but in my test environment I have a systemd unit file override to run NHC as a ExecStartPost so that NHC runs right after slurmd starts. It seems this patch assumes nextstate=DRAIN which might be fine given NHC will then be able to resume the node.

# /etc/systemd/system/slurmd.service.d/nhc.conf
# File managed by Puppet
[Service]
ExecStartPost=-/usr/sbin/nhc

I am on SLURM 20.11.7, and still testing this patch, but so far is a very welcome feature to NHC as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.

@hintron
Copy link
Author

hintron commented Jul 28, 2021

@hintron I was testing this patch on our NHC deployment and noticed some of the logic is ignored if I give a custom reason= to scontrol during the reboot. We tend to do this so we can have a meaningful reason why the reboot is taking place rather than relying on the default Reboot ASAP message getting set by SLURM.

I am on SLURM 20.11.7, and still testing this patch, but so far is a very welcome feature to NHC as we use SLURM for rolling reboots and are having issues with jobs starting after reboot before GPFS is mounted.

Good to hear!

This patch was originally written to target 18.08 (I believe), and we've been continually tweaking the Slurm boot logic since, so I am pleasantly surprised that it is still useful. I'm guessing that it needs to be updated a bit to solve the problem you are mentioning (though it's possible it was always flawed).

Unfortunately, I don't have the go-ahead to work on this further, so I'll have to leave further development to someone else. Note that 21.08 is also making some changes in node states and how they are printed out, which may affect this as well.

# because $STATUS would show MIX@ or ALLOC@, not BOOT.
# See src/common/slurm_protocol_defs.c-->node_state_string()
SHOW_NODE_OUTPUT="$($SLURM_SCONTROL show node $HOSTNAME)"
if [[ $SHOW_NODE_OUTPUT == *"State=REBOOT "* ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change this to "State=REBOOT" without the trailing space. While a node was booting this is the state I saw: State=REBOOT*+DRAIN.

@treydock
Copy link
Contributor

Aside from my minor tweak, this works with SLURM 20.11.7 by doing something like this:

scontrol reboot ASAP nextstate=down <nodes>

The default of nextstate=resume prevents this extra reboot logic from ever getting used. Also as noted previously, can't pass a reason to the reboot command, have to let the default reason be used.

@treydock
Copy link
Contributor

In case anyone else comes across this, another thing we had to change:

--- node-mark-offline.20210824  2021-07-30 10:45:41.518514000 -0400
+++ node-mark-offline   2021-08-24 16:17:17.380167000 -0400
@@ -73,6 +73,10 @@
                         echo "$0:  Not offlining $HOSTNAME:  Already offline with no note set."
                         exit 0
                     fi
+                    if [[ "$OLD_NOTE_LEADER" == "Reboot" && "$OLD_NOTE" == "ASAP" ]]; then
+                        echo "$0:  Not offlining $HOSTNAME:  Pending reboot."
+                        exit 0
+                    fi
                     ;;
                 boot*)
                     # Offline node after reboot if vanilla `scontrol reboot` was

This avoids NHC clearing the reboot state of a node while it's pending reboot if that node has issues while waiting for the reboot. I plan to open new pull request based off this one once we've had a bit more time to thoroughly test things out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants