Helper scripts are not called when the node fails the health check with Slurm #147
Comments
Did you configure slurm.conf to call NHC? We use the line: |
Yes, this was configured. I can see that
|
I've never seen the check named "check_xid_errors", I wonder where that came from? Did you define this in your nhc.conf file? |
Yes, this is from https://github.com/NVIDIA/deepops. I added this line:
|
I don't know about this check. You could try to configure a "fake" check in nhc.conf on the node, like adding a check of check_hw_physmem for values that are definitely wrong. This should cause slurmd to mark the node offline next time it calls NHC.
Make sure to configure all the NHC parameters in slurm.conf, for example:
HealthCheckProgram=/usr/sbin/nhc
The default value of HealthCheckInterval is 0, which disables NHC!
BTW, which version of Slurm do you run? |
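The slurm.conf advice above can be checked with a short script. This is only a rough sketch: `get_param` and `check_nhc_settings` are hypothetical helpers (not part of Slurm or NHC), the parser assumes each parameter sits alone on its own `Key=value` line, and the interval of 300 in the demo file is just an illustrative value.

```shell
#!/usr/bin/env bash
# Sketch: report the NHC-related settings found in a slurm.conf-style file.
# get_param and check_nhc_settings are hypothetical helpers for illustration.

get_param() {
    # Print the value of the last uncommented "Key=value" line for key $2
    # in file $1. Key matching is case-insensitive.
    sed -e 's/#.*//' "$1" | awk -F= -v key="$2" '
        BEGIN { key = tolower(key) }
        {
            k = $1; gsub(/[[:space:]]/, "", k)
            if (tolower(k) == key) {
                v = $2; gsub(/^[[:space:]]+|[[:space:]]+$/, "", v)
                val = v
            }
        }
        END { print val }'
}

check_nhc_settings() {
    local conf="$1" program interval
    program=$(get_param "$conf" HealthCheckProgram)
    interval=$(get_param "$conf" HealthCheckInterval)
    echo "HealthCheckProgram=${program:-<unset>}"
    echo "HealthCheckInterval=${interval:-0}"
    # An unset program or an interval of 0 means no periodic health check.
    if [ -z "$program" ] || [ "${interval:-0}" -eq 0 ]; then
        echo "WARNING: periodic health checks are disabled"
        return 1
    fi
    return 0
}

# Demo on a throwaway file (values are illustrative assumptions).
demo_conf=$(mktemp)
printf 'HealthCheckProgram=/usr/sbin/nhc\nHealthCheckInterval=300\n' > "$demo_conf"
check_nhc_settings "$demo_conf"
rm -f "$demo_conf"
```

On a real node you would point `check_nhc_settings` at the slurm.conf that slurmd actually reads; the path depends on the installation.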
I am using Slurm 23.02.4 |
I tried the standard check
|
@OleHolmNielsen I think the helper script should be run by nhc rather than Slurm? Based on the log, nhc is definitely executed by Slurm. |
I can now confirm that it is a bug in nhc 1.4.3. Reinstalling with 1.4.3 again does not work, but reinstalling with 1.4.2 corrects this bug. |
I can confirm 1.4.3 doesn't run the |
add: the |
Hello, FWIW, I had this problem in 1.4.3 because scontrol was not in the PATH, so the auto-detection in nhcmain_find_rm didn't work. Setting NHC_RM in /etc/sysconfig/nhc worked for me. |
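For anyone wanting to test whether the PATH explanation above applies to their node, a minimal sketch follows. `suggest_nhc_rm` is a hypothetical helper; it does not reproduce what nhcmain_find_rm does internally, it only checks whether scontrol is reachable on the PATH it is run with. The `NHC_RM=slurm` value is the one used in the manual invocation from the original report.

```shell
#!/usr/bin/env bash
# Sketch: check whether scontrol is reachable on the current PATH and,
# if not, point at the NHC_RM workaround mentioned in this thread.
# suggest_nhc_rm is a hypothetical helper, not part of NHC.

suggest_nhc_rm() {
    if command -v scontrol >/dev/null 2>&1; then
        echo "scontrol found at $(command -v scontrol)"
    else
        echo "scontrol not on PATH; consider adding NHC_RM=slurm to /etc/sysconfig/nhc"
    fi
}

suggest_nhc_rm
```

The PATH that matters is the one NHC sees when slurmd launches it, which may differ from an interactive shell. To approximate that, the function can be run with a restricted PATH, for example `PATH=/usr/bin:/bin suggest_nhc_rm` (that PATH value is an assumption, not taken from slurmd).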
I believe we're being affected by this issue as well. Any movement on this? I'm experiencing exactly the same behavior as @szhengac, and I'm at my wit's end. |
Hi,
I am testing NHC with Slurm to automatically drain nodes with ECC uncorrectable errors. The NHC log shows that the health check fails on the problematic node, but no helper scripts are executed to put the node into the drain state. If I manually call the helper script, like sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into the drain state. How can I enable Slurm and NHC to call the helper scripts automatically when the node fails the health check? Thanks!
/var/log/nhc.log: