Add EDAC hardware check (ECC memory errors) #23
base: master
…es by re-execing NHC after each loop. Some problems have been reported when NHC is used as a Grid Engine load sensor, including memory leaks, due to the long-running NHC process. All function variables reported as being leaked are local to their respective functions, but for unknown reasons (BASH bug?), they're being leaked anyway. In an attempt to work around these issues, this commit re-execs NHC after the completion of each iteration. There are some side effects to this, including resetting of non-exported environment variables, so this is experimental for now.
new check: check_nvsmi_healthmon which uses nvidia-smi
… does or does not match a specific node range
…after the last check, such as if pbsnodes hangs and causes nhc to timeout
Added more debug information for if a check matches the node range in nhc.conf
…n works before running tests that need it.
…keeping debugging output added by NREL. Avoids executing mcheck() 3 times on same data.
fixed misleading timeout message if nhc times out after all checks were completed
David Whiteside (GitHub user starboarder2001) noticed (see his PR mej#9 for details) that NHC would sometimes report that it had timed out while executing a particular check when, in fact, it had already completed that check and was trying to finish its work and/or clean up (e.g., marking a node online). This tweaks his fix to be even more specific about what exactly NHC is doing as it goes through the final stages of execution.
…bracketing delimiters.
…s the result file.
…o add file/line/function info to PS4 for xtrace output.
…of authorized users.
This fixes mej#15 by using squeue instead of stat to obtain the list of authorized users
…m bbbbbrie to use built-in NHC variable HOSTNAME_S instead of shelling out to the hostname command.
* sge-fixes: workaround: Try avoiding SGE-related memory leaks and other BASH issues by re-execing NHC after each loop. Allow SIGUSR1 and SIGUSR2 to toggle bash tracing and debug mode, respectively.
* master: Makefile.am: Minor/cosmetic fix for removing /var/{lib,run}/nhc in uninstall-local rule. Allow SLURM nodes in reservation to be marked offline Fixed bad links
One of my last major changes before leaving LBNL was to globally disable pathname expansion (i.e., globbing) by default throughout NHC. Apparently when I did that, I missed a spot: the unit test driver script! So only the unit tests specific to `nhc` itself have been running since that happened (commit 8fa6657). While this problem has existed for over a year-and-a-half in wallclock time, commit-wise it hasn't been very long. (About 10-ish, not counting merging work from others.) That's my story, and I'm sticking to it.
Fix the process substitution sanity check so that we're not skipping 46 unit tests for no reason whatsoever.
check_file_test added negative option
While taking another look at @SMark-Black's mej#50 and mej#53, I realized that the code in question regarding `$MAX_SYS_UID` is doing exactly what it is supposed to do, given the intended meaning of the variable based on its name. What was *actually* wrong was that the `nhc_common_get_max_sys_uid()` function has been reading the wrong variable! So `nhc_common_get_max_sys_uid()` will now look for `$SYS_UID_MAX` in `/etc/login.defs` like it should have been doing all along, and will use the value of `UID_MIN - 1` as a fallback if necessary.
NOTE: This means that the default auto-detected value of `$MAX_SYS_UID` will likely be something ending in `99` (like `499` or `999`) rather than `00` because it was always intended to be (and the code has always treated it as) the *top* of the exempt UID range, NOT the bottom of the non-exempt UID range! If you have any configs or scripts that rely on different assumptions, please make any necessary updates.
Closes mej#50.
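For illustration, the lookup order described in the commit message above (prefer `SYS_UID_MAX`, fall back to `UID_MIN - 1`) might look roughly like this. It is a minimal sketch in portable shell, not the actual `nhc_common_get_max_sys_uid()`. The function name `get_max_sys_uid`, its optional file argument, and the last-resort default of `99` are assumptions made only for this example.

```shell
# Hypothetical helper; NOT NHC's real nhc_common_get_max_sys_uid().
# Prints SYS_UID_MAX from a login.defs-style file, or UID_MIN - 1 if
# SYS_UID_MAX is absent, or 99 (an arbitrary assumption) if neither is set.
get_max_sys_uid() {
    local defs_file="${1:-/etc/login.defs}"
    local key value rest sys_uid_max="" uid_min=""

    if [ -r "$defs_file" ]; then
        # Each setting is "KEY<whitespace>VALUE"; comment lines never match.
        while read -r key value rest || [ -n "$key" ]; do
            case "$key" in
                SYS_UID_MAX) sys_uid_max="$value" ;;
                UID_MIN)     uid_min="$value" ;;
            esac
        done < "$defs_file"
    fi

    # Accept only plain non-negative integers.
    case "$sys_uid_max" in
        ''|*[!0-9]*) ;;
        *) echo "$sys_uid_max"; return 0 ;;
    esac
    case "$uid_min" in
        ''|*[!0-9]*) echo 99 ;;
        *) echo $((uid_min - 1)) ;;
    esac
}
```

Called with no argument, the sketch reads `/etc/login.defs`; passing a path makes it easy to try against a test file.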
Based on a couple changes suggested by @SMark-Black in his PR mej#53, add another command to look for to auto-detect LSF, and add support for the LSF `res` daemon to the `check_ps_userproc_lineage()` check. Also moved the setting of `$RM_DAEMON_MATCH` to inside the check -- that's the only thing in that whole entire file that actually requires a resource manager!
Disable the logfile when in eval (`-e`) mode so that NHC doesn't try to output to both the log and `stdout` and wind up saying the same thing twice!
EDAC support is high on our priority list at LANL, so getting this merged is very much on my radar and at the top of my priority list! I want to make sure it gets some bake time in production before putting it into a release, so I'm retargeting this for 1.4.4, but it will be going in very early in the new year! :-)
Add a Table of Contents to the README.md documentation file automatically generated by the `gh-md-toc` script from @ekalinin. Many thanks to @basvandervlies for both suggesting this much-needed addition and helping find an editor-agnostic way to generate it automatically going forward!

To update:

```bash
$ git clone https://github.com/ekalinin/github-markdown-toc.git
$ github-markdown-toc/gh-md-toc --insert README.md
```

Closes mej#67. Feedback is welcome on (1) whether or not to submodule-ize this, and (2) whether or not any changes are needed to tweak the output for NHC...it looks like it might need some manual tweaking at the moment.
Made a couple tweaks to the `gh-md-toc` script (NHC-specific) to put the Table of Contents more in line with what the intent of the formatting was, not necessarily the exact indentation level. I intentionally skip heading levels to achieve the correct style, but that confuses the ToC generator. Some special casing in the `awk` script has remedied that. I haven't taken the time to make the changes in a way that would be generic enough to consider upstreaming. Maybe in the future!
The current edac-utils release 0.16 from RHEL/CentOS 7 contains a bug as documented in https://github.com/grondo/edac-utils/blob/master/NEWS: Version 0.18 (2011-11-09);
We get this useless output from the command:
I have opened a bug report for RHEL 7 requesting an upgrade of edac-utils to version 0.18, see https://bugzilla.redhat.com/show_bug.cgi?id=1662858
Hopefully this will be useful for NHC when Red Hat implements the update.
It turns out that edac-utils is deprecated (though still supported) in RHEL 7, and the hardware checking functionality is replaced by rasdaemon, see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/migration_planning_guide/sect-red_hat_enterprise_linux-migration_planning_guide-deprecated_packages
The rasdaemon is documented for RHEL 7 in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/sec-checking_for_hardware_errors
Usage of rasdaemon on RHEL 7 requires a daemon: systemctl start rasdaemon
For NHC it may be preferable to use rasdaemon instead of the deprecated edac-utils or mcelog.
2cc5f7c to 38142c4
This patch will add a `check_hw_edac` check to verify correctable and uncorrectable ECC errors in memory, as reported by `edac-utils`.
EDAC is an alternative to MCE checks, with support for older hardware (cf. http://www.mcelog.org/faq.html#13), and could be used on platforms where `mcelog` is not available. It is very closely modelled after the `check_hw_mcelog` function, with similar threshold definitions.
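As a rough illustration of the threshold idea, the sketch below sums corrected and uncorrected error counts from report text and fails when either total exceeds its limit. It is not the `check_hw_edac` code from this patch. The function name `check_edac_counts`, its two threshold arguments, and the line format it parses (a count followed by `Corrected Errors` or `Uncorrected Errors`) are assumptions for this example; the real `edac-util` output should be checked before relying on it.

```shell
# Hypothetical sketch; NOT the check_hw_edac() added by this patch.
# Usage: <report text on stdin> | check_edac_counts MAX_CE MAX_UE
# Assumes each relevant line ends like "... N Corrected Errors" or
# "... N Uncorrected Errors ..." (an assumed format, see lead-in).
check_edac_counts() {
    local max_ce="$1" max_ue="$2" line count ce=0 ue=0

    while read -r line || [ -n "$line" ]; do
        case "$line" in
            *" Uncorrected Errors"*)
                count=${line%" Uncorrected Errors"*}   # drop the label
                count=${count##* }                     # keep the last word
                case "$count" in ''|*[!0-9]*) continue ;; esac
                ue=$((ue + count)) ;;
            *" Corrected Errors"*)
                count=${line%" Corrected Errors"*}
                count=${count##* }
                case "$count" in ''|*[!0-9]*) continue ;; esac
                ce=$((ce + count)) ;;
        esac
    done

    if [ "$ue" -gt "$max_ue" ]; then
        echo "EDAC: $ue uncorrectable errors (threshold $max_ue)"
        return 1
    fi
    if [ "$ce" -gt "$max_ce" ]; then
        echo "EDAC: $ce correctable errors (threshold $max_ce)"
        return 1
    fi
    return 0
}
```

Reading from standard input keeps the sketch independent of any particular `edac-util` invocation, so it can be tried against captured output first.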