Error reporting backchannel + auto-monitoring #585

Open
2 tasks
lars-t-hansen opened this issue Sep 2, 2024 · 0 comments
Labels
component:infra (Shell scripts, cron scripts, web server, etc), task:enhancement (New feature or request), use-case

Comments

Collaborator

lars-t-hansen commented Sep 2, 2024

It appears that mail is not set up on most compute nodes, so the MAILTO setting in crontab won't work (it manifestly does not work on Fox). It's also not clear to me what happens, other than logging, if something goes wrong when sonar is run by systemd.

This is sort of a big deal: we need some type of auto-monitoring for the system, because there are too many things that can go wrong. Witness what happened when the nvidia-smi format changed.

I think we need two things:

  • When on-node infra reports failures, these failures should be embedded in the infra output somehow: as a field in sonar data, a field in sysinfo data, or a field in the sacctd data. On ingest, these fields can be factored out and reported, or (probably better) a periodic job can scrub the recent records and perform the reporting.
  • When a node stops reporting (we should see this as missing heartbeats from sonar, but there's also the issue of e.g. a master node not reporting sacctd data), there should be a report that is a little more than just a colored line on a dashboard. Probably somebody should get mail. How to do this for missing heartbeats from a node is not completely obvious, since we don't want any kind of mail storm and nodes can be down for a long time.
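
To make the two points above concrete, here is a rough sketch of what a periodic scrubbing job might look like. Everything in it is an assumption for illustration: the `Record` shape, the `errors` field, and the function names are hypothetical and do not reflect the actual sonar, sysinfo, or sacctd formats. The mail storm is avoided by remembering which hosts have already been reported and alerting only on hosts that have newly gone silent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


# Hypothetical record shape. The "errors" field is an assumed extension,
# not something the current sonar/sysinfo/sacctd output is known to carry.
@dataclass
class Record:
    host: str
    timestamp: datetime
    errors: list = field(default_factory=list)


def collect_errors(records, since):
    """Group embedded on-node error messages by host, for records at or after `since`."""
    report = {}
    for r in records:
        if r.timestamp >= since and r.errors:
            report.setdefault(r.host, []).extend(r.errors)
    return report


def silent_hosts(records, known_hosts, now, max_silence):
    """Return the sorted known hosts whose latest record is older than `max_silence`,
    or that have no record at all."""
    last_seen = {}
    for r in records:
        if r.host not in last_seen or r.timestamp > last_seen[r.host]:
            last_seen[r.host] = r.timestamp
    return sorted(
        h for h in known_hosts
        if h not in last_seen or now - last_seen[h] > max_silence
    )


def hosts_to_notify(silent, already_notified):
    """Alert only for hosts that were not silent on the previous run.

    Returns (new_alerts, updated_state). A host that resumes reporting drops
    out of the state, so a later outage on that host alerts again.
    """
    new_alerts = [h for h in silent if h not in already_notified]
    return new_alerts, set(silent)
```

With this design a node that stays down for weeks produces one report when it first goes silent, not one per run. The `updated_state` set would need to be persisted between runs (a small file would do), and the actual delivery of the report (mail from a host where mail works, or something else) is left open here.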
lars-t-hansen added the task:enhancement, use-case, and component:infra labels Sep 2, 2024