Merge pull request #592 from NAICNO/larstha-541-misc-reports
For #541 - clean up various reports
lars-t-hansen authored Sep 4, 2024
2 parents 27369d3 + ee65960 commit 03ad4d9
Showing 51 changed files with 1,020 additions and 0 deletions.
1 change: 1 addition & 0 deletions adhoc-reports/fox-gpu-report/.gitignore
@@ -0,0 +1 @@
sonalyze
6 changes: 6 additions & 0 deletions adhoc-reports/fox-gpu-report/README.md
@@ -0,0 +1,6 @@
This is some code that was used to create a note on Fox GPU usage.

You need to create a link to sonalyze in the current directory. Beyond that, see the code (and the
note on Gdocs,
https://docs.google.com/document/d/1UmXWjvc4xC64spLAzQPoNzhKZsFoO4YtkV1jIe5_iRg/edit?usp=sharing, if
you're UiO-internal).
24 changes: 24 additions & 0 deletions adhoc-reports/fox-gpu-report/load.bash
@@ -0,0 +1,24 @@
#!/bin/bash

from=$1
from=${from:-2024-01-01}

to=$2
to=${to:-2024-08-19}

hosts="gpu-1 gpu-2 gpu-7 gpu-8"

for host in $hosts; do
    echo $host
    ./sonalyze load \
        -remote https://naic-monitor.uio.no \
        -cluster fox \
        -auth-file ~/.ssh/sonalyzed-auth.txt \
        -host "$host" \
        -daily \
        -from "$from" \
        -to "$to" \
        -user - \
        -fmt awk,date,time,cpu,res,gpu,gpumem,rgpu,rgpumem \
        | ./load.py
done
38 changes: 38 additions & 0 deletions adhoc-reports/fox-gpu-report/load.py
@@ -0,0 +1,38 @@
#!/usr/bin/python3
#
# The input is the output from `sonalyze load` with awk formatting and
# these fields:
#
# (ignored) (ignored) (ignored) (ignored) (ignored) (ignored) rgpu (ignored)...
#
# eg this command:
#
# sonalyze load \
# -remote https://naic-monitor.uio.no \
# -cluster fox \
# -auth-file ~/.ssh/sonalyzed-auth.txt \
# -host 'gpu-8' \
# -daily \
# -from 2024-01-01 \
# -fmt awk,date,time,cpu,res,gpu,gpumem,rgpu,rgpumem
#
# The output is one line of text, tab-separated
#
# average-utilization days-below-40 days-above-80

import sys

n=0
total=0
n_above_80=0
n_below_40=0
for l in sys.stdin:
    fs = l.split()
    k = int(fs[6])
    n += 1
    total += k
    if k < 40:
        n_below_40 += 1
    elif k > 80:
        n_above_80 += 1
print(str(total/n) + "\t" + str(n_below_40) + "\t" + str(n_above_80))
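The summary computed by load.py can also be written as a function that is testable without a pipe. This is a minimal sketch, not part of the commit: the name `summarize` and the sample lines are made up for illustration, and it assumes rgpu is the seventh whitespace-separated field, as in the header comment. Unlike the script, it returns None on empty input instead of dividing by zero.

```python
def summarize(lines, column=6, low=40, high=80):
    """Return (average, days_below_low, days_above_high), or None if no data."""
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) <= column:
            continue  # skip blank or short lines
        values.append(int(fields[column]))
    if not values:
        return None
    average = sum(values) / len(values)
    below = sum(1 for v in values if v < low)
    above = sum(1 for v in values if v > high)
    return average, below, above


# Made-up input in the field order date,time,cpu,res,gpu,gpumem,rgpu,rgpumem
sample = [
    "2024-01-01 00:00 10 5 30 20 35 25",
    "2024-01-02 00:00 10 5 90 20 85 25",
    "2024-01-03 00:00 10 5 60 20 60 25",
]
print(summarize(sample))
```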
16 changes: 16 additions & 0 deletions adhoc-reports/fox-gpu-report/waittime.bash
@@ -0,0 +1,16 @@
#!/bin/bash
#
# Note, this is meant to be run on naic-monitor.uio.no against a local data store.

from=$1
from=${from:-2024-01-01}
for host in gpu-1 gpu-2 gpu-7 gpu-8; do
    echo $host
    ./sonalyze sacct \
        -data-dir ~/fox-experiment/data2 \
        -from "$from" \
        -all \
        -fmt awk,Submit,Start,Wait,NodeList \
        | grep -E "$host\$" \
        | ./waittime.py
done
63 changes: 63 additions & 0 deletions adhoc-reports/fox-gpu-report/waittime.py
@@ -0,0 +1,63 @@
#!/usr/bin/python
#
# Input format:
# (ignored) (ignored) WaitTimeSec (ignored) ...
#
# Output format (tab-separated)
# total-jobs 0h 1h 2h 3h 4-11h 12-23h 1-6d > 6d

import sys

def main():
    # Read data and bucket by whole hours waited
    buckets = {}
    count=0
    for l in sys.stdin:
        fs = l.split()
        w = int(fs[2])
        b = w // 3600
        count += 1
        if not b in buckets:
            buckets[b] = 1
        else:
            buckets[b] = buckets[b] + 1

    waiting0=0
    waiting1=0
    waiting2=0
    waiting3=0
    waiting4=0
    waiting12=0
    waiting_days=0
    waiting_weeks=0
    if 0 in buckets:
        waiting0=buckets[0]
    if 1 in buckets:
        waiting1=buckets[1]
    if 2 in buckets:
        waiting2=buckets[2]
    if 3 in buckets:
        waiting3=buckets[3]
    # 4-11h
    for i in range(4,12):
        if i in buckets:
            waiting4 += buckets[i]
    # 12-23h
    for i in range(12,24):
        if i in buckets:
            waiting12 += buckets[i]
    # 1-6d, and 7 days or more
    for k in buckets:
        if k >= 24 and k < 24*7:
            waiting_days += buckets[k]
        elif k >= 24*7:
            waiting_weeks += buckets[k]

    print(str(count) + "\t" +
          str(waiting0) + "\t" +
          str(waiting1) + "\t" +
          str(waiting2) + "\t" +
          str(waiting3) + "\t" +
          str(waiting4) + "\t" +
          str(waiting12) + "\t" +
          str(waiting_days) + "\t" +
          str(waiting_weeks))

main()
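The bucket boundaries in the output format above are easy to get wrong by one, so it may help to see them as a single function. The sketch below is illustrative only: `bucket_label` is a hypothetical name, not something in the commit, and it assumes the wait time is a non-negative integer number of seconds.

```python
def bucket_label(wait_seconds):
    """Map a wait time in seconds to one of the report's column labels."""
    hours = wait_seconds // 3600  # whole hours waited
    if hours < 4:
        return "{}h".format(hours)  # 0h, 1h, 2h, 3h
    if hours < 12:
        return "4-11h"
    if hours < 24:
        return "12-23h"
    if hours < 24 * 7:
        return "1-6d"
    return "> 6d"


# Boundary cases: just under an hour, exactly one day, exactly one week
for seconds in (3599, 24 * 3600, 7 * 24 * 3600):
    print(seconds, bucket_label(seconds))
```

Each half-open range starts where the previous one ends, so every job lands in exactly one column and the columns sum to total-jobs.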
1 change: 1 addition & 0 deletions adhoc-reports/leadership-report/.gitignore
@@ -0,0 +1 @@
__pycache__
19 changes: 19 additions & 0 deletions adhoc-reports/leadership-report/README.md
@@ -0,0 +1,19 @@
These scripts were used to generate some tables for the prototype NRIS Leadership Meeting report.

IMPORTANT SETUP INSTRUCTION:

Before running be sure to cd to ../../code and run install.sh to build and install the binaries in
~/go/bin. Otherwise you risk using stale code.

PROGRAMS:

lm-slurm-heatmaps.bash generates all the txt files, which are heat maps for various kinds of jobs.
There's no sense in rerunning this as it stands, as the date ranges in the script are fixed and
the outputs will be the same, but the dates can be changed.

bad-gpu-heatmap.py prints a 5x5 heatmap of job counts that use varying amounts of GPU and GPU
memory on the Fox GPUs.

bad-gpu-jobs.py prints a list of commands(!) that are run on Fox GPUs without using much GPU.
This is mostly a PoC, and the report can be adapted easily to print other things, such as jobs,
users, and accounts.
11 changes: 11 additions & 0 deletions adhoc-reports/leadership-report/all-map.txt
@@ -0,0 +1,11 @@
----- ----- ----- ----- ----- ----- ----- ----- ----- -----
| 23005 15072 2205 1014 1360 540 374 409 649 2233
| 2851 3020 1897 1193 1980 1587 880 868 546 15150
| 519 819 461 179 147 69 97 142 259 1019
| 741 389 394 116 422 233 384 492 1195 92961
| 1493 535 459 160 302 189 353 572 882 14846
| 126 108 240 16 22 18 44 48 104 626
| 74 165 325 336 375 397 43 148 186 317
| 46 96 179 9 15 78 413 518 434 612
| 54 59 192 30 110 518 665 323 533 371
| 413 186 320 405 783 279 450 1001 1212 10866
57 changes: 57 additions & 0 deletions adhoc-reports/leadership-report/bad-gpu-heatmap.py
@@ -0,0 +1,57 @@
#!/usr/bin/python3
#
# See README.md for setup instructions.
#
# Example report: heat map of peak gpu and gpu memory usage of programs that ran on the GPU nodes
# for at least 2 hours.

import os, subprocess, sys, re
from sonalyze import sonalyze
from heatmap import heatmap

# Exclude gpu-3 because it is dedicated to interactive work and has no Slurm jobs. Maybe gpu-9 also?
cluster = "fox"
hostglob = "gpu-[2,3,4-13]"
from_date = "2024-03-01"
to_date = "2024-05-31"

####################################################################################################
# Find job IDs of all noninteractive slurm jobs. Here we don't filter TIMEOUT and CANCELLED jobs,
# nor do we use -some-gpu, and we use -all to be maximally inclusive.

sacct_output = sonalyze("sacct", cluster,
                        ["-fmt", "awk,JobID,User,Account,State,JobName",
                         "-from", from_date,
                         "-to", to_date,
                         "-all",
                         "-host", hostglob])
noninteractive_jobs={}
interactive=re.compile(r"interactive|OOD|ood")
for l in sacct_output:
    if interactive.search(l) is not None:
        continue
    fields=l.split(" ")
    noninteractive_jobs[fields[0]] = True

#print("{0} sacct records, {1} noninteractive left".format(len(sacct_output), len(noninteractive_jobs)))

####################################################################################################
# Slurp and filter job data. Only look at jobs that ran at least 2 hours (elapsed) since those are
# considered somewhat "serious".

jobs_output = sonalyze("jobs", cluster,
                       ["-from", from_date,
                        "-to", to_date,
                        "-u", "-",
                        "-min-runtime", "2h",
                        "-host", hostglob,
                        "-fmt", "awk,job,sgpu,sgpumem"])

# Fields are jobid, sgpu-avg, sgpu-peak, sgpumem-avg, sgpumem-peak; keep the two peaks
out=[]
for l in jobs_output:
    fields=l.split(" ")
    if fields[0] in noninteractive_jobs:
        out.append([fields[2], fields[4]])

for l in heatmap(out, 5):
    print(l)
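The body of heatmap.py is not shown in this view, so the following is only a guess at the kind of binning such a heat map needs, not the module's actual code. It assumes each pair holds two percentages in the range 0 to 100 (given as strings, as in the script above) and that values at or above 100 belong in the last bin. The name `grid_counts` is hypothetical.

```python
def grid_counts(pairs, n):
    """Count (x, y) percentage pairs into an n-by-n grid, indexed grid[row][col]."""
    step = 100 / n
    grid = [[0] * n for _ in range(n)]
    for x, y in pairs:
        # Clamp so that 100% (or more) falls into the last bin
        col = min(int(float(x) // step), n - 1)
        row = min(int(float(y) // step), n - 1)
        grid[row][col] += 1
    return grid


# Made-up (gpu-peak, gpumem-peak) pairs
demo = [["10", "90"], ["100", "100"], ["20", "0"]]
for row in grid_counts(demo, 5):
    print(row)
```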
68 changes: 68 additions & 0 deletions adhoc-reports/leadership-report/bad-gpu-jobs.py
@@ -0,0 +1,68 @@
#!/usr/bin/python3
#
# See README.md for setup instructions.
#
# Example report: noninteractive programs that ran on the GPU nodes for at least 2 hours and used
# less than 25% of their allocated GPU and less than 25% of their allocated GPU memory (at peak).
# The 25% thresholds are pretty arbitrary!
#
# This is a good example of how sonar and slurm data are joined. One would think that the job
# numbers to include or exclude computed in the first step could be passed to sonalyze in the second
# step, but curl has limits on command line / url parameter length.

import os, subprocess, sys, re
from sonalyze import sonalyze

# Exclude gpu-3 because it is dedicated to interactive work and has no Slurm jobs. Maybe gpu-9 also?
cluster = "fox"
hostglob = "gpu-[2,3,4-13]"
from_date = "2024-03-01"
to_date = "2024-05-31"

####################################################################################################
# Find job IDs of all noninteractive slurm jobs. Here we don't filter TIMEOUT and CANCELLED jobs,
# nor do we use -some-gpu, and we use -all to be maximally inclusive.

sacct_output = sonalyze("sacct", cluster,
                        ["-fmt", "awk,JobID,User,Account,State,JobName",
                         "-from", from_date,
                         "-to", to_date,
                         "-all",
                         "-host", hostglob])
noninteractive_jobs={}
interactive=re.compile(r"interactive|OOD|ood")
for l in sacct_output:
    if interactive.search(l) is not None:
        continue
    fields=l.split(" ")
    noninteractive_jobs[fields[0]] = True

#print("{0} sacct records, {1} noninteractive left".format(len(sacct_output), len(noninteractive_jobs)))

####################################################################################################
# Slurp and filter job data. Only look at jobs that ran at least 2 hours (elapsed) since those are
# considered somewhat "serious".

jobs_output = sonalyze("jobs", cluster,
                       ["-from", from_date,
                        "-to", to_date,
                        "-u", "-",
                        "-min-runtime", "2h",
                        "-host", hostglob,
                        "-fmt", "awk,job,sgpu,sgpumem,user,cmd"])
all_cmds={}
jobs={}
for l in jobs_output:
    # We want noninteractive jobs whose gpu and gpumem peaks both fall below 25%
    # fields are jobid, sgpu-avg, sgpu-peak, sgpumem-avg, sgpumem-peak, user, cmd
    fields=l.split(" ")
    jobs[fields[0]]=fields
    if fields[0] in noninteractive_jobs and int(fields[2]) < 25 and int(fields[4]) < 25:
        for c in fields[6].split(","):
            if c in all_cmds:
                all_cmds[c].append(fields[0])
            else:
                all_cmds[c] = [fields[0]]

for c in all_cmds:
    print(c + " " + str(len(all_cmds[c])))
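The join between Slurm and Sonar data described in the header comment can be tried out without a sonalyze server by feeding both steps from in-memory lines. The sketch below is a stand-alone illustration under that assumption: the function names and the sample records are invented, and only the field layouts are taken from the `-fmt` strings and comments in the script.

```python
import re

INTERACTIVE = re.compile(r"interactive|OOD|ood")


def noninteractive_ids(sacct_lines):
    """Job IDs from sacct lines that do not look interactive."""
    return {line.split(" ")[0] for line in sacct_lines
            if INTERACTIVE.search(line) is None}


def low_gpu_command_counts(sacct_lines, jobs_lines, limit=25):
    """Count commands of noninteractive jobs whose GPU peaks are both below limit."""
    keep = noninteractive_ids(sacct_lines)
    counts = {}
    for line in jobs_lines:
        # jobid, sgpu-avg, sgpu-peak, sgpumem-avg, sgpumem-peak, user, cmd
        fields = line.split(" ")
        if len(fields) < 7 or fields[0] not in keep:
            continue
        if int(fields[2]) < limit and int(fields[4]) < limit:
            for cmd in fields[6].split(","):
                counts[cmd] = counts.get(cmd, 0) + 1
    return counts


# Made-up records: JobID User Account State JobName
sacct = [
    "101 alice acct1 COMPLETED train",
    "102 bob acct2 COMPLETED interactive",
    "103 carol acct3 TIMEOUT sweep",
    "104 dave acct4 COMPLETED prep",
]
jobs = [
    "101 5 10 3 8 alice python",
    "102 1 2 1 2 bob bash",
    "103 50 90 40 70 carol python",
    "104 2 3 1 1 dave python,nvidia-smi",
]
print(low_gpu_command_counts(sacct, jobs))
```

Doing the join on the client side, as the script does, sidesteps the URL length limit mentioned in the comment at the cost of fetching all job records for the period.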
1 change: 1 addition & 0 deletions adhoc-reports/leadership-report/heatmap.py
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 28 1 3 2 3 13 9 12 67
| 1 1 1
| 1 5 11
| 1 25
| 1 10
| 1 2 4
| 1 1 30 9
| 2 1
| 1 2 16 32 9
| 1 1
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 1 1 3 2 3 12 9 12 34
| 1
| 1 5 9
| 1 25
| 1 10
| 1 2 3
| 1 1 30 9
| 1 1
| 1 2 16 32 9
| 1 1
11 changes: 11 additions & 0 deletions adhoc-reports/leadership-report/lm-serious-map.txt
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 416 88 31 19 11 15 33 38 63 323
| 56 166 13 10 7 7 12 35 46 28
| 19 21 13 4 1 5 5 10 3 50
| 12 7 7 3 1 1 1 56
| 8 18 2 2 1 2 1 12
| 6 29 4 2 5 2 10
| 14 4 1 6 11 76 14
| 11 3 1 1 1
| 6 2 8 14 47 342 340 78 11
| 22 10 33 34 17 8 1 3 1 8
11 changes: 11 additions & 0 deletions adhoc-reports/leadership-report/lm-serious-noninteractive-map.txt
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 252 83 24 16 9 15 33 37 63 323
| 24 160 11 6 2 3 6 34 42 27
| 6 17 11 2 1 4 4 10 2 50
| 2 6 3 1 1 56
| 3 15 1 2 1 12
| 2 29 1 5 2 10
| 5 11 76 14
| 2 2
| 1 1 7 13 46 342 340 78 11
| 5 27 34 14 7 1 1 8
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 160 81 19 12 4 13 31 36 54 307
| 12 158 11 3 2 1 2 1 22
| 2 17 11 2 1 3 4 6 49
| 1 6 2 1 53
| 1 15 1 2 1 10
| 2 29 1 3 2 5
| 5 11 76 14
| 2 2
| 1 1 7 13 46 342 340 78 11
| 5 27 34 14 7 1 1 7
11 changes: 11 additions & 0 deletions adhoc-reports/leadership-report/lm-serious-timeout-map.txt
@@ -0,0 +1,11 @@
--- --- --- --- --- --- --- --- --- ---
| 92 2 5 4 5 2 2 1 9 16
| 12 2 3 2 4 33 42 5
| 4 1 4 2 1
| 1 1 1 3
| 2 2
| 2 5
|
|
|
| 1