Ideas for "sonar analyze" #43

Open
bast opened this issue Jun 16, 2023 · 9 comments


@bast
Member

bast commented Jun 16, 2023

  • sum up usage by process and identify the most used processes
  • map processes to the actual codes that we recognize, taking the mappings in data/ as a starting point
  • later we will extend the mappings
  • allow users to run this to see a history of their own processes
  • the output can be a ranking (top 10 or 20 most used codes)
  • for the most used codes/processes we can then look at how they are used (memory and CPU footprint)
  • later: since we have the Slurm job ID, we can also compare what the job asked for vs. what the job used (either users want to know this themselves, or we want to know it for jobs that consume lots of CPU/memory -> advanced user support)
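
The first few ideas above (sum up by process, map processes to recognized codes, rank the most used ones, optionally per user) could be sketched roughly as follows. This is only an illustration: the record layout `(user, command, cpu_seconds, mem_gib)`, the glob patterns in `MAPPINGS`, and the function names are all assumptions and do not reflect the real sonar log format or the actual mapping files in data/.

```python
from collections import defaultdict
import fnmatch

# Hypothetical sample records: (user, command, cpu_seconds, mem_gib).
# These are NOT the real sonar log fields; adapt the parsing to the actual logs.
SAMPLES = [
    ("alice", "vasp_std", 1200.0, 30.0),
    ("alice", "python3", 300.0, 2.0),
    ("bob", "vasp_gam", 800.0, 12.0),
    ("bob", "orca_scf_mpi", 500.0, 8.0),
    ("alice", "orca_scf_mpi", 100.0, 4.0),
]

# Hypothetical pattern -> code mappings, standing in for the files in data/.
MAPPINGS = [
    ("vasp*", "VASP"),
    ("orca*", "ORCA"),
    ("python*", "Python"),
]


def map_to_code(command, mappings=MAPPINGS):
    """Return the first code whose glob pattern matches, else 'unknown'."""
    for pattern, code in mappings:
        if fnmatch.fnmatchcase(command, pattern):
            return code
    return "unknown"


def rank_codes(samples, top=10, user=None):
    """Sum CPU seconds and track peak memory per code, then rank by CPU."""
    cpu = defaultdict(float)
    peak_mem = defaultdict(float)
    for sample_user, command, cpu_seconds, mem_gib in samples:
        if user is not None and sample_user != user:
            continue  # restrict to one user's own history when requested
        code = map_to_code(command)
        cpu[code] += cpu_seconds
        peak_mem[code] = max(peak_mem[code], mem_gib)
    ranked = sorted(cpu.items(), key=lambda item: item[1], reverse=True)
    return [(code, seconds, peak_mem[code]) for code, seconds in ranked[:top]]


if __name__ == "__main__":
    for code, seconds, mem in rank_codes(SAMPLES):
        print(f"{code:10s} {seconds:10.0f} {mem:6.1f}")
```

Calling `rank_codes(SAMPLES, user="alice")` gives the per-user view; commands that match no pattern end up under `unknown`, which is one way to see which mappings still need to be added.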

big picture goals:

  • have data instead of anecdotes about how the system is used
  • input for procurement benchmarks
  • identify resource usage problems which will generate interesting support projects
@bast
Member Author

bast commented Jun 16, 2023

How I fetch data so that I can test on real data locally:

$ rsync --info=progress2 -a saga.sigma2.no:/cluster/shared/sonar/data/2023 example-data

@lars-t-hansen
Collaborator

I agree, these cover all the non-automated use cases for https://github.com/NAICNO/Jobanalyzer as well, and I think the automated use cases will require a superstructure separate from sonar (but using sonar data).

@bast
Member Author

bast commented Jun 16, 2023

@benteb is working on this part. More ideas and suggestions most welcome.

@lars-t-hansen
Collaborator

lars-t-hansen commented Jun 23, 2023

As we're going to need to synthesize jobs for batchless systems (see #56), we're also going to need a way to list those jobs from the logs, so I'm sketching out a utility for that over in https://github.com/NAICNO/Jobanalyzer, see subdirectory lsjobs. (Eventually it'll likely be merged into sonar, jobgraph, or an associated tool.) The program addresses a couple of the use cases for Jobanalyzer, and feeds into some of the ideas above too.

Edit: lsjobs is now called sonalyze, see next comment.

@lars-t-hansen
Collaborator

It turns out (see above comment) that what I've been doing for Jobanalyzer is basically sonar analyze; the tool is now called sonalyze and can analyze jobs and system load based on sonar logs. The tool is operational and looks pretty good for single-node systems, with multi-node systems being the next work item. It would probably be useful to look to that code and for us to pool our resources going forward.

@bast
Member Author

bast commented Jul 29, 2023

Thank you! I will check it out. It is definitely better to pool resources than to duplicate effort. It is quite possible that we will not need sonar analyze anymore.

@lars-t-hansen
Collaborator

I think that this is "done" and that we should move specific, detailed requests to the Jobanalyzer repo and try to resolve them there. In that repo, there is a top-level directory adhoc-reports/ that has some sample applications. These are primitive but evocative. For example, here's a rundown of the jobs Pubudu (a colleague here) ran on the ML cluster over the last 90 days:

[larstha@deathstar adhoc-reports]$ bash user-load-by-job-90d.sh 
1389165 58389 bwa
1208256 57369 bwa,htvc,java,postsort,python3,sort
235179 33597 applyBQSR,bwa,java,postsort,python3
704277 20808 applyBQSR,bwa,deepvariant,htvc,java,postsort,python3
244584 15264 bwa,starter
387450 12600 applyBQSR,bwa,deepvariant,htvc,java,postsort,python3,sort
374280 11640 applyBQSR,bwa,deepvariant,htvc,java,postsort,python3,starter
187812 10893 bwa,postsort,python3
393111 6510 bwa,htvc,java,postsort,python3
164973 3933 bwa,htvc,java
87264 3756 deepvariant,htvc,java
92772 2412 applyBQSR,bwa,deepvariant,htvc,java,python3,sort
37287 1143 deepvariant,htvc,java,python3
30186 909 bwa,htvc,java,starter
344505 840 bwa,sonard
59874 732 applyBQSR,htvc,java
33759 729 applyBQSR,bwa,htvc,java
21270 366 htvc,java,postsort,python3,starter
27390 150 htvc,java
1110 0 htop
12354 0 screen
1431 0 cp
147363 0 nvtop
1500 0 samtools
20292 0 seqkit

(The columns are CPU seconds, GPU seconds, and command names of the processes in the job - most of his work is implemented by these immense pipelines, and he runs a lot of experiments.)

This is a simple bash+awk script that executes remote sonalyze queries to the host that is the keeper of the sonar data. The script is not production ready - it needs parameters, formatting, headers, etc - but it's getting us toward what the present Issue is asking for.
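
As an illustration of the missing formatting and headers, here is a minimal sketch that post-processes output in the three-column shape shown above (CPU seconds, GPU seconds, comma-separated command names). It is not the bash+awk script from adhoc-reports/; the function names and the choice to display hours are assumptions made for the example.

```python
def parse_report(text):
    """Parse lines of the form '<cpu seconds> <gpu seconds> <cmd1,cmd2,...>'."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        cpu_seconds, gpu_seconds, commands = parts
        rows.append((int(cpu_seconds), int(gpu_seconds), commands.split(",")))
    return rows


def format_report(rows):
    """Render rows as a table with a header line, showing hours, not seconds."""
    lines = [f"{'CPU hours':>10} {'GPU hours':>10}  Commands"]
    for cpu_seconds, gpu_seconds, commands in rows:
        lines.append(
            f"{cpu_seconds / 3600:>10.1f} {gpu_seconds / 3600:>10.1f}  "
            + ", ".join(commands)
        )
    return "\n".join(lines)


# A few rows copied from the listing above.
SAMPLE = """\
1389165 58389 bwa
244584 15264 bwa,starter
1110 0 htop
"""

if __name__ == "__main__":
    print(format_report(parse_report(SAMPLE)))
```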

@lars-t-hansen
Collaborator

Hacked up another one: the top 25 jobs by program name on Fox for the last week. The columns are command name, CPU seconds, and percentage of accumulated CPU time:

[larstha@deathstar adhoc-reports]$ bash top-commands.sh 
cp2k.popt 785676363 76.6074
python 113719059 11.0882
nrniv 62076312 6.05275
R 21018129 2.04937
python,wandb-service(2 17817300 1.73728
pyrochlore_pt_w 6016500 0.586639
python,python_<defunct>,wandb-service(2 4992903 0.486833
python,python_<defunct> 2982276 0.290787
pyrochlore_wl.4 2270700 0.221405
R,snakemake 733257 0.0714963
molpro.exe 688473 0.0671296
TestBuild.x86_6,python3 674949 0.0658109
gmx 614124 0.0598802
kromosynth-eval,kromosynth-gRPC,kromosynth-rend 542838 0.0529295
python3 390198 0.0380463
R,iss,snakemake 345294 0.0336679
l1002.exe,l502.exe,l508.exe,l703.exe 337176 0.0328764
salmon 319629 0.0311654
kromosynth,kromosynth-eval,kromosynth-gRPC,kromosynth-rend 310482 0.0302736
bcftools,hisat2-align-s,python,python_<defunct>,python3.9 304644 0.0297043
cc1plus,ninja,nvcc,python,python3 191880 0.0187093
PCR_cpp,R,calculate_read_,iss,process_frags,snakemake 180696 0.0176188
jupyter-lab,python,wandb-service(2 177294 0.0172871
kromosynth,kromosynth-eval,kromosynth-rend 174726 0.0170367
gmsgenux.out 160722 0.0156712
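
The third column can be reproduced with a few lines of code. The sketch below assumes that the percentage is taken against the CPU time accumulated over all command sets, not just the rows that are printed; that is an assumption about top-commands.sh, and the function name and sample numbers are made up for the example.

```python
def top_commands(cpu_by_command, top=25):
    """Rank command sets by CPU seconds and attach their share of the total.

    The share is computed against the total over ALL entries (an assumption),
    so the printed rows need not add up to 100%.
    """
    total = sum(cpu_by_command.values())
    ranked = sorted(cpu_by_command.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, seconds, 100.0 * seconds / total if total else 0.0)
        for name, seconds in ranked[:top]
    ]


if __name__ == "__main__":
    # Made-up numbers, chosen so the shares are easy to check by hand.
    sample = {"prog_a": 750, "prog_b,prog_c": 200, "prog_d": 50}
    for name, seconds, share in top_commands(sample, top=2):
        print(name, seconds, f"{share:g}")
```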

@lars-t-hansen
Collaborator

lars-t-hansen commented Feb 13, 2024

And for hack value, the same on Saga (b/c I'm stealing the current sonar data from Saga):

[larstha@deathstar adhoc-reports]$ bash top-commands.sh 
slurmstepd,vasp 239477182 9.52487
python,slurmstepd 82719713 3.29006
mpirun,orca_gstep,orca_gtoint,orca_gtoint_mpi,orca_numfreq,orca_scf,orca_scf_mpi,orca_scfgrad,orca_scfgrad_mp,orca_util_mpi,slurmstepd 69090892 2.74799
osloctm3 62095641 2.46977
vsearch 51851361 2.06232
eT 47399968 1.88527
orca_scf_mpi,slurmstepd 40480806 1.61007
mpirun,orca,orca_gstep,orca_gtoint_mpi,orca_scf_mpi,orca_scfgrad_mp,orca_util_mpi,python,slurmstepd 33806350 1.3446
mpirun,orca,orca_gtoint_mpi,orca_scf_mpi,orca_scfgrad_mp,python,slurmstepd 32891478 1.30821
l1002.exe,l1002.exel,l502.exe,l502.exel,l701.exe,l701.exel,l702.exe,l703.exe,l703.exel,l716.exe,l914.exe,l914.exel 31855362 1.267
l1002.exe,l1110.exe,l302.exe,l401.exe,l502.exe,l701.exe,l703.exe 31623165 1.25777
mpirun,orca_gtoint,orca_gtoint_mpi,orca_numfreq,orca_scf,orca_scf_mpi,orca_scfgrad,orca_scfgrad_mp,orca_util_mpi,sh,slurmstepd 31474978 1.25187
mpirun,orca,orca_gstep,orca_gtoint_mpi,orca_scf_mpi,orca_scfgrad_mp,python,slurmstepd 29565045 1.17591
l1002.exe,l1110.exe,l502.exe,l701.exe,l703.exe 28925388 1.15047
l1002.exe,l103.exe,l1101.exe,l1110.exe,l302.exe,l401.exe,l502.exe,l701.exe,l703.exe 28653897 1.13967
mpirun,orca,orca_gtoint_mpi,orca_scf_mpi,orca_scfgrad_mp,orca_util_mpi,python,slurmstepd 28632784 1.13883
l1002.exe,l1002.exel,l303.exe,l502.exe,l502.exel,l701.exe,l702.exe,l703.exe,l703.exel,l716.exe,l801.exe,l914.exe,l914.exel 27941490 1.11133
ncl 26610918 1.05841
python 25413480 1.01079
java 24319444 0.967272
l1002.exe,l1002.exel,l502.exe,l502.exel,l701.exe,l702.exe,l703.exe,l703.exel,l716.exe,l801.exe,l914.exe,l914.exel 23696034 0.942477
csh,l302.exe,l303.exe,l502.exe,l601.exe,l701.exe,l703.exe 23543262 0.936401
l502.exe 23364153 0.929277
mpirun,orca_gstep,orca_gtoint_mpi,orca_scf_mpi,orca_scfgrad_mp,python,slurmstepd 23355595 0.928937
l1002.exe,l1110.exe,l502.exe,l703.exe 22539663 0.896484

This highlights a couple of things. One is the annoying appearance of slurmstepd (I think this is a Saga artifact, since I don't see it on Fox; it could be because sonar on Saga does not use the same process filtering as on Fox). The other is that many jobs are pipelines or trees of processes, which makes it harder to say that a particular program used so and so much compute.
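
Both points can be illustrated with a small post-processing sketch. It drops names treated as infrastructure noise (here only slurmstepd, which is an assumption) and re-aggregates, and it computes a per-program upper bound by crediting each program with the full CPU time of every command set it appears in. The upper bound is not a real attribution: the listing does not say how CPU time is split inside a command set. All function names are hypothetical.

```python
from collections import defaultdict

IGNORED = {"slurmstepd"}  # assumption: treat these names as infrastructure noise


def normalize(commands, ignored=IGNORED):
    """Drop ignored names from a comma-separated command set; keep it sorted."""
    kept = sorted(set(commands.split(",")) - ignored)
    return ",".join(kept)


def merge_after_filtering(rows):
    """Re-aggregate CPU seconds once ignored names are removed from the keys.

    A command set that consists only of ignored names is dropped entirely.
    """
    merged = defaultdict(int)
    for commands, cpu_seconds in rows:
        key = normalize(commands)
        if key:
            merged[key] += cpu_seconds
    return dict(merged)


def upper_bound_by_program(rows):
    """Sum, per program, the CPU seconds of every command set it appears in.

    This over-counts: each program gets credit for the whole command set.
    """
    bound = defaultdict(int)
    for commands, cpu_seconds in rows:
        for name in normalize(commands).split(","):
            if name:
                bound[name] += cpu_seconds
    return dict(bound)


# A few rows copied from the Saga listing above.
ROWS = [
    ("slurmstepd,vasp", 239477182),
    ("python,slurmstepd", 82719713),
    ("orca_scf_mpi,slurmstepd", 40480806),
    ("python", 25413480),
]

if __name__ == "__main__":
    print(merge_after_filtering(ROWS))
    print(upper_bound_by_program(ROWS))
```

With slurmstepd removed, the "python,slurmstepd" and "python" rows collapse into a single "python" entry, which changes the ranking compared with the raw listing.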
