Ad-hoc report: fox heavy GPU usage #522

Closed
2 of 3 tasks
lars-t-hansen opened this issue Jun 21, 2024 · 2 comments

lars-t-hansen commented Jun 21, 2024

Experimental / speculative.

This comes out of https://gitlab.sigma2.no/naic/wp2/identify-most-resource-intensive-users/-/issues/1. We will try to create an ad-hoc report that:

  • takes a time interval of interest (typically some range of dates) and a cluster name as arguments
  • finds jobs that ran for at least 24h
  • produces a list of jobs (user+job data) on that cluster that used at least one gpu-day over the lifetime of the job
  • annotates a job with a mark if the job used one gpu-day in a 24h period
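
The selection criteria above could be sketched roughly as follows. This is only an illustration: `JobSummary` and its fields are hypothetical names, not the tool's actual data model, and it assumes each job has already been reduced to a wall-clock duration and a total GPU time in seconds (summed over all cards).

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60  # one gpu-day is 86400 GPU-seconds


@dataclass
class JobSummary:
    # Hypothetical per-job record; field names are assumptions.
    user: str
    host: str
    duration_s: int  # wall-clock running time of the job
    gpu_time_s: int  # GPU-seconds, summed over all cards


def select_heavy_gpu_jobs(jobs):
    """Keep jobs that ran at least 24h and used at least one gpu-day in total."""
    return [
        j for j in jobs
        if j.duration_s >= SECONDS_PER_DAY and j.gpu_time_s >= SECONDS_PER_DAY
    ]


def gpu_time_per_duration_pct(job):
    """GPU time relative to duration, in percent (can exceed 100 with several cards)."""
    return round(100 * job.gpu_time_s / job.duration_s)
```

The time interval and cluster name would be applied when loading the job records, before this filter runs.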

Soft dependencies (we can do without them for now):

lars-t-hansen self-assigned this Jun 21, 2024

lars-t-hansen commented

The output from the prototype report looks like this (manually reformatted a little b/c good formatting is not currently implemented):

>24h User            GpuTime GpuTime/  Host(s)      Command
                             duration
     ec-lgcharpe     438260s    97%    gpu-1.fox    python3,wandb-service(2
     ec-nicoca       131084s    73%    gpu-7.fox    python3
     ec-nicoca       310050s    73%    gpu-7.fox    python3
     ec-nicoca       321326s    75%    gpu-1.fox    python3
*    ec-thallesss    954508s   353%    gpu-8.fox    python,python_<defunct>,torchrun
*    ec-thallesss    1495751s  349%    gpu-2.fox    python,python_<defunct>,torchrun
     ec-dhananjt     115676s    27%    gpu-12.fox   python
     ec-nicoca       317770s    73%    gpu-7.fox    python3
     ec-abgani       371619s    86%    gpu-7.fox    2_run.sh,pmemd.cuda,pmemd.cuda_<defunct>,slurm_script
     ec-nicoca       165963s    73%    gpu-7.fox    python3
*    ec-thallesss    1132497s  353%    gpu-8.fox    python,torchrun
*    ec-thallesss    583203s   358%    gpu-2.fox    python,torchrun

The mark in the left column indicates that the job used at least one full GPU's worth of time over a 24h period of running. Note that it would also be marked if it used two GPUs at 60% each, say, for the same period (120% total).
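
One way the mark could be computed is a sliding 24h window over the job's GPU usage. The sketch below is an assumption, not the prototype's actual logic: it supposes usage has been bucketed into GPU-seconds per hour of running time (summed over all cards), and `used_gpu_day_in_24h` is a hypothetical name.

```python
GPU_DAY_S = 24 * 60 * 60  # 86400 GPU-seconds
WINDOW_HOURS = 24


def used_gpu_day_in_24h(hourly_gpu_seconds):
    """Return True if any 24 consecutive hourly buckets sum to at least one gpu-day.

    hourly_gpu_seconds: list of GPU-seconds per hour of running time,
    summed over all cards (assumed input format).
    """
    if len(hourly_gpu_seconds) < WINDOW_HOURS:
        return False
    total = sum(hourly_gpu_seconds[:WINDOW_HOURS])
    if total >= GPU_DAY_S:
        return True
    # Slide the window one hour at a time, updating the running sum.
    for i in range(WINDOW_HOURS, len(hourly_gpu_seconds)):
        total += hourly_gpu_seconds[i] - hourly_gpu_seconds[i - WINDOW_HOURS]
        if total >= GPU_DAY_S:
            return True
    return False
```

With this rule, two GPUs at 60% each contribute 4320 GPU-seconds per hour, which exceeds one gpu-day over 24 hours, while a single GPU at 73% (2628 GPU-seconds per hour) never does.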

lars-t-hansen commented

I guess the report could usefully show time as dd:hh:mm:ss. 115676s (the shortest one) is 32.1 hours. 1495751s (the longest one) is 415.5 hours (divided by four cards, so over 100 hours of real time).
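
A minimal sketch of such a dd:hh:mm:ss formatter, assuming the input is a whole number of seconds (`format_ddhhmmss` is a hypothetical name, not an existing function in the tool):

```python
def format_ddhhmmss(seconds):
    """Format a non-negative integer number of seconds as dd:hh:mm:ss."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{secs:02d}"


# The shortest and longest GpuTime values from the table above.
print(format_ddhhmmss(115676))   # 01:08:07:56
print(format_ddhhmmss(1495751))  # 17:07:29:11
```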

Plenty of other information is available about these jobs, if of interest.
