Read data directly from GPU APIs #87

Open · lars-t-hansen opened this issue Aug 16, 2023 · 9 comments
Labels: enhancement (New feature or request), Later (Low priority / background task), performance

@lars-t-hansen (Collaborator)

Related to #86. Currently we run nvidia-smi and rocm-smi to obtain GPU data. This is bad for several reasons:

  • the output formats are idiosyncratic, undocumented, and unstable
  • sometimes we have to run the commands multiple times to get all the data we need

It would probably be much better to use the cards' programmatic APIs.

On the other hand, needing to link against these C libraries adds to the complexity of sonar and creates a situation where the same sonar binary may not be usable on all systems. A compromise would be to write small (probably C) wrapper programs around the programmatic APIs and invoke them from sonar. Each would need to be run only once per sample and would have a defined, compact output format.
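
A very rough sketch of what such a wrapper could look like on the NVIDIA side, using NVML; the program name, output format, and field selection are just illustrative:

/* Hypothetical standalone wrapper around NVML (name, output format and
   field selection are illustrative only).  Build with something like
   cc -o nv-probe nv-probe.c -lnvidia-ml; sonar would invoke it once per
   sample and parse the compact key=value lines. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t r = nvmlInit();
    if (r != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit: %s\n", nvmlErrorString(r));
        return 1;
    }
    unsigned int count = 0;
    if (nvmlDeviceGetCount(&count) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; i++) {
            nvmlDevice_t dev;
            nvmlUtilization_t util;
            nvmlMemory_t mem;
            /* A failing card should fail individually here rather than
               taking down the query for every card. */
            if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) {
                printf("gpu=%u error=unreachable\n", i);
                continue;
            }
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS &&
                nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS) {
                printf("gpu=%u sm%%=%u mem%%=%u mem_used_kib=%llu\n",
                       i, util.gpu, util.memory, mem.used / 1024);
            }
        }
    }
    nvmlShutdown();
    return 0;
}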

@lars-t-hansen (Collaborator Author)

This may be the same issue or a separate one: on a multi-card node it is usually a single card that goes down, but in that case nvidia-smi hangs or errors out for all the cards. Going directly to the API might fix that problem too.

(For NVIDIA at least that problem may be fixable while staying with nvidia-smi: we can enumerate devices by enumerating /dev/nvidia*, then use nvidia-smi -i to probe cards individually. But eight invocations for eight cards is not the situation we'd like to find ourselves in.)
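
For reference, roughly what that per-card probing could look like (the timeout and query fields are just illustrative choices):

/* Sketch: enumerate cards from /dev/nvidia[0-9]* and probe each one with a
   separate, time-limited nvidia-smi -i invocation, so that one wedged card
   cannot hang the query for all of them. */
#include <glob.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    glob_t g;
    if (glob("/dev/nvidia[0-9]*", 0, NULL, &g) != 0)
        return 1;
    for (size_t i = 0; i < g.gl_pathc; i++) {
        const char *idx = g.gl_pathv[i] + strlen("/dev/nvidia");
        char cmd[256];
        snprintf(cmd, sizeof cmd,
                 "timeout 5 nvidia-smi -i %s "
                 "--query-gpu=utilization.gpu,memory.used --format=csv,noheader",
                 idx);
        if (system(cmd) != 0)
            fprintf(stderr, "card %s did not respond\n", idx);
    }
    globfree(&g);
    return 0;
}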

@lars-t-hansen (Collaborator Author)

A related issue: currently on ML9, sonar data (and nvidia-smi) say that one card is at 100% utilization and the others are idle, but nvtop says two cards are running at 100%. It would be useful to try to reduce this discrepancy.

@lars-t-hansen (Collaborator Author)

Re the output format: On the very new node gpu-13.fox, nvidia-smi pmon now has a different output format and sonar is not able to parse it.

@lars-t-hansen (Collaborator Author)

ml1: NVIDIA System Management Interface -- v545.23.08

$ nvidia-smi pmon -c 1 -s u
# gpu         pid  type    sm    mem    enc    dec    command
# Idx           #   C/G     %      %      %      %    name
    0    1174916     C     88     54      -      -    python         
    0    1186862     C      -      -      -      -    python3        
    1    1174916     C     92     53      -      -    python         
    1    1223470     C      -      -      -      -    python3        
    2    1174916     C     89     53      -      -    python         
    2     941737     C      -      -      -      -    python3        

gpu-13.fox: NVIDIA System Management Interface -- v550.54.14

$ nvidia-smi pmon -c 1 -s u
# gpu         pid   type     sm    mem    enc    dec    jpg    ofa    command 
# Idx           #    C/G      %      %      %      %      %      %    name 
    0          -     -      -      -      -      -      -      -    -              
    1          -     -      -      -      -      -      -      -    -              
    2          -     -      -      -      -      -      -      -    -              
    3          -     -      -      -      -      -      -      -    -              

It looks like the sensible thing to do here would be to decode the "# gpu ..." header line and use it as a key into the columns of the data rows.
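
Roughly, that header-keyed parsing could look like this (assuming whitespace-separated columns, as in the dumps above):

/* Sketch: locate the "sm" and "mem" columns from the "# gpu ..." header of
   nvidia-smi pmon and use the positions to index the data rows, so that new
   columns such as jpg/ofa do not break the parse. */
#include <stdio.h>
#include <string.h>

#define MAXCOLS 32

static int split(char *line, char *fields[], int max) {
    int n = 0;
    for (char *tok = strtok(line, " \t\n"); tok != NULL && n < max;
         tok = strtok(NULL, " \t\n"))
        fields[n++] = tok;
    return n;
}

int main(void) {
    char line[512], buf[512];
    char *fields[MAXCOLS];
    int gpu_col = -1, pid_col = -1, sm_col = -1, mem_col = -1;
    while (fgets(line, sizeof line, stdin) != NULL) {
        strcpy(buf, line);
        int n = split(buf, fields, MAXCOLS);
        if (n > 1 && strcmp(fields[0], "#") == 0 && strcmp(fields[1], "gpu") == 0) {
            /* Header: "# gpu pid type sm mem ...".  Data rows have no "#",
               so column i in the header is field i-1 in a data row. */
            for (int i = 1; i < n; i++) {
                if (strcmp(fields[i], "gpu") == 0) gpu_col = i - 1;
                if (strcmp(fields[i], "pid") == 0) pid_col = i - 1;
                if (strcmp(fields[i], "sm") == 0)  sm_col = i - 1;
                if (strcmp(fields[i], "mem") == 0) mem_col = i - 1;
            }
        } else if (line[0] != '#' && sm_col >= 0 && n > mem_col) {
            printf("gpu=%s pid=%s sm=%s mem=%s\n", fields[gpu_col],
                   fields[pid_col], fields[sm_col], fields[mem_col]);
        }
    }
    return 0;
}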

@lars-t-hansen (Collaborator Author)

Going to fork that off as its own bug, and leave this bug to be about the original subject matter.

@lars-t-hansen (Collaborator Author)

As noted here, we would need to build against the NVIDIA library called "nvml" to do this (on NVIDIA). It is poorly documented and part of a larger SDK; it is unclear whether the SDK is needed on every machine or only at build time.

@lars-t-hansen (Collaborator Author)

From last week's Slurm conference: Slurm has been using nvml (and something similar for AMD) to talk to the GPUs, but they are finding this hard to manage - discrepancies between the build system and the deploy system are problematic. Plus, NVIDIA have reportedly been changing the API even after promising not to do so. They are finding that they can get what they need from the /sys filesystem instead, and in Slurm 24.11 the GPU monitoring will be via /sys. We should investigate this for the same reasons.
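
For AMD at least, the amdgpu driver exposes a per-card busy percentage in /sys, so a first sketch of the /sys route could be as simple as the following; what NVIDIA exposes there is something we would have to investigate separately:

/* Sketch: read per-card GPU utilization from sysfs on nodes with the amdgpu
   driver.  gpu_busy_percent is a single integer percentage.  NVIDIA
   coverage under /sys is not established here. */
#include <stdio.h>
#include <glob.h>

int main(void) {
    glob_t g;
    if (glob("/sys/class/drm/card[0-9]*/device/gpu_busy_percent", 0, NULL, &g) != 0)
        return 1;
    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f == NULL)
            continue;
        int busy = -1;
        if (fscanf(f, "%d", &busy) == 1)
            printf("%s: %d%%\n", g.gl_pathv[i], busy);
        fclose(f);
    }
    globfree(&g);
    return 0;
}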

@lars-t-hansen (Collaborator Author)

$ cat /proc/driver/nvidia/gpus/*/information
Model: 		 NVIDIA GeForce RTX 2080 Ti
IRQ:   		 296
GPU UUID: 	 GPU-35080357-601c-7113-ec05-f6ca1e58a91e
Video BIOS: 	 90.02.17.00.b2
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:18:00.0
Device Minor: 	 0
GPU Excluded:	 No
Model: 		 NVIDIA GeForce RTX 2080 Ti
IRQ:   		 297
GPU UUID: 	 GPU-be013a01-364d-ca23-f871-206fe3f259ba
Video BIOS: 	 90.02.0b.40.09
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:3b:00.0
Device Minor: 	 1
GPU Excluded:	 No
Model: 		 NVIDIA GeForce RTX 2080 Ti
IRQ:   		 298
GPU UUID: 	 GPU-daa9f6ac-c8bf-87be-8adc-89b1e7d3f38a
Video BIOS: 	 90.02.0b.40.09
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:86:00.0
Device Minor: 	 2
GPU Excluded:	 No
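
So at least the static device inventory can be read directly from the filesystem without running nvidia-smi. A sketch of picking it up (which keys to keep is an arbitrary choice here):

/* Sketch: enumerate static card information from /proc/driver/nvidia.
   The files are "Key: <tab> Value" lines as shown above. */
#include <stdio.h>
#include <string.h>
#include <glob.h>

int main(void) {
    glob_t g;
    if (glob("/proc/driver/nvidia/gpus/*/information", 0, NULL, &g) != 0)
        return 1;
    for (size_t i = 0; i < g.gl_pathc; i++) {
        FILE *f = fopen(g.gl_pathv[i], "r");
        if (f == NULL)
            continue;
        char line[256];
        while (fgets(line, sizeof line, f) != NULL) {
            char *colon = strchr(line, ':');
            if (colon == NULL)
                continue;
            *colon = '\0';
            char *val = colon + 1;
            while (*val == ' ' || *val == '\t')
                val++;
            if (strcmp(line, "Model") == 0 || strcmp(line, "GPU UUID") == 0 ||
                strcmp(line, "Device Minor") == 0)
                printf("%s=%s", line, val);   /* val still ends in '\n' */
        }
        fclose(f);
    }
    globfree(&g);
    return 0;
}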
