ROCm: Memory calculation fails to take HIP_VISIBLE_DEVICES into account #1104

Open
jeroen-mostert opened this issue Aug 30, 2024 · 4 comments · May be fixed by #1106
Comments


jeroen-mostert commented Aug 30, 2024

I have ROCm compiled with support for both the discrete GPU and the iGPU, but with HIP_VISIBLE_DEVICES set to 0 to ensure only the discrete GPU is considered (the iGPU is just for experimenting; it's far too slow to use meaningfully). But because rocminfo and rocm-smi list both GPUs, setting gpulayers to -1 to trigger the automatic calculation bases the result on the reported iGPU memory. This is wrong/suboptimal in two respects:

  1. The "reported memory" for the iGPU takes only the reserved GPU memory into account, but compute memory actually ends up as GTT memory, meaning all of system memory is available rather than the few piddly MB for video graphics (from Linux 6.10 onwards at least, but I've also observed this happening on 6.9). To get any kind of speed out of this you need to compile GGML with GGML_HIP_UMA, though, which tanks perf on discrete GPUs -- but I digress.
  2. With HIP_VISIBLE_DEVICES set, the iGPU is not actually used for inference at all, yet its smaller reported memory still drives the calculation, causing it to always offload 0 layers.

If I ugly-hack my local koboldcpp.py to simply ignore any devices beyond the first, the auto-layer calculation does its job correctly. I'm too lazy to write a real patch fixing the problem, I just wanted to mention it here. Taking HIP_VISIBLE_DEVICES (and/or CUDA_VISIBLE_DEVICES, which AMD supports for compatibility) into account should not be particularly hard. Taking the iGPU reported-memory thing into account is probably too complicated (AFAIK there isn't even a reliable way to detect that it's an iGPU; things like the marketing name being "AMD Radeon Graphics" and the UUID being "GPU-XX" are certainly suggestive, but hardly convincing). If you're using an iGPU (or, God forbid, a combination) you probably don't want to rely on the auto-layer feature anyway.
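
For illustration, something along these lines would do it. This is a minimal sketch, not the actual koboldcpp code; the function names and the place it would hook in are made up:

import os

def visible_device_indices(device_count):
    # Check HIP_VISIBLE_DEVICES first, then CUDA_VISIBLE_DEVICES (which AMD
    # supports for compatibility). Only plain numeric indices are handled here;
    # UUID-style entries are out of scope for this sketch.
    raw = os.environ.get("HIP_VISIBLE_DEVICES") or os.environ.get("CUDA_VISIBLE_DEVICES")
    if not raw or not raw.strip():
        return list(range(device_count))
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if token.isdigit() and int(token) < device_count:
            indices.append(int(token))
    return indices

def filter_reported_vram(vram_totals):
    # vram_totals: VRAM totals in rocm-smi enumeration order; keep only the
    # devices that HIP will actually expose to the process.
    keep = visible_device_indices(len(vram_totals))
    return [vram_totals[i] for i in keep]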

I don't know if Nvidia would have similar issues with CUDA_VISIBLE_DEVICES active.

@LostRuins
Owner

Do you know if I am able to differentiate between the GPU types in rocminfo? I don't have AMD hardware so I can't tell. Ideally there would be a rocminfo flag to skip iGPUs.


jeroen-mostert commented Aug 30, 2024

Yeah, that would be ideal, wouldn't it? :P rocminfo has no options at all, not even --help or --version. rocm-smi is a lot more detailed, but despite all the options it has, there is nothing I could see that screams "iGPU". You just sort of have to know that gfx1036 (Raphael) is an iGPU arch, for example. There is a CUDA API for getting device info that has an "integrated" flag, and even that just returns 0 for the iGPU in the HIP emulation, as though it were discrete.

FWIW this is the output of rocm-smi --showproductname --showmeminfo vram --showuniqueid --csv:

device,Unique ID,VRAM Total Memory (B),VRAM Total Used Memory (B),Card Series,Card Model,Card Vendor,Card SKU,Subsystem ID,Device Rev,Node ID,GUID,GFX Version
card0,0x9246____________,17163091968,692142080,Navi 21 [Radeon RX 6800/6800 XT / 6900 XT],0x73bf,Advanced Micro Devices Inc. [AMD/ATI],001,0x2406,0xc1,1,45534,gfx1030
card1,N/A,67108864,26079232,Raphael,0x164e,Advanced Micro Devices Inc. [AMD/ATI],RAPHAEL,0x364e,0xc6,2,52156,gfx1036

I stubbed out the ID of my discrete card just in case anyone could get up to shenanigans with it (I don't think it's actually the serial number, because there's a separate option for that and it never returns anything, but who knows). So maybe the unique ID is usable, since the iGPU reports "N/A"? But I couldn't tell you if that's a guarantee.
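
If you wanted to try that heuristic anyway, it would be something like the sketch below; the "N/A" check and the arch list beyond gfx1036 are assumptions on my part, not guarantees:

# Heuristic sketch for guessing whether a rocm-smi entry is an iGPU. gfx1036
# (Raphael) comes from my output; the other arch names and the
# Unique ID == "N/A" rule are assumptions and may well have exceptions.
KNOWN_IGPU_ARCHES = {"gfx902", "gfx90c", "gfx1035", "gfx1036", "gfx1103"}

def looks_like_igpu(card):
    # card: one device entry, e.g. data["card1"] from the --json output shown below.
    if card.get("Unique ID") == "N/A":
        return True
    return card.get("GFX Version", "").lower() in KNOWN_IGPU_ARCHES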

Incidentally, there is also a --json option, which is probably more reliable/convenient than hard-coding the CSV column order as is done now (given that the CSV header isn't actually being parsed). Here's how that looks (except the real output is unformatted):

{
  "card0": {
    "Unique ID": "0x9246____________",
    "VRAM Total Memory (B)": "17163091968",
    "VRAM Total Used Memory (B)": "708767744",
    "Card Series": "Navi 21 [Radeon RX 6800/6800 XT / 6900 XT]",
    "Card Model": "0x73bf",
    "Card Vendor": "Advanced Micro Devices, Inc. [AMD/ATI]",
    "Card SKU": "001",
    "Subsystem ID": "0x2406",
    "Device Rev": "0xc1",
    "Node ID": "1",
    "GUID": "45534",
    "GFX Version": "gfx1030"
  },
  "card1": {
    "Unique ID": "N/A",
    "VRAM Total Memory (B)": "67108864",
    "VRAM Total Used Memory (B)": "26079232",
    "Card Series": "Raphael",
    "Card Model": "0x164e",
    "Card Vendor": "Advanced Micro Devices, Inc. [AMD/ATI]",
    "Card SKU": "RAPHAEL",
    "Subsystem ID": "0x364e",
    "Device Rev": "0xc6",
    "Node ID": "2",
    "GUID": "52156",
    "GFX Version": "gfx1036"
  }
}
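
Reading that with the json module is then only a few lines; a sketch, assuming rocm-smi is on PATH and invoked with the same flags as above:

import json
import subprocess

def rocm_vram_totals():
    # Query rocm-smi for VRAM totals via --json instead of relying on the CSV
    # column order. The values come back as strings of bytes.
    out = subprocess.check_output(
        ["rocm-smi", "--showmeminfo", "vram", "--json"], encoding="utf-8"
    )
    data = json.loads(out)
    return {card: int(info["VRAM Total Memory (B)"]) for card, info in data.items()}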


jeroen-mostert commented Aug 30, 2024

To be clear, I'm not suggesting koboldcpp should actually do something like detect iGPUs and actively ignore them -- this would be counterproductive for people who actually want to use them, which is absolutely possible on most AMD configurations (though you may need HSA_OVERRIDE_GFX_VERSION). AMD's iGPUs tend to be genuinely useful for compute work, unlike Intel's (pre-Xe, at least). CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES is pretty well established as a GPU selection tool and already restricts koboldcpp just fine; it's just that the HW reporting tools ignore it (at least the AMD ones do, no clue about NV).


jeroen-mostert commented Aug 30, 2024

Decided to be un-lazy and cook up a PR anyway. The logic may or may not apply to NV as well, per my earlier comment.

Note that the calculation as a whole is, I think, wrong/suspect/deficient when multiple GPUs are used simultaneously: there seems to be no attempt at using tensor_split, meaning the layer calculation will default to the smallest GPU even if the model is being distributed. But that's another issue altogether.
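
For completeness, an illustration of what a distribution-aware version might look like; this is not koboldcpp's actual logic, just a sketch with made-up names:

def plan_offload(free_bytes_per_gpu, layer_bytes, total_layers):
    # Size the offload against the combined budget of all selected GPUs and
    # derive a tensor_split proportional to each GPU's free memory, instead of
    # sizing everything against the smallest GPU.
    budget = sum(free_bytes_per_gpu)
    if budget == 0 or layer_bytes == 0:
        return 0, []
    layers = min(total_layers, int(budget // layer_bytes))
    split = [free / budget for free in free_bytes_per_gpu]
    return layers, split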
