understanding GPU offloading #799
Unanswered
mariopaolo asked this question in Q&A
Replies: 1 comment 1 reply
-
I am experiencing the same general issue as the original poster, with only a couple of differences in my configuration.
Looking further at the Model Compatibility page, I notice that the falcon back-end lists support for 'CUDA' acceleration, while llama indicates support for 'CUDA, openCL, cuBLAS, Metal', and I don't see a `BUILD_TYPE=cuda` option (this may be a red herring, but I'm new to the project and am not sure). Echoing the original question: how can I enable GPU offloading on supported models, for instance on the mentioned `wizardlm-7b-uncensored.ggccv1.q5_1.bin`, and how do I make sure it's working as intended?
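For what it's worth, the closest build option I could find is `BUILD_TYPE=cublas` rather than `cuda`; if I'm reading the docs right, a GPU-enabled build or run looks roughly like the sketch below (repository URL, image tag, and flags are my reading of the docs, not something I've confirmed):

```bash
# Build from source with cuBLAS (NVIDIA CUDA) support.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
BUILD_TYPE=cublas make build

# Or run the prebuilt CUDA image and expose the GPU to the container.
docker run -p 8080:8080 --gpus all \
  -v "$PWD/models:/models" \
  quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg \
  --models-path /models
```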
-
hi everyone, I just deployed `localai` on a k3s cluster (TrueCharts app on TrueNAS SCALE). My configuration is:
- image: `master-cublas-cuda11-ffmpeg`
- build type: `cublas`
- GPU: GTX 1070 (8GB)
When inspecting the container vars I see, among the other env vars:
and the logs show:
Everything is working and I can successfully use all the `localai` endpoints. After reading this page, I realized only a few models have CUDA support, so I downloaded one of the supported ones to see if the GPU would kick in. `wizardlm-7b-uncensored.ggccv1.q5_1.bin` should be supported as per the footnote. I added the suggested overrides as documented here in the notes:
I installed it with this JSON:
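(roughly along these lines, going through LocalAI's `/models/apply` endpoint; the gallery URL, model name, and `gpu_layers` value below are placeholders rather than my exact values:)

```bash
# Illustrative model install with GPU-related overrides; the URL, name and
# gpu_layers value are placeholders, not the exact JSON from my setup.
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
        "url": "github:go-skynet/model-gallery/wizardlm-uncensored.yaml",
        "name": "wizardlm-7b-uncensored",
        "overrides": {
          "f16": true,
          "gpu_layers": 35
        }
      }'
```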
I then went ahead and called the `chat/completions` endpoint:
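The request is the usual OpenAI-style payload, roughly like this (model name and prompt are placeholders):

```bash
# Illustrative chat/completions request; model name and prompt are placeholders.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "wizardlm-7b-uncensored",
        "messages": [{"role": "user", "content": "How are you?"}],
        "temperature": 0.7
      }'
```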
While `localai` is doing its thing I run `nvidia-smi` on the host, and I get this:

I am puzzled to see this, because yes, the GPU is doing *something*, but it doesn't look like the model is being offloaded to the GPU: 189MiB is a very low figure, and I don't see the corresponding logs hinted at here:

> And if the GPU inferencing is working, you should be able to see something like:
On the other hand, I was able to run the suggested example using `gpt-3.5-turbo`, and indeed when calling it I can see the model being offloaded to the GPU in `nvidia-smi`, even though the request just hangs indefinitely and never finishes (or it takes forever, which it really shouldn't, since the GPU should be faster); this seems related to this issue.

So my question is: how can I enable GPU offloading on supported models, for instance on the mentioned `wizardlm-7b-uncensored.ggccv1.q5_1.bin`? And how do I make sure it's working as intended?
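(To be concrete about what I mean by "working as intended": a fully offloaded 7B q5_1 model should take up several GiB of VRAM while answering, which is easy to watch from the host; the command below is plain `nvidia-smi`, nothing LocalAI-specific:)

```bash
# Watch VRAM usage and GPU utilization on the host while a request is running;
# ~189MiB strongly suggests the model weights are not actually offloaded.
watch -n 1 nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv
```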
Thanks for the amazing software, hope to get this sorted out.