understanding GPU offloading #799
Unanswered
mariopaolo asked this question in Q&A
Replies: 1 comment 1 reply
-
I am experiencing the same general issue as the original poster, with only a couple of differences in my configuration.
Looking further at the Model Compatibility page, I notice that the falcon back-end lists support for 'CUDA' acceleration, while llama indicates support for 'CUDA, openCL, cuBLAS, Metal', and I don't see a `BUILD_TYPE=cuda` option (this may be a red herring, but I'm new to the project and am not sure). Echoing the original question: how can I enable GPU offloading on supported models, for instance on the mentioned `wizardlm-7b-uncensored.ggccv1.q5_1.bin`, and how do I make sure it's working as intended?
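For what it's worth, the closest build option I could find is `BUILD_TYPE=cublas` rather than `cuda`; if I'm reading the docs right, a GPU-enabled build or run looks roughly like the sketch below (repository URL, image tag, and flags are my reading of the docs, not something I've confirmed):

```bash
# Build from source with cuBLAS (NVIDIA CUDA) support.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
BUILD_TYPE=cublas make build

# Or run the prebuilt CUDA image and expose the GPU to the container.
docker run -p 8080:8080 --gpus all \
  -v "$PWD/models:/models" \
  quay.io/go-skynet/local-ai:master-cublas-cuda11-ffmpeg \
  --models-path /models
```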
-
hi everyone, I just deployed `localai` on a k3s cluster (TrueCharts app on TrueNAS SCALE). My configuration is:
- image: `master-cublas-cuda11-ffmpeg`
- build type: `cublas`
- GPU: GTX 1070 (8GB)
When inspecting the container vars I see, among the other env vars:
and the logs show:
Everything is working and I can successfully use all the `localai` endpoints. After reading this page, I realized only a few models have CUDA support, so I downloaded one of the supported ones to see if the GPU would kick in. `wizardlm-7b-uncensored.ggccv1.q5_1.bin` should be supported as per the footnote. I added the suggested overrides as documented here in the notes:
I installed it with this JSON:
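(roughly along these lines, going through LocalAI's `/models/apply` endpoint; the gallery URL, model name, and `gpu_layers` value below are placeholders rather than my exact values:)

```bash
# Illustrative model install with GPU-related overrides; the URL, name and
# gpu_layers value are placeholders, not the exact JSON from my setup.
curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{
        "url": "github:go-skynet/model-gallery/wizardlm-uncensored.yaml",
        "name": "wizardlm-7b-uncensored",
        "overrides": {
          "f16": true,
          "gpu_layers": 35
        }
      }'
```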
I then went ahead and called the `chat/completions` endpoint:
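The request is the usual OpenAI-style payload, roughly like this (model name and prompt are placeholders):

```bash
# Illustrative chat/completions request; model name and prompt are placeholders.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "wizardlm-7b-uncensored",
        "messages": [{"role": "user", "content": "How are you?"}],
        "temperature": 0.7
      }'
```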
While `localai` is doing its thing I run `nvidia-smi` on the host, and I get this:

I am puzzled to see this, because yes, the GPU is doing *something*, but it doesn't look like the model is being offloaded to the GPU: 189MiB is a very low figure, and I don't see the corresponding logs hinted at here:

> And if the GPU inferencing is working, you should be able to see something like:
On the other hand, I was able to run the suggested example using `gpt-3.5-turbo`, and indeed when calling it I can see the model being offloaded to the GPU in `nvidia-smi`, even though the request just hangs indefinitely and never finishes (or it takes forever, which it really shouldn't, since the GPU should be faster); this seems related to this issue.

So my question is: how can I enable GPU offloading on supported models, for instance on the mentioned `wizardlm-7b-uncensored.ggccv1.q5_1.bin`? And how do I make sure it's working as intended?
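(To be concrete about what I mean by "working as intended": a fully offloaded 7B q5_1 model should take up several GiB of VRAM while answering, which is easy to watch from the host; the command below is plain `nvidia-smi`, nothing LocalAI-specific:)

```bash
# Watch VRAM usage and GPU utilization on the host while a request is running;
# ~189MiB strongly suggests the model weights are not actually offloaded.
watch -n 1 nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv
```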
Thanks for the amazing software, hope to get this sorted out.