feat: automatically adjust default gpu_layers by available GPU memory #3541

Open
mudler opened this issue Sep 13, 2024 · 4 comments · May be fixed by #3737
Labels: enhancement (New feature or request), roadmap

Comments

mudler (Owner) commented Sep 13, 2024

Is your feature request related to a problem? Please describe.
Having a high default number of GPU layers doesn't always work. For instance, big models can overcommit the card's memory and force the user to configure gpu_layers manually.

Describe the solution you'd like
With libraries like https://github.com/gpustack/gguf-parser-go we could identify beforehand how much GPU VRAM is available and adjust the default settings accordingly.

Describe alternatives you've considered
Keep things as is

Additional context

mudler added the enhancement (New feature or request) and roadmap labels on Sep 13, 2024
mudler changed the title from "automatically adjust default gpu_layers by available GPU memory" to "feat: automatically adjust default gpu_layers by available GPU memory" on Sep 13, 2024
mudler mentioned this issue on Sep 23, 2024
siddimore (Contributor) commented:

@mudler happy to take this task and work on it. I have to think a bit about the approach or search around for alternatives.

siddimore (Contributor) commented Oct 2, 2024

@mudler rough design/thoughts on adding this feature. ChatGPT-generated markdown for the solution:

Design Document: Optimizing GPU Layer Configuration in LocalAI Using gguf-parser

Overview

A rough solution for optimizing the GPU layer configuration when using LocalAI to run large models such as Qwen2-72B-Instruct-GGUF. The optimization leverages the gguf-parser library to dynamically adjust GPU memory usage based on the model's requirements and the available hardware resources.

Problem Statement

Large models like Qwen2-72B-Instruct-GGUF can easily exceed the VRAM capacity of a single GPU, requiring manual tuning of GPU layers to fit the model within the available memory. Overcommitting the GPU with layers can lead to reduced performance, especially on systems with limited GPU memory.

Solution Approach

Dynamically adjust the GPU layer configuration based on the model metadata provided by gguf-parser. This approach will allow us to:

  • Estimate VRAM usage and distribute model layers across multiple GPUs.
  • Offload layers between system memory and GPU memory if necessary.
  • Ensure optimal performance without manual intervention.

Key Features

  • VRAM Estimation: Use gguf-parser to estimate GPU memory requirements.
  • Dynamic Layer Distribution: Use the --tensor-split and --rpc flags to distribute layers across multiple GPUs and servers.
  • Batch Size Adjustment: Adjust the batch size to fit within available memory while maintaining performance.
  • Flash Attention Tuning: Enable/disable flash attention based on hardware capabilities to optimize performance.

Components

  1. LocalAI Instance:

    • Runs the model using the optimized GPU configuration.
    • Distributes layers across multiple GPUs based on VRAM estimation.
  2. gguf-parser Integration:

    • Parses the model metadata to provide the following details:
      • VRAM requirement per GPU
      • Layer distribution for both local and remote GPUs
      • Batch size and context length
      • Offloading support (RAM usage for system memory)
  3. Layer Distribution and Offloading Logic:

    • Adjusts the number of GPU layers dynamically based on the VRAM and RPC flags.
    • If the model's VRAM requirement exceeds the available capacity, offloads the excess to system memory or distributes it across multiple GPUs (see the sketch below).
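
A minimal sketch of that adjustment step, assuming the per-layer VRAM cost is roughly uniform (in practice the KV cache and output tensors skew this); totalLayers and vramNeededBytes would come from the gguf-parser estimate, and vramFreeBytes from the GPU driver:

package main

import "fmt"

// fitGPULayers scales the number of offloaded layers so that the estimated
// VRAM usage fits into the free VRAM reported by the driver. It assumes the
// cost is spread roughly evenly across layers, which is an approximation.
func fitGPULayers(totalLayers int, vramNeededBytes, vramFreeBytes uint64) int {
	if totalLayers <= 0 || vramNeededBytes == 0 {
		return 0
	}
	if vramFreeBytes >= vramNeededBytes {
		return totalLayers // everything fits, offload the full model
	}
	layers := int(float64(totalLayers) * float64(vramFreeBytes) / float64(vramNeededBytes))
	if layers < 0 {
		layers = 0
	}
	return layers
}

func main() {
	// Illustrative numbers only: 80 layers needing ~73.47 GiB, 24 GiB free.
	need := 73.47 * float64(1<<30)
	fmt.Println(fitGPULayers(80, uint64(need), 24<<30)) // prints 26
}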

Reference

Workflow

  1. Model Parsing with gguf-parser:

    • Retrieve model metadata using gguf-parser.
      gguf-parser --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"
    • Key metrics:
      • Model size: 72.71 B parameters (~59.92 GiB)
      • VRAM requirement: 73.47 GiB
      • Transformer layers: 80 layers
      • Supported flags: --tensor-split, --rpc
      • Offloading capability: unsupported for distributed inference
  2. VRAM and Layer Adjustment:

    • Compare the model's VRAM requirement with the available VRAM on the system.
    • If the model exceeds the VRAM limit, adjust the number of layers or distribute them across multiple GPUs using --tensor-split.
    • Example command to split the model across two GPUs:
      local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"
  3. Batch Size and Context Length Adjustment:

    • The recommended logical/physical batch size for this model is 2048 / 512 tokens.
    • Dynamically adjust the batch size based on available memory to prevent memory overrun:
      local-ai --batch-size=512 --ctx-size=32768 --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"
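
Step 3 could be handled the same way. A rough sketch of a batch-size clamp, where fitsInMemory is a hypothetical placeholder for re-running the gguf-parser estimate with the candidate batch size (the per-slot cost used here is made up purely for illustration):

package main

import "fmt"

// fitsInMemory stands in for a real check (e.g. re-estimating memory usage
// with the candidate batch size); the ~4 MiB per batch slot used here is an
// assumption for the example only.
func fitsInMemory(batchSize int, freeBytes uint64) bool {
	const bytesPerSlot = 4 << 20
	return uint64(batchSize)*bytesPerSlot <= freeBytes
}

// clampBatchSize halves the physical batch size until the estimate fits,
// never dropping below a floor of 32.
func clampBatchSize(recommended int, freeBytes uint64) int {
	b := recommended
	for b > 32 && !fitsInMemory(b, freeBytes) {
		b /= 2
	}
	return b
}

func main() {
	fmt.Println(clampBatchSize(512, 1<<30)) // prints 256 under the toy cost model
}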

Estimation Process

VRAM Estimation (Per GPU)

Using gguf-parser, the following memory requirements are extracted:

  • VRAM for one GPU: 73.47 GiB (full model on a single GPU)
  • RAM Offload: 441.38 MiB can be used for offloading parts of the model to system memory.

Tensor Split for Multi-GPU Setup

The model can be distributed across multiple GPUs using the --tensor-split flag:

  • Example: 50% of the model layers on GPU 0 and 30% on GPU 1.
    local-ai --tensor-split="0:50,1:30" --path="~/.cache/lm-studio/models/Qwen/Qwen2-72B-Instruct-GGUF/qwen2-72b-instruct-q6_k-00001-of-00002.gguf"
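
If the free VRAM of each GPU can be queried (e.g. via the driver), the split proportions could simply mirror those values; a sketch of that idea, leaving the query itself abstract. Note that upstream llama.cpp expects --tensor-split as a plain comma-separated list of proportions (e.g. "50,30") and normalizes them itself, so the "0:50,1:30" syntax above may need adapting.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tensorSplitFromFreeVRAM builds a comma-separated proportion list from the
// free VRAM (in bytes) of each visible GPU; the proportions are normalized
// downstream, so whole GiB values are good enough.
func tensorSplitFromFreeVRAM(freeBytes []uint64) string {
	parts := make([]string, len(freeBytes))
	for i, b := range freeBytes {
		parts[i] = strconv.FormatUint(b>>30, 10)
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two GPUs with ~50 GiB and ~30 GiB free, matching the example above.
	fmt.Println(tensorSplitFromFreeVRAM([]uint64{50 << 30, 30 << 30})) // "50,30"
}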

siddimore (Contributor) commented Oct 4, 2024

@mudler @sozercan Some more context now that I have a working prototype for parsing GGUF models.

Using gguf-parser, I have the following output for the model Meta Llama 3.1 405B Instruct:

RAM and VRAM estimates:
{ "estimate": { "items": [ { "offloadLayers": 127, "fullOffloaded": true, "ram": { "uma": 364586936, "nonuma": 521873336 }, "vram": [ { "uma": 16106705920, "nonuma": 34165613568 } ] } ], "type": "model", "architecture": "llama", "contextSize": 131072, "flashAttention": false, "noMMap": false, "embeddingOnly": false, "distributable": true, "logicalBatchSize": 2048, "physicalBatchSize": 512 }, "architecture": { "type": "model", "architecture": "llama", "maximumContextLength": 131072, "embeddingLength": 16384, "blockCount": 126, "feedForwardLength": 53248, "attentionHeadCount": 128, "attentionHeadCountKV": 16, "attentionLayerNormRMSEpsilon": 0.00001, "attentionKeyLength": 128, "attentionValueLength": 128, "attentionCausal": true, "ropeDimensionCount": 128, "ropeFrequencyBase": 500000, "vocabularyLength": 128256, "embeddingGQA": 8, "embeddingKeyGQA": 2048, "embeddingValueGQA": 2048 }, "metadata": { "type": "model", "architecture": "llama", "quantizationVersion": 2, "alignment": 32, "name": "Models Meta Llama Meta Llama 3.1 405B Instruct", "license": "llama3.1", "fileType": 10, "littleEndian": true, "fileSize": 17239928096, "size": 17232101376, "parameters": 47232516096, "bitsPerWeight": 2.9186844657567317 }, "tokenizer": { "model": "gpt2", "tokensLength": 128256, "mergesLength": 280147, "addedTokenLength": 0, "bosTokenID": 128000, "eosTokenID": -1, "eotTokenID": -1, "eomTokenID": -1, "unknownTokenID": -1, "separatorTokenID": -1, "paddingTokenID": -1, "tokensSize": 2099452, "mergesSize": 5204765 }

Based on the above values, rough math for a machine with 10 GB of VRAM puts GPU_Layers at 37 layers, which can then be set in LocalAI as a parameter to pass down to llama.cpp.
[Screenshot: 2024-10-03 at 11.11.01 PM]
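
For reference, that rough math reproduced with the nonuma VRAM figure and offloadLayers count from the estimate above; the 10 GB budget is the example value from this comment, not something the parser reports:

package main

import "fmt"

func main() {
	// Values taken from the gguf-parser estimate above.
	const (
		estimatedVRAM = 34165613568 // bytes to offload all layers (vram nonuma)
		offloadLayers = 127         // layers covered by that estimate
	)
	availableVRAM := 10_000_000_000.0 // example 10 GB card

	perLayer := float64(estimatedVRAM) / float64(offloadLayers)
	gpuLayers := int(availableVRAM / perLayer)
	if gpuLayers > offloadLayers {
		gpuLayers = offloadLayers
	}
	fmt.Println(gpuLayers) // prints 37, matching the figure above
}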

mudler (Owner, Author) commented Oct 4, 2024

(quoting siddimore's comment above)

That sounds like the right direction. It would be cool now to instrument the library from the code and set the GPU layers in the model defaults accordingly:

defaultHigh := 99999999
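
One way the wiring could look, keeping the existing very-high default as the fallback whenever no estimate is available; estimateGPULayers is a hypothetical stand-in for whatever the gguf-parser-based prototype ends up exposing, not an existing LocalAI function:

package main

import "fmt"

// estimateGPULayers is a stand-in for the gguf-parser-based prototype; the
// name, signature, and stub body are assumptions for illustration only.
func estimateGPULayers(modelPath string) (int, bool) {
	_ = modelPath
	return 0, false // pretend no estimate is available
}

// defaultGPULayers keeps the existing very-high default as a fallback and
// only overrides it when an estimate could be computed.
func defaultGPULayers(modelPath string) int {
	defaultHigh := 99999999 // current behaviour: offload "everything"
	if layers, ok := estimateGPULayers(modelPath); ok {
		return layers
	}
	return defaultHigh
}

func main() {
	fmt.Println(defaultGPULayers("qwen2-72b-instruct-q6_k.gguf")) // 99999999 with the stub
}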

siddimore linked a pull request (#3737) on Oct 9, 2024 that will close this issue