llama : NvAPI performance state change support #8116

Closed
wants to merge 10 commits into from

@sasha0552 sasha0552 commented Jun 25, 2024

Note: I will continue this PR a bit later

Related: #8084

Reference implementation

TODO:

  • Implement performance state switching functions
  • Place performance state switching calls in a common function before/after inference start/end
    • Switch only if Pascal GPU(s) present
  • Compile only if CUDA enabled
    • Enable by default if CUDA enabled, otherwise disable
  • Log performance state changes and library loading status
  • Synchronize pstate changes between n instances of llama.cpp on a single GPU
  • Clean up temporary/debug code
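
The TODO items about switching before/after inference and keeping several users in sync could be sketched roughly as follows. This is only an illustration, not the code from this PR: `PState`, `PStateSwitcher`, `InferenceScope` and the `apply` callback are hypothetical names, and the actual driver call is left to the callback. It also only coordinates threads inside one process; synchronizing separate llama.cpp instances would need some cross-process mechanism that is not shown here.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical performance states, for illustration only.
enum class PState { Low, High };

// Reference-counted switcher: the first active inference raises the state,
// the last one to finish lowers it again.
class PStateSwitcher {
public:
    // `apply` is an assumed callback that performs the real state change.
    explicit PStateSwitcher(std::function<void(PState)> apply)
        : apply_(std::move(apply)) {}

    void enter() {
        std::lock_guard<std::mutex> lock(mu_);
        if (active_++ == 0) apply_(PState::High);
    }

    void leave() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(active_ > 0);  // leave() must be paired with an earlier enter()
        if (--active_ == 0) apply_(PState::Low);
    }

private:
    std::function<void(PState)> apply_;
    std::mutex mu_;
    int active_ = 0;
};

// RAII guard: construct at inference start, destroy at inference end.
class InferenceScope {
public:
    explicit InferenceScope(PStateSwitcher & s) : s_(s) { s_.enter(); }
    ~InferenceScope() { s_.leave(); }
    InferenceScope(const InferenceScope &) = delete;
    InferenceScope & operator=(const InferenceScope &) = delete;

private:
    PStateSwitcher & s_;
};
```

With a guard like this, overlapping inferences trigger a single switch up and a single switch down, which is also a convenient place to log the state changes.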

Alternative implementations (just thoughts):

  • A separate daemon process?
  • Add options in main/server/etc to allow calling processes before/after inference? (probably the simplest solution)

@mofosyne mofosyne added the Review Complexity : Medium Generally require more time to grok but manageable by beginner to medium expertise level label Jun 25, 2024
@github-actions github-actions bot added the build Compilation issues label Jun 25, 2024
@sasha0552

Superseded by nvidia-pstated.

@sasha0552 sasha0552 closed this Jul 26, 2024
mirh commented Sep 15, 2024

I mean.. is it though?
I see why a datacenter GPU that doesn't even have a display output could call it a day with just a rough loop that checks for "any activity at all", but for the best results you would seemingly need some kind of "token cycles awareness".

On top of that, a very nice feature that could also be integrated with direct NvAPI support is bus activity monitoring.
It's supposedly not super accurate, but printing a note when PCIe is busy more than 50 or 70 percent of the time could help many people quickly diagnose when a model is slow not because it is inherently slow, but because it depends on RAM swapping for a non-trivial number of layers.
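
To make the threshold idea concrete, here is a minimal sketch of the check itself. How the busy samples would be obtained is deliberately left out; `busy_fraction` and `should_warn_pcie` are hypothetical helpers, and the input is assumed to be a series of "bus was busy" flags collected at a regular interval.

```cpp
#include <cstddef>
#include <vector>

// Fraction of samples in which the bus was reported busy (0.0 for no samples).
double busy_fraction(const std::vector<bool> & samples) {
    if (samples.empty()) return 0.0;
    std::size_t busy = 0;
    for (bool b : samples) {
        if (b) ++busy;
    }
    return static_cast<double>(busy) / static_cast<double>(samples.size());
}

// True when the bus was busy strictly more than `threshold` of the time.
// The 0.5 default mirrors the "50 percent" figure mentioned above.
bool should_warn_pcie(const std::vector<bool> & samples, double threshold = 0.5) {
    return busy_fraction(samples) > threshold;
}
```

A caller could print the diagnostic note once per run when `should_warn_pcie` returns true, or pass `0.7` for the stricter limit.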
