device-plugin goes into CrashLoopBackOff after a node restart when the NVIDIA driver is installed by gpu-operator #136

Open
lengrongfu opened this issue Jan 24, 2024 · 3 comments · May be fixed by #157

Comments

@lengrongfu
Member

  1. After the node reboots, gpu-operator reinstalls the NVIDIA driver.
  2. The device-plugin pod starts at the same time, before the driver installation has completed, so the pod goes into CrashLoopBackOff.

We can add an init container (initContainers) that checks for the NVIDIA driver to resolve this problem.
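
To make the idea concrete, here is a minimal sketch of the wait logic such an init container could run. It is not the implementation from the linked pull request. It assumes that `nvidia-smi` is reachable on the container's `PATH` and that a zero exit status means the driver is usable. The names `driver_ready` and `wait_for_driver`, and the timeout and interval values, are hypothetical.

```python
import shutil
import subprocess
import time
from typing import Callable


def driver_ready() -> bool:
    """Return True if nvidia-smi is found on PATH and exits with status 0.

    Assumption: a successful nvidia-smi run means the driver is installed
    and loaded. A real check might look at something else entirely.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_for_driver(
    check: Callable[[], bool] = driver_ready,
    timeout_s: float = 600.0,   # hypothetical upper bound on the wait
    interval_s: float = 5.0,    # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

An entrypoint could call `sys.exit(0 if wait_for_driver() else 1)`, so the init container exits non-zero when the driver never becomes ready and the main device-plugin container is not started until a later retry succeeds.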

@lengrongfu
Member Author

/assign

@lengrongfu lengrongfu linked a pull request Feb 10, 2024 that will close this issue

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@lengrongfu
Member Author

Use cases:

  • The user installs the driver and toolkit with gpu-operator, then the node reboots: the hami-device-plugin pod can restart successfully.
  • The user installs the driver and toolkit with gpu-operator, then deletes the driver pod so that a new driver is installed: the hami-device-plugin pod can restart successfully.
  • The user installs the driver and toolkit manually, then the node reboots: the hami-device-plugin pod can restart successfully.
