GPU usage by Kilosort4 when running with run_sorter_by_property #3591

Open
jazlynntan opened this issue Dec 19, 2024 · 4 comments
Labels
concurrency Related to parallel processing

Comments

@jazlynntan

Hello,

I'm running Kilosort4 for a single shank using run_sorter_by_property(). Running Kilosort4 independently in the same conda environment, the same data (with all 4 shanks) took about 1.5 h. However, a single shank within SpikeInterface took about 6 h, which leads me to suspect that the GPU is not being used.

This is the output while kilosort within spikeinterface was running:
[Screenshot: system resource monitor captured while Kilosort4 was running inside SpikeInterface]

The GPU memory seems to be used by the process, but the speed suggests that the GPU is not being used for computation. Meanwhile, CPU usage appeared to be maxed out.
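A minimal check, not from the original report, to confirm that PyTorch inside the environment SpikeInterface runs in can actually see the GPU (assuming torch is the same installation Kilosort4 uses):

    import torch

    # If this prints False, Kilosort4 will fall back to the CPU.
    print(torch.cuda.is_available())
    if torch.cuda.is_available():
        # Name of the CUDA device Kilosort4 would use
        print(torch.cuda.get_device_name(0))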

This is the code I'm using:

    sorted = si.run_sorter_by_property(
        'kilosort4',
        shank1,
        grouping_property='group',
        folder=os.path.join('shank1_output'),
        verbose=True,
        engine="joblib",
        engine_kwargs={"n_jobs": 16},
        **params_kilosort4,
    )

I tried using 'auto' and 'cuda' for the torch_device parameter, but both showed the same issue.
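For reference, a sketch of what that setting looks like, assuming params_kilosort4 is the dict of sorter parameters forwarded to Kilosort4:

    # Hedged sketch: force Kilosort4 onto the CUDA device via the sorter parameters.
    params_kilosort4 = dict(params_kilosort4)   # copy the existing parameters
    params_kilosort4["torch_device"] = "cuda"   # default is "auto"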

May I know if I am doing something wrong? Thank you!

@JoeZiminski
Collaborator

Hi @jazlynntan, I have no idea and this is just a quick guess, but I'm not sure what happens regarding GPU access when parallelising multiple sortings over separate cores. Presumably the separate processes are all attempting to compute on the GPU, but from the runtime it doesn't seem like they are accessing it in any useful way. It might be worth testing with n_jobs=1 to see if this results in the GPU being used.

@zm711
Collaborator

zm711 commented Dec 19, 2024

That was Sam's idea too in another issue (I forget which one). He recommended using engine='loop' instead, so that when n_jobs > 1 the sortings run serially rather than all jobs trying to access the GPU at the same time.
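A minimal sketch of the suggested call, assuming the same arguments as the snippet above with only the engine changed (the 'loop' engine runs the per-group sortings one after another):

    # Hedged sketch: run each group serially so only one process touches the GPU at a time.
    sorted = si.run_sorter_by_property(
        'kilosort4',
        shank1,
        grouping_property='group',
        folder='shank1_output',
        verbose=True,
        engine="loop",
        **params_kilosort4,
    )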

zm711 added the concurrency (Related to parallel processing) label Dec 19, 2024
@zm711
Collaborator

zm711 commented Dec 19, 2024

Also, we might need to add a note to our docs explaining that joblib might not play well with GPU-based sorters. Not sure, but this is the second issue related to this.

@jazlynntan
Author

Hi, I tried the first suggestion:

    sorted = si.run_sorter_by_property(
        'kilosort4',
        shank1,
        grouping_property='group',
        folder=os.path.join('shank1_output'),
        verbose=True,
        engine="joblib",
        engine_kwargs={"n_jobs": 1},
        **params_kilosort4,
    )

I think the same problem persists: GPU memory is used, but the GPU does not appear to be used for computation. The whole sorting for the single shank took about 5 h, and the resource report is as follows:

INFO:kilosort.run_kilosort:********************************************************
INFO:kilosort.run_kilosort:CPU usage:    18.80 %
INFO:kilosort.run_kilosort:Memory:       23.01 %     |     28.90   /   125.60 GB
INFO:kilosort.run_kilosort:------------------------------------------------------
INFO:kilosort.run_kilosort:GPU usage:    `conda install pynvml` for GPU usage
INFO:kilosort.run_kilosort:GPU memory:   44.24 %     |     10.47   /    23.67 GB
INFO:kilosort.run_kilosort:Allocated:     0.04 %     |      0.01   /    23.67 GB
INFO:kilosort.run_kilosort:Max alloc:     3.73 %     |      0.88   /    23.67 GB
INFO:kilosort.run_kilosort:********************************************************
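As a side note, a sketch of how GPU compute utilization could be read directly with pynvml (the package the Kilosort4 log above asks for), assuming device index 0 is the card in question:

    # Hedged sketch: query GPU utilization while the sorting is running.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU compute utilization: {util.gpu} %")
    print(f"GPU memory utilization:  {util.memory} %")
    pynvml.nvmlShutdown()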

I'm now attempting to use 'loop' for the engine with 16 jobs. I'll update again when it's done.
