Memory Allocation Performance #1416

Open
mdcarr941 opened this issue Nov 26, 2024 · 4 comments

@mdcarr941

mdcarr941 commented Nov 26, 2024

Hello!

I recently implemented the Llama 3.1 model logic with TorchSharp. It works, and I was able to load pretrained bf16 weights using TorchSharp.PyBridge. However, I found that it takes on the order of 60 seconds to instantiate the model on my development machine. By "instantiate" I mean simply creating an instance of the model class, which entails all of the object and tensor allocations but does not include the IO time to read model weights from disk (which is actually really fast with my SSD). Note that these tensor allocations happen on the CPU device; I only transfer to the GPU after I've loaded the model weights. I observed the same behavior on Linux and Windows, and I determined that it is TorchSharp-specific, because the (roughly) equivalent PyTorch code instantiates much faster.
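For reference, the boundary I'm timing looks roughly like this (a minimal sketch; Llama31Model is a stand-in name for my actual module class, and the TorchSharp.PyBridge weight loading is elided):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

var sw = Stopwatch.StartNew();
using var model = new Llama31Model();        // all CPU tensor allocations happen here (~60 s)
Console.WriteLine($"construct: {sw.Elapsed.TotalSeconds:F1} s");

sw.Restart();
// weight loading via TorchSharp.PyBridge goes here -- fast, not the problem
Console.WriteLine($"load:      {sw.Elapsed.TotalSeconds:F1} s");

sw.Restart();
model.to(CUDA);                              // only after loading do tensors move to the GPU
Console.WriteLine($"transfer:  {sw.Elapsed.TotalSeconds:F1} s");
```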

I profiled my program on Windows to get an idea of what's taking so long, and it appears the process is spending almost all of its time in the kernel:
[Screenshot: profiler trace showing nearly all CPU time attributed to kernel-mode calls]
I haven't dug into this further, but my hunch is that a ton of small allocations are being performed. I remember reading somewhere (I think it was in the PyTorch docs) about how PyTorch caches memory and uses the cache to respond to allocation requests from user code. Model instantiation performance isn't a huge problem for me, because I only need to instantiate the model once in the lifetime of my application, but it would be nice to reach performance parity with Python.

Would it be possible to implement a similar allocation system in TorchSharp, and avoid the problem of context switching into the OS for lots of small allocations? By "possible" I just mean technically possible, I'm not looking for a commitment.

Finally, if it is possible what would be the right way to go about it?
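To make the idea concrete, what I have in mind is a size-bucketed free list sitting in front of the native allocator, so that steady-state allocations are served from the cache instead of a malloc/kernel round trip each time. A minimal illustrative sketch (this is not how libtorch or TorchSharp allocate today, and it ignores alignment, fragmentation, and cache trimming):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Sketch of a caching allocator: freed blocks are parked in per-size free
// lists and handed back to later requests, so the OS is only hit on a miss.
public static unsafe class CachingCpuAllocator
{
    private static readonly ConcurrentDictionary<nuint, ConcurrentBag<IntPtr>> FreeLists = new();

    public static IntPtr Rent(nuint bytes)
    {
        if (FreeLists.TryGetValue(bytes, out var bag) && bag.TryTake(out var cached))
            return cached;                            // cache hit: no kernel transition
        return (IntPtr)NativeMemory.Alloc(bytes);     // cache miss: fall back to the OS
    }

    public static void Return(IntPtr ptr, nuint bytes) =>
        FreeLists.GetOrAdd(bytes, _ => new ConcurrentBag<IntPtr>()).Add(ptr);
}
```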

@NiklasGustafsson
Contributor

Hmmm.

I thought the pre-allocation was done by libtorch (it certainly is on the CUDA side of things), which would benefit both TorchSharp and PyTorch.

I wonder if it has anything to do with the .NET JIT -- first time hitting certain code paths, etc. Still, 60s is a lot of time.

Can you try warming it up and measuring a second (full) creation of the model, to see if that takes as much time?
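Something like this (a sketch; Llama31Model is a placeholder for your model class):

```csharp
using System;
using System.Diagnostics;

// Time two full constructions: if the second is much faster, JIT/first-touch
// costs dominate; if both are about equal, the per-allocation cost is the culprit.
for (var run = 1; run <= 2; run++)
{
    var sw = Stopwatch.StartNew();
    using var model = new Llama31Model();   // stand-in for the real module class
    sw.Stop();
    Console.WriteLine($"run {run}: {sw.Elapsed.TotalSeconds:F1} s");
}
```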

@mdcarr941
Author

Interestingly, it takes nearly the same amount of time to instantiate the model a second time. Both instantiations take about 46 seconds. I ran the test a couple times, and the run to run variance was on the order of one second.

I don't think the JIT is the culprit, because that code runs in userspace, but the profile shows that nearly all of the time is spent in kernelspace.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Dec 5, 2024

Good point, I didn't consider that it was kernel mode time, even though you said so.

Second WAG:

I wonder if it has anything to do with the fact that .NET doesn't have native support for bf16, so there has to be some conversion of weights done at runtime. Could it be that it's done in kernel mode? Probably not.

@mdcarr941
Author

Yeah, the kernel isn't responsible for doing type conversions. It just gives bytes to your process; it's up to the process to handle them appropriately.
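For example, widening bf16 to float32 is just a 16-bit shift of the bit pattern, all in user space (a minimal sketch, not necessarily how TorchSharp implements the conversion):

```csharp
using System;

// bfloat16 is the upper 16 bits of an IEEE-754 float32, so widening it is a
// pure user-space bit shift -- nothing here ever enters the kernel.
static float BFloat16ToFloat(ushort bits) =>
    BitConverter.Int32BitsToSingle(bits << 16);

Console.WriteLine(BFloat16ToFloat(0x3F80));   // prints 1
```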

I did rerun my test using ScalarTypes Float16 and Float32. The Float16 test was similar to the Bfloat16 test, with 48 then 45 seconds to instantiate. The Float32 test results were 43 then 44 seconds. What's interesting is that the time did not scale with the amount of memory allocated, which makes sense if the number of allocations, rather than their size, is the bottleneck. This suggests that an in-process cache would speed things up.
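A microbenchmark along these lines isolates the allocation-count effect by comparing many small tensors against one big tensor of the same total element count (a sketch using the default float32 dtype; exact timings will vary):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

// Same total number of float32 elements either way; only the allocation count differs.
const long count = 100_000;
const long small = 1_024;

var sw = Stopwatch.StartNew();
for (long i = 0; i < count; i++)
{
    using var t = zeros(small);               // many small CPU tensors
}
Console.WriteLine($"{count} small allocs: {sw.ElapsedMilliseconds} ms");

sw.Restart();
using (var big = zeros(count * small))        // one big CPU tensor
{
}
Console.WriteLine($"one big alloc:       {sw.ElapsedMilliseconds} ms");
```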

I feel I should provide a little more justification for why this is worth anyone's time, because just beating Python may not motivate everyone. The main reason I want this is so I can run my unit tests faster. The model instantiation latency creates a lower bound on the amount of time it takes to run any test that requires the model. So if I want to test a token sampler, or reinforcement learning environment, I must wait at least 45 seconds every time, which gets old when I'm running such tests on the order of 50 times a day.

Another use case this would benefit is CLI applications. Because these apps run interactively, the user feels the latency imposed by model instantiation. Improving this performance could make AI-powered CLI tools written with TorchSharp viable.
