Memory Allocation Performance #1416

Open
mdcarr941 opened this issue Nov 26, 2024 · 4 comments

@mdcarr941

mdcarr941 commented Nov 26, 2024

Hello!

I recently implemented the Llama 3.1 model logic with TorchSharp. It works, and I was able to load pretrained bf16 weights using TorchSharp.PyBridge. However, I found that it takes on the order of 60 seconds to instantiate the model on my development machine. By "instantiate" I mean simply creating an instance of the model class, which entails all of the object and tensor allocations but does not include the IO time to read model weights from disk (which is actually really fast with my SSD). Note that these tensor allocations happen on the CPU device; I only transfer to the GPU after I've loaded the model weights. I observed the same behavior on Linux and Windows, and I determined that it is TorchSharp-specific, because the (roughly) equivalent PyTorch code instantiates much faster.
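For reference, the boundary I'm timing looks roughly like this (a minimal sketch; Llama31Model is a stand-in name for my actual module class, and the TorchSharp.PyBridge weight loading is elided):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

var sw = Stopwatch.StartNew();
using var model = new Llama31Model();        // all CPU tensor allocations happen here (~60 s)
Console.WriteLine($"construct: {sw.Elapsed.TotalSeconds:F1} s");

sw.Restart();
// weight loading via TorchSharp.PyBridge goes here -- fast, not the problem
Console.WriteLine($"load:      {sw.Elapsed.TotalSeconds:F1} s");

sw.Restart();
model.to(CUDA);                              // only after loading do tensors move to the GPU
Console.WriteLine($"transfer:  {sw.Elapsed.TotalSeconds:F1} s");
```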

I profiled my program on Windows to get an idea of what's taking so long, and it appears the process is spending almost all of its time in the kernel:
[Screenshot: profiler trace showing nearly all CPU time attributed to kernel-mode calls]
I haven't dug into this further, but my hunch is that a ton of small allocations are being performed. I remember reading somewhere (I think it was in the PyTorch docs) about how PyTorch caches memory and uses the cache to respond to allocation requests from user code. Model instantiation performance isn't a huge problem for me, because I only need to instantiate the model once in the lifetime of my application, but it would be nice to reach performance parity with Python.

Would it be possible to implement a similar allocation system in TorchSharp, and avoid the problem of context switching into the OS for lots of small allocations? By "possible" I just mean technically possible, I'm not looking for a commitment.

Finally, if it is possible what would be the right way to go about it?
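To make the idea concrete, what I have in mind is a size-bucketed free list sitting in front of the native allocator, so that steady-state allocations are served from the cache instead of a malloc/kernel round trip each time. A minimal illustrative sketch (this is not how libtorch or TorchSharp allocate today, and it ignores alignment, fragmentation, and cache trimming):

```csharp
using System;
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

// Sketch of a caching allocator: freed blocks are parked in per-size free
// lists and handed back to later requests, so the OS is only hit on a miss.
public static unsafe class CachingCpuAllocator
{
    private static readonly ConcurrentDictionary<nuint, ConcurrentBag<IntPtr>> FreeLists = new();

    public static IntPtr Rent(nuint bytes)
    {
        if (FreeLists.TryGetValue(bytes, out var bag) && bag.TryTake(out var cached))
            return cached;                            // cache hit: no kernel transition
        return (IntPtr)NativeMemory.Alloc(bytes);     // cache miss: fall back to the OS
    }

    public static void Return(IntPtr ptr, nuint bytes) =>
        FreeLists.GetOrAdd(bytes, _ => new ConcurrentBag<IntPtr>()).Add(ptr);
}
```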

@NiklasGustafsson
Contributor

Hmmm.

I thought the pre-allocation was done by libtorch (it certainly is on the CUDA side of things), which would benefit both TorchSharp and PyTorch.

I wonder if it has anything to do with the .NET JIT -- first time hitting certain code paths, etc. Still, 60s is a lot of time.

Can you try warming it up and measuring a second (full) creation of the model, to see if that takes as much time?
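Something like this (a sketch; Llama31Model is a placeholder for your model class):

```csharp
using System;
using System.Diagnostics;

// Time two full constructions: if the second is much faster, JIT/first-touch
// costs dominate; if both are about equal, the per-allocation cost is the culprit.
for (var run = 1; run <= 2; run++)
{
    var sw = Stopwatch.StartNew();
    using var model = new Llama31Model();   // stand-in for the real module class
    sw.Stop();
    Console.WriteLine($"run {run}: {sw.Elapsed.TotalSeconds:F1} s");
}
```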

@mdcarr941
Author

Interestingly, it takes nearly the same amount of time to instantiate the model a second time. Both instantiations take about 46 seconds. I ran the test a couple times, and the run to run variance was on the order of one second.

I don't think the JIT is the culprit, because that code runs in userspace, but the profile shows that nearly all of the time is spent in kernelspace.

@NiklasGustafsson
Contributor

NiklasGustafsson commented Dec 5, 2024

Good point, I didn't consider that it was kernel mode time, even though you said so.

Second WAG:

I wonder if it has anything to do with the fact that .NET doesn't have native support for bf16, so there has to be some conversion of weights done at runtime. Could it be that it's done in kernel mode? Probably not.

@mdcarr941
Author

Yeah, the kernel isn't responsible for doing type conversions. It just gives bytes to your process; it's up to the process to handle them appropriately.
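For example, widening bf16 to float32 is just a 16-bit shift of the bit pattern, all in user space (a minimal sketch, not necessarily how TorchSharp implements the conversion):

```csharp
using System;

// bfloat16 is the upper 16 bits of an IEEE-754 float32, so widening it is a
// pure user-space bit shift -- nothing here ever enters the kernel.
static float BFloat16ToFloat(ushort bits) =>
    BitConverter.Int32BitsToSingle(bits << 16);

Console.WriteLine(BFloat16ToFloat(0x3F80));   // prints 1
```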

I did rerun my test using ScalarTypes Float16 and Float32. The Float16 test was similar to the Bfloat16 test, with 48 then 45 seconds to instantiate. The Float32 test results were 43 then 44 seconds. What's interesting is that the time did not scale with the amount of memory allocated, which makes sense if the number of allocations, rather than their size, is the bottleneck. This suggests that an in-process cache would speed things up.
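A microbenchmark along these lines isolates the allocation-count effect by comparing many small tensors against one big tensor of the same total element count (a sketch using the default float32 dtype; exact timings will vary):

```csharp
using System;
using System.Diagnostics;
using TorchSharp;
using static TorchSharp.torch;

// Same total number of float32 elements either way; only the allocation count differs.
const long count = 100_000;
const long small = 1_024;

var sw = Stopwatch.StartNew();
for (long i = 0; i < count; i++)
{
    using var t = zeros(small);               // many small CPU tensors
}
Console.WriteLine($"{count} small allocs: {sw.ElapsedMilliseconds} ms");

sw.Restart();
using (var big = zeros(count * small))        // one big CPU tensor
{
}
Console.WriteLine($"one big alloc:       {sw.ElapsedMilliseconds} ms");
```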

I feel I should provide a little more justification for why this is worth anyone's time, because just beating Python may not motivate everyone. The main reason I want this is so I can run my unit tests faster. The model instantiation latency creates a lower bound on the amount of time it takes to run any test that requires the model. So if I want to test a token sampler, or reinforcement learning environment, I must wait at least 45 seconds every time, which gets old when I'm running such tests on the order of 50 times a day.

Another use case this would benefit is CLI applications. Because these apps run interactively, the user feels the latency imposed by model instantiation. Improving this performance could make AI-powered CLI tools written with TorchSharp viable.
