TorchSharp memory issue #1278

Open
lintao185 opened this issue Apr 2, 2024 · 19 comments

@lintao185

import torch

m = torch.nn.Conv2d(3, 64, 7, 2, 3, bias=False).cuda()

for i in range(1000000):
    x = torch.randn(1, 3, 224, 224, dtype=torch.float).cuda()
    y = m.forward(x)

image

using TorchSharp;

var m = torch.nn.Conv2d(3, 64, 7, 2, 3, bias: false).cuda();

for (int i = 0; i < 1000000; i++)
{
    // x and y are never disposed explicitly in this loop.
    var x = torch.randn(1, 3, 224, 224).@float().cuda();
    var y = m.forward(x);
}

image
In PyTorch, GPU memory used during inference is released automatically at the appropriate time. In TorchSharp, the equivalent inference loop leaks GPU memory unless the tensors are released manually.
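
A rough way to picture the difference (a sketch, not a description of PyTorch internals): CPython frees an object as soon as its last reference disappears, so rebinding `x` and `y` on each iteration releases the previous tensors right away. A .NET wrapper around native memory is generally reclaimed only when the garbage collector runs or when it is disposed explicitly. `FakeTensor` and `run_loop` below are hypothetical stand-ins, not PyTorch or TorchSharp names.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor that owns native memory."""

    live = 0  # number of instances currently alive

    def __init__(self):
        FakeTensor.live += 1

    def __del__(self):
        # CPython calls this as soon as the reference count drops to zero.
        FakeTensor.live -= 1


def run_loop(iterations):
    """Rebind x on every iteration and record the peak number of live objects."""
    peak = 0
    for _ in range(iterations):
        x = FakeTensor()  # rebinding x drops the object from the previous iteration
        peak = max(peak, FakeTensor.live)
    return peak


peak_live = run_loop(1000)
```

Under CPython, `peak_live` stays at 1 however many iterations run, which mirrors the flat memory curve of the PyTorch loop. Other Python implementations may release objects later.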

@lintao185
Author

var m = torch.nn.Conv2d(3, 64, 7, 2, 3, bias: false).cuda();
for (int i = 0; i < 1000000; i++)
{
    using var x = torch.randn(1, 3, 224, 224).@float().cuda();
    using var y = m.forward(x);
}

However, this is the only way it can be written.

@yueyinqiu
Contributor

Perhaps this page could help: https://github.com/dotnet/TorchSharp/wiki/Memory-Management

@lintao185
Author

torch.NewDisposeScope() is a relatively elegant solution, although not as elegant as PyTorch.

@lintao185
Author

I'm considering giving up, because unexpected exceptions occur when torch.NewDisposeScope is nested, especially when one function calls another and the called function also opens a torch.NewDisposeScope. Objects that shouldn't be disposed are being released. Training AI models with C# doesn't seem very realistic, which is quite frustrating.

@yueyinqiu
Contributor

yueyinqiu commented Apr 3, 2024

I'm sorry to hear that. But DisposeScope is designed to work in that case. Could you describe the issue more specifically so that we can fix it?

Oh... Since you mentioned that 'objects that shouldn't be disposed of are being released', I guess MoveToOuterDisposeScope could work?

@yueyinqiu
Contributor

yueyinqiu commented Apr 3, 2024

It might be because f is not disposed. Hope this could help:

using TorchSharp;

for (long i = 0; i < 10000000000; i++)
{
    using (torch.NewDisposeScope())
    {
        var f = torch.randn(1, 3, 224, 224).@float().cuda();
        using (torch.NewDisposeScope())
        {
            var f3 = torch.randn(1, 3, 224, 224).@float().cuda();
            f[..] = f3;
        }
    }
}
Console.ReadKey();

By the way, are you a Chinese user? I have just created a QQ group (957204993) where we could perhaps discuss this via instant messages, which might be more convenient.

@lintao185
Author

using TorchSharp;
using static TorchSharp.torch;

public static class Ops
{
    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.MoveToOuterDisposeScope();
        }
    }

    public static Tensor scale_boxes(int[] img1_shape, Tensor boxes, int[] img0_shape, (int, int)[] ratio_pad = null!, bool padding = true, bool xywh = false)
    {
        using (torch.NewDisposeScope())
        {
            double gain;
            (double, double) pad;
            if (ratio_pad == null)
            {
                gain = Math.Min(img1_shape[0] * 1.0 / img0_shape[0], img1_shape[1] * 1.0 / img0_shape[1]);
                pad = (
                    Math.Round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1),
                    Math.Round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)
                );
            }
            else
            {
                gain = ratio_pad[0].Item1;
                pad = ratio_pad[1];
            }

            if (padding)
            {
                boxes[TensorIndex.Ellipsis, 0] -= pad.Item1;
                boxes[TensorIndex.Ellipsis, 1] -= pad.Item2;
                if (!xywh)
                {
                    boxes[TensorIndex.Ellipsis, 2] -= pad.Item1;
                    boxes[TensorIndex.Ellipsis, 3] -= pad.Item2;
                }
            }

            boxes[TensorIndex.Ellipsis, ..4] /= gain;
            return clip_boxes(boxes, img0_shape).MoveToOuterDisposeScope();
        }
    }
}
public abstract class OutputData : IDisposable
{
    public abstract void Dispose();

    public abstract List<dynamic> ToList();
    public abstract OutputData MoveToOuterDisposeScope();
    ~OutputData()
    {
        Dispose();
    }
}
public class DetectPredictData : OutputData
{
    public Tensor Y { get; set; }
    public List<Tensor> X { get; set; }

    public override void Dispose()
    {
        Y?.Dispose();
        X?.ForEach(x => x.Dispose());
    }

    public override OutputData MoveToOuterDisposeScope()
    {
        Y?.MoveToOuterDisposeScope();
        X.ForEach(x => x.MoveToOuterDisposeScope());
        return this;
    }

    public override List<dynamic> ToList()
    {
        return [Y,X];
    }
}
for (int i = 0; i < 10000000; i++)
{
    using (torch.NewDisposeScope())
    {
        OutputData data = new DetectPredictData()
        {
            Y = torch.randn(3, 84, 8400).@float().cuda(),
            X = [
                torch.randn(3, 144, 80,80).@float().cuda(),
                torch.randn(3, 288, 40,40).@float().cuda(),
                torch.randn(3, 576, 20,20).@float().cuda(),
                ]
        };
        var pre = data.ToList();
        using (torch.NewDisposeScope())
        {
            Tensor f = pre[0];
            int[]? f1 = [1080, 2];
            var boxes = Ops.scale_boxes([564, 640], f[.., ..4], f1);
        }
        data.Dispose();
    }
}

image
After a day's work, I've located where the memory leak occurs. The code above is a simulated reproduction of it. The simulation reproduces about 80% of the behavior; the remaining 20% involves phenomena I can't explain. As a result, the fix for the simulated code does not work in my project, and conversely, the fix for my project does not work in the simulated code. It's quite awkward!

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

I suppose that Ops.clip_boxes and Ops.scale_boxes should not invoke MoveToOuterDisposeScope().

That's because boxes is created here:

image

So its related dispose scope is:

image

After MoveToOuterDisposeScope is used once, its dispose scope will be:

image

And after it is used twice, there is no dispose scope left for it, so it leaks.

(Only tensors/parameters created inside a dispose scope are automatically attached to it, and in-place operations do not change a tensor's dispose scope.)
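
The ownership rule described in this comment can be traced with a small toy model. This is only a sketch of the behaviour as explained above, not TorchSharp's implementation: `Scope`, `Handle`, `move_to_outer` and `close` are hypothetical names, and the scope labels are assumptions about which `using` block in the reproduction each one corresponds to.

```python
class Scope:
    """Toy dispose scope: owns the handles attached to it."""

    def __init__(self, name, parent):
        self.name = name
        self.parent = parent
        self.owned = []


class Handle:
    """Toy tensor handle that remembers which scope owns it."""

    def __init__(self, name, scope):
        self.name = name
        self.scope = scope
        self.disposed = False
        scope.owned.append(self)

    def move_to_outer(self):
        """Hand ownership to the owning scope's parent, or to nobody if there is none."""
        if self.scope is None:
            return self
        self.scope.owned.remove(self)
        self.scope = self.scope.parent
        if self.scope is not None:
            self.scope.owned.append(self)
        return self


def close(scope):
    """Leaving a scope disposes whatever it still owns."""
    for handle in scope.owned:
        handle.disposed = True
    scope.owned.clear()


loop_scope = Scope("loop", parent=None)             # outer using-block of the loop
call_scope = Scope("call site", parent=loop_scope)  # inner using-block around the call
boxes = Handle("boxes", call_scope)                 # the argument is created at the call site
trace = [boxes.scope.name]

scale_scope = Scope("scale_boxes", parent=call_scope)
clip_scope = Scope("clip_boxes", parent=scale_scope)

boxes.move_to_outer()            # first move, inside clip_boxes
trace.append(boxes.scope.name)
close(clip_scope)

boxes.move_to_outer()            # second move, inside scale_boxes
trace.append(boxes.scope)        # None: no scope owns the handle any more
close(scale_scope)
close(call_scope)
close(loop_scope)

leaked = not boxes.disposed
```

In this model the handle moves from the call-site scope to the loop scope and then to no scope at all, so closing every scope never disposes it. The scopes opened inside the two functions never own the handle, because it was not created in them.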

@lintao185
Author

Yes, the leak in the simulated code can be resolved by removing MoveToOuterDisposeScope(), but the code where the actual memory leak occurs cannot be fixed this way. There I need to modify the code as follows, which is very confusing to me.

   public static Tensor clip_boxes(Tensor boxes, int[] shape)
   {
       using (torch.NewDisposeScope())
       {
           boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clone().clamp(0, shape[1]);
           boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clone().clamp(0, shape[0]);
           boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clone().clamp(0, shape[1]);
           boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clone().clamp(0, shape[0]);
           return boxes.MoveToOuterDisposeScope();
       }
   }

The only change is adding .clone(), which is very bizarre.

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

You could just remove that:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes;
        }
    }

I suppose there is no problem with this. Is there anything else you are worried about?

@lintao185
Author

No no no, your code will cause a memory leak in my project, but adding .clone() fixes it. However, the simulated code still leaks memory even with .clone() added. Please trust me, there is still an issue with torch.NewDisposeScope().

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

Actually, your clone cannot solve this problem. You could apply it to the returned tensor instead:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.clone().MoveToOuterDisposeScope();
        }
    }

Or:

    public static Tensor clip_boxes(Tensor boxes, int[] shape)
    {
        using (torch.NewDisposeScope())
        {
            boxes = boxes.clone();
            boxes[TensorIndex.Ellipsis, 0] = boxes[TensorIndex.Ellipsis, 0].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 1] = boxes[TensorIndex.Ellipsis, 1].clamp(0, shape[0]);
            boxes[TensorIndex.Ellipsis, 2] = boxes[TensorIndex.Ellipsis, 2].clamp(0, shape[1]);
            boxes[TensorIndex.Ellipsis, 3] = boxes[TensorIndex.Ellipsis, 3].clamp(0, shape[0]);
            return boxes.MoveToOuterDisposeScope();
        }
    }

But I suppose there is no reason to use clone and keep MoveToOuterDisposeScope. Could something else in your project be causing the memory leak?
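
The two safe patterns mentioned in this thread (return the caller's tensor untouched, or clone inside the scope and move the clone out once) can be compared in a toy model. This is a hedged sketch of the rule as described here, not TorchSharp code: `new_scope`, `create`, `move_to_outer` and `clip` are hypothetical helpers, and handles are represented by plain strings.

```python
from contextlib import contextmanager

stack = []        # currently open toy scopes, innermost last
disposed = set()  # names of handles that have been released


@contextmanager
def new_scope():
    """Toy dispose scope: releases whatever it still owns on exit."""
    owned = []
    stack.append(owned)
    try:
        yield owned
    finally:
        stack.pop()
        disposed.update(owned)


def create(name):
    """A new handle attaches to the innermost open scope."""
    stack[-1].append(name)
    return name


def move_to_outer(name):
    """Move a handle from the scope that owns it to the enclosing scope, if any."""
    for depth, owned in enumerate(stack):
        if name in owned:
            owned.remove(name)
            if depth > 0:
                stack[depth - 1].append(name)
            break
    return name


def clip(boxes, clone_first):
    with new_scope():
        create("temp")  # stands in for intermediates made by indexing and clamp
        if clone_first:
            # The clone is created in this scope, so a single move is appropriate.
            return move_to_outer(create("clone"))
        return boxes  # caller-owned: leave its scope alone


with new_scope():
    original = create("boxes")
    same = clip(original, clone_first=False)
    copy = clip(original, clone_first=True)
    alive_inside = {"boxes", "clone"} - disposed

leaked = {"boxes", "clone", "temp"} - disposed
```

In both variants the intermediates are released when the function's scope closes, the returned handle is still alive in the caller's scope, and nothing is left over once the caller's scope closes.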

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

By the way, you could track the tensor's dispose scope here:

a9b3f111243da9798f874c27c6189d22

Hope this could help when debugging.

@lintao185
Author

@yueyinqiu
Contributor

Hmm I'm really not sure about that. Is it possible to share the whole project with me?

@lintao185
Author

Sorry about that, it’s not convenient at the moment.

@yueyinqiu
Contributor

yueyinqiu commented Apr 4, 2024

My only guess is that, because of the higher host memory usage (RAM, not GPU memory), the garbage collector is activated and the escaped tensors are released as a result?

@lintao185
Author

Not too sure, haven't found the exact cause yet, it's a bit odd.

@lintao185
Author

lintao185 commented Apr 4, 2024

 public static Tensor clip_boxes(Tensor boxes, int[] shape)
 {
     using (torch.NewDisposeScope())
     {
         boxes[TensorIndex.Ellipsis, 0].clamp_(0, shape[1]);
         boxes[TensorIndex.Ellipsis, 1].clamp_(0, shape[0]);
         boxes[TensorIndex.Ellipsis, 2].clamp_(0, shape[1]);
         boxes[TensorIndex.Ellipsis, 3].clamp_(0, shape[0]);
         return boxes;
     }

 }

This can also solve the problem of memory leaks.
