Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

788 closing mem leaks #791

Merged
merged 4 commits into from
Aug 8, 2024
Merged

Conversation

corepointer
Copy link
Collaborator

The code in this PR solves the memory leak documented in #788.

In addition three other minor memory leaks are closed - see the corresponding commits.

Assigning @pdamme as requested =)

@corepointer corepointer added the bug A mistake in the code. label Jul 19, 2024
@corepointer corepointer added this to the v0.3 milestone Jul 19, 2024
@corepointer corepointer linked an issue Jul 19, 2024 that may be closed by this pull request
@corepointer
Copy link
Collaborator Author

Don't know if that matmul accuracy test crashing has something to do with these changes. On my laptop there were two failing tests ("cartesian, success" and "where, success") and on my scale up box there were no failing tests 🤷‍♂️

test/api/cli/sql/cartesian_success_1.daphne and test/api/cli/sql/where_success_1.daphne both fail in src/compiler/inference/InferencePass.cpp:218 where numCols is 3 and numRows is -5

@pdamme pdamme self-requested a review July 19, 2024 19:17
pdamme
pdamme previously requested changes Jul 23, 2024
Copy link
Collaborator

@pdamme pdamme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix, @corepointer. Closing these memory leaks is very important. Thanks for structuring the PR into four meaningful commits that fix distinct memory leaks. The first three commits (on closeFile(), castCol(), and startDAPHNE()) look totally fine to me.

I only have some comments on the fourth commit on the memory leaks of string scalars. Overall, the solution has the right ingredients and at first glance it looks very good. However, at a closer investigation, it doesn’t behave as expected internally. To illustrate the problems, I pushed a commit with some more prints (the logging facility didn't want to work for me...) and two script-level test cases (strings_1.daphne and strings_2.daphne). I had to comment out the fall-back freeing in ~DaphneContext(); otherwise, the two new test cases print rubbish data.

  • test/api/cli/controlflow/strings_1.daphne

    def foo(s:str) -> str {
        print(s);
        return s + "d";
    }
    
    s = "abc"; # string literal, no kernel call
    s = foo(s);
    print(s);

    bin/daphne --explain obj_ref_mgnt test/api/cli/controlflow/strings_1.daphne prints:

    IR after managing object references:
    module {
      func.func @"foo-1-1"(%arg0: !daphne.String) -> !daphne.String {
        %0 = "daphne.constant"() {value = 123 : si64} : () -> si64
        %1 = "daphne.constant"() {value = "abc"} : () -> !daphne.String
        %2 = "daphne.constant"() {value = 96365382536864 : ui64} : () -> ui64
        %3 = "daphne.constant"() {value = 96365382536736 : ui64} : () -> ui64
        %4 = "daphne.constant"() {value = 140731117186760 : ui64} : () -> ui64
        %5 = "daphne.createDaphneContext"(%4, %3, %2) : (ui64, ui64, ui64) -> !daphne.DaphneContext
        "daphne.decRef"(%arg0) : (!daphne.String) -> ()
        %6 = "daphne.cast"(%0) : (si64) -> !daphne.String
        %7 = "daphne.concat"(%1, %6) : (!daphne.String, !daphne.String) -> !daphne.String
        "daphne.decRef"(%6) : (!daphne.String) -> ()
        "daphne.decRef"(%1) : (!daphne.String) -> ()
        "daphne.destroyDaphneContext"() : () -> ()
        "daphne.return"(%7) : (!daphne.String) -> ()
      }
      func.func @main() {
        %0 = "daphne.constant"() {value = "abc"} : () -> !daphne.String
        %1 = "daphne.constant"() {value = true} : () -> i1
        %2 = "daphne.constant"() {value = false} : () -> i1
        %3 = "daphne.constant"() {value = 96365382536864 : ui64} : () -> ui64
        %4 = "daphne.constant"() {value = 96365382536736 : ui64} : () -> ui64
        %5 = "daphne.constant"() {value = 140731117186760 : ui64} : () -> ui64
        %6 = "daphne.createDaphneContext"(%5, %4, %3) : (ui64, ui64, ui64) -> !daphne.DaphneContext
        "daphne.incRef"(%0) : (!daphne.String) -> ()
        %7 = "daphne.generic_call"(%0) {callee = "foo-1-1"} : (!daphne.String) -> !daphne.String
        "daphne.decRef"(%0) : (!daphne.String) -> ()
        "daphne.print"(%7, %1, %2) : (!daphne.String, i1, i1) -> ()
        "daphne.decRef"(%7) : (!daphne.String) -> ()
        "daphne.destroyDaphneContext"() : () -> ()
        "daphne.return"() : () -> ()
      }
    }
    IncRef: 140731117181116 (not found)
    DecRef: 140731117181116 (not found)
    DecRef: 96365410280384 (found)
    DecRef: 140731117181028 (not found)
    Deleting 1 remaining string refs in ~DaphneContext()
    DecRef: 140731117181116 (not found)
    abc123
    DecRef: 96365410261984 (not found)
    

    The incRef/decRef are in the right places. Note that the value "abc" is constant-propagated into the UDF.

    Problems: From the very beginning, the string is never in the stringRefCount. Thus ìncRef and decRef always ignore it and the fall-back tidy-up code in ~DaphneContext() isn't aware of the string. Effectively, the string is not garbage collected, the memory leak still exists.

  • test/api/cli/controlflow/strings_2.daphne

    def foo(s:str) -> str {
        return s + " is my favorite number";
    }
    
    s = as.str(123); # kernel call (but could be constant folded in the future)
    s = foo(s);
    print(s);

    bin/daphne --explain obj_ref_mgnt test/api/cli/controlflow/strings_2.daphne prints:

    IR after managing object references:
    module {
      func.func @"foo-1"(%arg0: !daphne.String) -> !daphne.String {
        %0 = "daphne.constant"() {value = " is my favorite number"} : () -> !daphne.String
        %1 = "daphne.constant"() {value = 97987952754336 : ui64} : () -> ui64
        %2 = "daphne.constant"() {value = 97987952754208 : ui64} : () -> ui64
        %3 = "daphne.constant"() {value = 140722023857560 : ui64} : () -> ui64
        %4 = "daphne.createDaphneContext"(%3, %2, %1) : (ui64, ui64, ui64) -> !daphne.DaphneContext
        %5 = "daphne.concat"(%arg0, %0) : (!daphne.String, !daphne.String) -> !daphne.String
        "daphne.decRef"(%0) : (!daphne.String) -> ()
        "daphne.decRef"(%arg0) : (!daphne.String) -> ()
        "daphne.destroyDaphneContext"() : () -> ()
        "daphne.return"(%5) : (!daphne.String) -> ()
      }
      func.func @main() {
        %0 = "daphne.constant"() {value = 123 : si64} : () -> si64
        %1 = "daphne.constant"() {value = true} : () -> i1
        %2 = "daphne.constant"() {value = false} : () -> i1
        %3 = "daphne.constant"() {value = 97987952754336 : ui64} : () -> ui64
        %4 = "daphne.constant"() {value = 97987952754208 : ui64} : () -> ui64
        %5 = "daphne.constant"() {value = 140722023857560 : ui64} : () -> ui64
        %6 = "daphne.createDaphneContext"(%5, %4, %3) : (ui64, ui64, ui64) -> !daphne.DaphneContext
        %7 = "daphne.cast"(%0) : (si64) -> !daphne.String
        "daphne.incRef"(%7) : (!daphne.String) -> ()
        %8 = "daphne.generic_call"(%7) {callee = "foo-1"} : (!daphne.String) -> !daphne.String
        "daphne.decRef"(%7) : (!daphne.String) -> ()
        "daphne.print"(%8, %1, %2) : (!daphne.String, i1, i1) -> ()
        "daphne.decRef"(%8) : (!daphne.String) -> ()
        "daphne.destroyDaphneContext"() : () -> ()
        "daphne.return"() : () -> ()
      }
    }
    IncRef: 97987987429184 (found)
    DecRef: 140722023851833 (not found)
    DecRef: 97987987429184 (not found)
    Deleting 1 remaining string refs in ~DaphneContext()
    DecRef: 97987987429184 (found)
    123 is my favorite number
    DecRef: 97987987364976 (not found)
    Deleting 1 remaining string refs in ~DaphneContext()
    

    Again, the incRef/decRef are in the right places.

    Problems: The first DecRef: ... (not found) is because of a ConstantOp, the second/third DecRef: ... (not found) are because the UDF has its own DaphneContext with a fresh stringRefCount. For the same reason, the Deleting ... remaining string refs happens even though it never should.

Fixing these problems is required before we can merge the PR.

I recommend the following steps:

  1. We need one global instance of the stringRefCount map that is shared by all instances of DaphneContext. The solution can be done similar to DaphneContext::config, DaphneContext::dispatchMapping, and DaphneContext::stats, i.e., a single instance of the stringRefCount should be created in daphne_internal.cpp, the pointer to this single instance should be given to the createDaphneContext-kernel, and the DaphneContext should have a reference to the instance. The same holds for the mtxStrRefCnt mutex (maybe we could bundle the map and the mutex in a little helper struct).

  2. We should not register a newly created string scalar in the stringRefCount (with a reference counter of 1). Instead, we should assume that a char * that is not in the stringRefCount implicitly has a reference counter of 1. This approach has the following advantages:

    • We cannot forget registering a string scalar in stringRefCount. Currently, each kernel that creates a string scalar (there will be more such kernels in the future) needs to do that explicitly; a developer can easily forget that. It’s different from data objects, where the reference counter is automatically initialized to 1.
    • We correctly reflect the reference counter of string scalars that are not created by a C++ kernel. This applies to string constants (ConstantOp), strings evaluated through constant folding at compile-time, and may be the case for code-generated kernels in the future.
    • We reduce the number of accesses to the stringRefCounter, thereby spending less time in the critical section that is protected by the mtxStrRefCnt mutex.

    Consequently, we would not need the function DaphneContext::addStrRef() anymore. Furthermore the calls to this function in the individual kernels (castSca, concat) can be removed.

  3. The fall-back code (if(!stringRefCount.empty())) in ~DaphneContext() should be removed. As you point out in a comment, this situation should never happen. If it happens, this means sth went wrong and it might not even be safe to free the remaining string scalars. (The fact that I needed to comment out that code to even make the new test cases run supports that.) However, we could still log that it happened.

  4. The access to the stringRefCount in the IncRef<char> and DecRef<char>-kernels must be synchronized using the strRefCountMtx. While they read the map, the map could be edited by another thread (adding/removed entries). While DecRef<char> edits the map, other threads may read or modify the map.

Optionally, more script-level test cases would be great.

Feel free to let me know your thoughts.

@corepointer
Copy link
Collaborator Author

Thx for the detailed review and the suggestions @pdamme

  • Not relying on manually adding the ref count is indeed making the solution more robust. I was on the wrong train of thought by not assuming that a not-tracked ref could be a ref of 1 to not accidentally delete something that shouldn't be. And that's a lot of "not"s :-D If the objectRefMgmt pass does it's job correctly it should work.
  • The creation/deletion of DaphneContexts in UDFs is not only problematic with the string ref counting. I was dealing with this in e9d977d (dnn-ops PR Dnn ops #734) to not delete the CUDA contexts after running a UDF by making the DaphneContext global. As I was not dealing with vectorized engine code, this solution is not thread safe atm. Apart from that, this solution is rather a quick-fix anyway as we ideally avoid calling createDaphneContext in every UDF.
  • I think the used logger here is the runtime logger. Did you set the log level limit to DEBUG and use something like this in your UserConfig.json?
        {
            "comment": "DAPHNE runtime logs",
            "name": "runtime",
            "level": "DEBUG",
            "filename": "",
            "format": "%^[%n %L]:%$ %v"
        },

@corepointer
Copy link
Collaborator Author

I implemented the requested changes. But it seems that 2) won't work and explains why I went that addStrRef() route the other night: We can only delete dynamically allocated memory of course. What we get when deleting constants that the compiler emits is:

free(): invalid pointer
Signal: SIGABRT (Aborted)

So if we really want to delete what is allocated at compile time, we need a way to distinguish that from the run-time data and destruct correctly. MatrixConstants, besides strings, would also be something where this is needed.

@pdamme
Copy link
Collaborator

pdamme commented Aug 7, 2024

Thanks for continuing with this PR, @corepointer. You are absolutely right, we cannot free the string literals due to the way we allocate them in LowerToLLVMPass, I didn't think about that. There are multiple potential ways to fix this problem. I extended your implementation with one more commit and I think the memory leak is fixed now.

My idea was to always increase the reference counter of a string literal right after its creation. The effect is that the original object is never freed by DAPHNE's garbage collection. However, the memory consumption clearly indicates that MLIR frees it automatically, but I didn't double-check that in the IR. All strings created by kernels (e.g., concat) are properly freed by DAPHNE's garbage collection. At the same time, string literals defined inside UDFs or loop bodies can be used by many function calls or loop iterations (this would have been a challenge with an alternative solution, see below). To my mind, this is a simple trick and we could even apply it to matrix literals later (currently, we unnecessarily copy those for every use). I've implemented this idea on top of your code and pushed a commit.

So essentially, the memory leak is fixed, as far as I can tell. I manually tested it with the following DaphneDSL script:

def foo(i:si64) {
    print("veeery long string" + i); # I used a 10 KB string
}

for(i in 1:100000) {
    foo(i);
}

During the execution of daphne I watched its memory consumption. On main, it goes up to around 1 GB (10 KB * 100k loop iterations), while with the code in this PR, it stays constant at a few MB or so.


Before that, I considered another alternative: We could easily ensure that string literals are backed by strings that were allocated dynamically in C++ at compile-time and can be freed. To this end, we just need to adapt LowerToLLVMPass.cpp: In ConstantOpLowering::matchAndRewrite():

        if(auto strAttr = op.getValue().dyn_cast<StringAttr>()) {
            StringRef sr = strAttr.getValue();
#if 0
            // MLIR does not have direct support for strings. Thus, if this is
            // ...
#elif 1
            Type i8PtrType = LLVM::LLVMPointerType::get(
                    IntegerType::get(rewriter.getContext(), 8)
            );
            char * newString = new char[sr.size() + 1];
            memcpy(newString, sr.str().c_str(), sr.size() + 1);
            uint64_t ptr = reinterpret_cast<uint64_t>(reinterpret_cast<const void *>(newString));
            rewriter.replaceOpWithNewOp<LLVM::IntToPtrOp>(
                op.getOperation(),
                i8PtrType,
                rewriter.create<arith::ConstantOp>(
                    loc,
                    rewriter.getI64IntegerAttr(ptr)
                )
            );
#elif 0
            // ...
#endif
        }

That works, but then we get the same problem we used to have with matrix literals: When a string literal is in a UDF or a loop's body, it gets freed after the first function call or loop iteration and is not there anymore for the second time, leading to a crash. For matrix literals, we solved this by always copying the (small) matrix and leaving the original untouched. I think a similar approach can be done for strings, too. However, such artificial copying is not nice.

corepointer and others added 4 commits August 8, 2024 16:44
This minimal change free's the File struct in closeFile() that has previously been malloc'ed in open*() in our C based File I/O code.
A dynamically allocated temporary data structure is now deleted.
Valgrind reported a memory leak before adding this explicitn moduleOp->destroy()
This commit extends the ManageObjectRefs compiler pass and its runtime kernels to track usage of dynamically allocated string data and delete it after its last use. So far this has been found necessary with string concatenation and casting operations. Strings allocated by the compiler as constants will have their reference counter increased upon creation, which should prevent them from being garbage collected. For a detailed description see issue daphne-eu#788 and discussion in pull request daphne-eu#791.

Fixes daphne-eu#788, Closes daphne-eu#791

Co-authored-by: Patrick Damme <patrick.damme@tu-berlin.de>
@corepointer corepointer dismissed pdamme’s stale review August 8, 2024 19:26

I did some clean ups and ran the tests again. The failing sql tests I mentioned above are not related to this PR as they also occurred on main when running the test suite in debug mode.

@corepointer corepointer merged commit 2211a32 into daphne-eu:main Aug 8, 2024
2 checks passed
@corepointer
Copy link
Collaborator Author

Thx again. Deleting the branch after successful merge 👍

@corepointer corepointer deleted the 788-closing-mem-leaks branch August 8, 2024 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A mistake in the code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak due to missing garbage collection of string scalars
2 participants