This is an archive of the discontinued LLVM Phabricator instance.

Fix tsan problem where the per-thread shared_ptr() can be locked right before the cache is destroyed causing a race where it tries to remove an entry from a destroyed cache.
ClosedPublic

Authored by zhangqiaorjc on Jan 23 2023, 11:28 AM.

Diff Detail

Event Timeline

zhangqiaorjc created this revision.Jan 23 2023, 11:28 AM
zhangqiaorjc requested review of this revision.Jan 23 2023, 11:28 AM
zhangqiaorjc added a reviewer: rriddle.

This is originally from Parker Schuh. I'm just helping him creating this commit.

Can you include more context in the description of the patch?

zhangqiaorjc retitled this revision from TSAN bug fix. to Fix tsan problem where the per-thread shared_ptr() can be locked right before the cache is destroyed causing a race where it tries to remove an entry from a destroyed cache..Jan 23 2023, 1:05 PM

Can you include more context in the description of the patch?

I updated the title to "Fix tsan problem where the per-thread shared_ptr() can be locked right before the cache is destroyed causing a race where it tries to remove an entry from a destroyed cache."

zhangqiaorjc edited reviewers, added: burmako; removed: ezhulenev.Jan 23 2023, 2:46 PM
zhangqiaorjc edited reviewers, added: jpienaar; removed: burmako.Jan 26 2023, 10:39 AM

Could someone help review this patch? It's caught by tsan for jax tests and we verified the fix internally.

rriddle accepted this revision.Jan 26 2023, 4:33 PM
This revision is now accepted and ready to land.Jan 26 2023, 4:33 PM

Thank you River for the review! I don't have commit access. @rriddle could you help us commit this patch? Thank you so much!

hctim added a subscriber: hctim.Feb 1 2023, 9:18 AM

Hey, looks like this broke the asan-aarch64 buildbot. https://lab.llvm.org/buildbot/#/builders/239/builds/771

Sorry it took so long to bisect this down. It seems like the relevant leaks happen because of:

Indirect leak of 104 byte(s) in 1 object(s) allocated from:
    #0 0xaaaad7bce55c in operator new(unsigned long) /home/mitchp/llvm-project/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0xaaaad7e81b44 in __libcpp_operator_new<unsigned long> /home/mitchp/build/libcxx_build_asan/include/c++/v1/new:266:10
    #2 0xaaaad7e81b44 in __libcpp_allocate /home/mitchp/build/libcxx_build_asan/include/c++/v1/new:292:10
    #3 0xaaaad7e81b44 in allocate /home/mitchp/build/libcxx_build_asan/include/c++/v1/__memory/allocator.h:115:38
    #4 0xaaaad7e81b44 in allocate /home/mitchp/build/libcxx_build_asan/include/c++/v1/__memory/allocator_traits.h:268:20
    #5 0xaaaad7e81b44 in __allocation_guard<std::__1::allocator<mlir::ThreadLocalCache<llvm::DenseSet<(anonymous namespace)::ParametricStorageUniquer::HashedStorage, (anonymous namespace)::ParametricStorageUniquer::StorageKeyInfo> >::PerInstanceState> > /home/mitchp/build/libcxx_build_asan/include/c++/v1/__memory/allocation_guard.h:53:18
    #6 0xaaaad7e81b44 in allocate_shared<mlir::ThreadLocalCache<llvm::DenseSet<(anonymous namespace)::ParametricStorageUniquer::HashedStorage, (anonymous namespace)::ParametricStorageUniquer::StorageKeyInfo> >::PerInstanceState, std::__1::allocator<mlir::ThreadLocalCache<llvm::DenseSet<(anonymous namespace)::ParametricStorageUniquer::HashedStorage, (anonymous namespace)::ParametricStorageUniquer::StorageKeyInfo> >::PerInstanceState>, void> /home/mitchp/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:947:48
    #7 0xaaaad7e81b44 in make_shared<mlir::ThreadLocalCache<llvm::DenseSet<(anonymous namespace)::ParametricStorageUniquer::HashedStorage, (anonymous namespace)::ParametricStorageUniquer::StorageKeyInfo> >::PerInstanceState, void> /home/mitchp/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:957:12
    #8 0xaaaad7e81b44 in ThreadLocalCache /home/mitchp/llvm-project/mlir/include/mlir/Support/ThreadLocalCache.h:123:7

And I'm guessing the rest of the leaks are cascading failures from that leak.

I've attached the rest of the logfile from ninja check-mlir inside of the sanitizer bot (https://github.com/google/sanitizers/wiki/SanitizerBotReproduceBuild, the buildscript that you'd want to use is buildbot_bootstrap_asan.sh instead of buildbot_fast.sh). It was generated with the following patch, which allows us to actually see the asan report:

diff --git a/mlir/test/CAPI/pdl.c b/mlir/test/CAPI/pdl.c
index 673ad8d6af7e..4dbada4060da 100644
--- a/mlir/test/CAPI/pdl.c
+++ b/mlir/test/CAPI/pdl.c
@@ -7,7 +7,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-// RUN: mlir-capi-pdl-test 2>&1 | FileCheck %s
+// RUN: mlir-capi-pdl-test 2>&1
+//
+// | FileCheck %s
 
 #include "mlir-c/Dialect/PDL.h"
 #include "mlir-c/BuiltinTypes.h"
diff --git a/mlir/test/CAPI/transform.c b/mlir/test/CAPI/transform.c
index 1db5c2ecafe4..c69e40642c47 100644
--- a/mlir/test/CAPI/transform.c
+++ b/mlir/test/CAPI/transform.c
@@ -7,7 +7,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-// RUN: mlir-capi-transform-test 2>&1 | FileCheck %s
+// RUN: mlir-capi-transform-test 
+//
+// 2>&1 | FileCheck %s
 
 #include "mlir-c/Dialect/Transform.h"
 #include "mlir-c/IR.h"

That patch is applied on top of bcc10817d5569172ee065015747e226280e9b698, which is the commit for this patch.

I'm not exactly sure why this doesn't reproduce on the x86_64 asan bot. I'm sorry about that.

pschuh added a subscriber: pschuh.Feb 3 2023, 1:48 PM