This is an archive of the discontinued LLVM Phabricator instance.

[mlir] use shared pointer to prevent vector resizes from destroying ops
ClosedPublic

Authored by ashay-github on Jul 5 2023, 7:25 AM.

Details

Summary

The MapVector type stores key-value pairs in a vector, which, when
resized, copies the entries and destroys the old ones. This causes the
underlying operations to be deleted, subsequently causing segfaults.

This patch makes the mappings map type refer to a shared pointer
instead, so that resizing the vector doesn't call the operations'
destructors.

Diff Detail

Event Timeline

ashay-github created this revision.Jul 5 2023, 7:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 7:25 AM
ashay-github requested review of this revision.Jul 5 2023, 7:25 AM
ftynse added a comment.Jul 5 2023, 7:27 AM

Do you have a test case where this was triggered?

springerm added inline comments.Jul 5 2023, 7:35 AM
mlir/lib/Dialect/Transform/IR/TransformOps.cpp
832 ↗(On Diff #537352)

getPayloadOp returns an iterator that skips deleted ops. It sounds like there is bug somewhere. A test case would be useful, then I can debug it.

Sadly, I can reproduce the problem only with MSVC Debug builds. The following test case (extracted verbatim from test-interpreter.mlir) triggers the bug:

func.func @bar() {
  // expected-remark @below {{transform applied}}
  %0 = arith.constant 0 : i32
  return
}

transform.with_pdl_patterns {
^bb0(%arg0: !transform.any_op):
  pdl.pattern @const : benefit(1) {
    %r = pdl.types
    %0 = pdl.operation "arith.constant" -> (%r : !pdl.range<type>)
    pdl.rewrite %0 with "transform.dialect"
  }

  transform.sequence %arg0 : !transform.any_op failures(propagate) {
  ^bb1(%arg1: !transform.any_op):
    %f = pdl_match @const in %arg1 : (!transform.any_op) -> !transform.any_op
    transform.foreach %f : !transform.any_op {
    ^bb2(%arg2: !transform.any_op):
      // expected-remark @below {{1}}
      transform.test_print_number_of_associated_payload_ir_ops %arg2 : !transform.any_op
      transform.test_print_remark_at_operand %arg2, "transform applied" : !transform.any_op
    }
  }
}

If it helps, this problem surfaced after the switch from DenseMap to MapVector (https://reviews.llvm.org/D153765).

mlir/lib/Dialect/Transform/IR/TransformOps.cpp
832 ↗(On Diff #537352)

True, but in checking for deleted ops, it checks for nullptr, whereas here, the pointer points to 0xdddddddddddddddddd, so it doesn't get filtered out.

springerm added inline comments.Jul 5 2023, 7:55 AM
mlir/lib/Dialect/Transform/IR/TransformOps.cpp
832 ↗(On Diff #537352)

Do you know which C++ expression points to 0xdddddddddddddddddd?

springerm added a comment.EditedJul 5 2023, 8:03 AM

I tried running the test case with ASAN and it does not detect any invalid memory accesses. Looking at the test case, there is no transform op that deletes operations. It would be interesting to see where the memory was deallocated. When there is a memory issue, ASAN shows where the memory was allocated, freed and accessed. Can you get such debug information from MSVC?

ashay-github added inline comments.Jul 5 2023, 8:04 AM
mlir/lib/Dialect/Transform/IR/TransformOps.cpp
832 ↗(On Diff #537352)

It's the op pointer, in the second iteration of the loop (the first iteration op pointer points to a valid operation).

I tried running the test case with ASAN and it does not detect any invalid memory accesses. Can you get such debug information from MSVC?

I see. Thanks, let me check.

Thanks much for the tip to use the sanitizer! The problem is that since
MapVector uses a vector for storing the key-value pairs, adding an entry to the
map can trigger a vector resize, which calls the destructor after copying the
mappings. The destructor ends up deleting the operation, ultimately causing
the segfault.

You can see the calls to the destructor using the following patch:

--- a/mlir/include/mlir/Dialect/Transform/IR/TransformInterfaces.h
+++ b/mlir/include/mlir/Dialect/Transform/IR/TransformInterfaces.h
@@ -177,6 +177,7 @@ private:
     ParamMapping params;
     ValueMapping values;
     ValueMapping reverseValues;
+    ~Mappings() { llvm::errs() << "destroying mappings\n"; }
   };

   friend LogicalResult applyTransforms(Operation *, TransformOpInterface,
@@ -285,7 +286,9 @@ public:
     /// transform IR region and payload IR objects.
     RegionScope(TransformState &state, Region &region)
         : state(state), region(&region) {
-      auto res = state.mappings.insert(std::make_pair(&region, Mappings()));
+      llvm::errs() << "about to add mappings\n";
+      auto res = state.mappings.insert(std::make_pair(&region, std::make_shared<Mappings>()));
+      llvm::errs() << "finished adding mappings\n";
       assert(res.second && "the region scope is already present");
       (void)res;
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS

If you now run `mlir-opt --test-transform-dialect-interpreter
--verify-diagnostics` on the test case I pasted a few hours ago, you'll see
messages like here:

...
about to add mappings
destroying mappings
destroying mappings
destroying mappings
finished adding mappings
...

I wasn't able to get a move constructor working for the Mappings struct, so I
modified the map definition to use a shared pointer to the struct. Is there a
better way to resolve the problem?

ashay-github retitled this revision from [mlir] save pointers to payload ops before applying transform to [mlir] use shared pointer to prevent vector resizes from destroying ops.Jul 5 2023, 11:16 AM
ashay-github edited the summary of this revision. (Show Details)

The destructor ends up deleting the operation, ultimately causing
the segfault.

This does not make sense to me yet. Where is the operation being deleted? Mappings is just a collection of DenseMaps and they do not own any operations.

ashay-github abandoned this revision.Jul 6 2023, 9:04 AM

Here's the slightly truncated output from running mlir-opt with ASAN. When Mappings is destroyed, it looks like it ends up deleting the operation as well. Let me know if you'd prefer to see the full ASAN output.

=================================================================
==3936==ERROR: AddressSanitizer: heap-use-after-free on address 0x1269b44a5680 at pc 0x7ff762b7b7cd bp 0x0076331811d0 sp 0x0076331811d8
READ of size 8 at 0x1269b44a5680 thread T0
    #0 0x7ff762b7b7cc in llvm::filter_iterator_base<class mlir::Operation *const *, class `public: __cdecl mlir::transform::TransformState::getPayloadOps(class mlir::Value) const'::`2'::<lambda_1>, struct std::bidirectional_iterator_tag>::findNextValid(void) llvm-project\llvm\include\llvm\ADT\STLExtras.h:468
    #1 0x7ff762b76bfc in llvm::filter_iterator_base<class mlir::Operation *const *, class `public: __cdecl mlir::transform::TransformState::getPayloadOps(class mlir::Value) const'::`2'::<lambda_1>, struct std::bidirectional_iterator_tag>::operator++(void) llvm-project\llvm\include\llvm\ADT\STLExtras.h:488
    #2 0x7ff76c066161 in mlir::transform::ForeachOp::apply(class mlir::transform::TransformRewriter &, class mlir::transform::TransformResults &, class mlir::transform::TransformState &) llvm-project\mlir\lib\Dialect\Transform\IR\TransformOps.cpp:831
...

0x1269b44a5680 is located 512 bytes inside of 2560-byte region [0x1269b44a5480,0x1269b44a5e80)
freed by thread T0 here:
    #0 0x7ff770de4601 in operator delete(void *, unsigned __int64, enum std::align_val_t) D:\a\_work\1\s\src\vctools\asan\llvm\compiler-rt\lib\asan\asan_win_delete_scalar_size_align_thunk.cpp:42
    #1 0x7ff76254c146 in llvm::deallocate_buffer(void *, unsigned __int64, unsigned __int64) llvm-project\llvm\lib\Support\MemAlloc.cpp:25
    #2 0x7ff7657de1b2 in llvm::DenseMap<class mlir::Value, class llvm::SmallVector<class mlir::Operation *, 2>, struct llvm::DenseMapInfo<class mlir::Value, void>, struct llvm::detail::DenseMapPair<class mlir::Value, class llvm::SmallVector<class mlir::Operation *, 2>>>::~DenseMap<class mlir::Value, class llvm::SmallVector<class mlir::Operation *, 2>, struct llvm::DenseMapInfo<class mlir::Value, void>, struct llvm::detail::DenseMapPair<class mlir::Value, class llvm::SmallVector<class mlir::Operation *, 2>>>(void) llvm-project\llvm\include\llvm\ADT\DenseMap.h:782
    #3 0x7ff7657de87b in mlir::transform::TransformState::Mappings::~Mappings(void) (llvm-project\build\bin\mlir-opt.exe+0x1435de87b)
...
    #13 0x7ff7657dda81 in mlir::transform::TransformState::RegionScope::RegionScope(class mlir::transform::TransformState &, class mlir::Region &) llvm-project\mlir\include\mlir\Dialect\Transform\IR\TransformInterfaces.h:288
    #14 0x7ff7657ed5e6 in mlir::transform::TransformState::make_region_scope(class mlir::Region &) llvm-project\mlir\include\mlir\Dialect\Transform\IR\TransformInterfaces.h:839
    #15 0x7ff76c066301 in mlir::transform::ForeachOp::apply(class mlir::transform::TransformRewriter &, class mlir::transform::TransformResults &, class mlir::transform::TransformState &) llvm-project\mlir\lib\Dialect\Transform\IR\TransformOps.cpp:832
...

Yes, the full ASAN output would be useful. I think this is not deleting the operation itself, but the data structure of Operation * that getPayloadOps is iterating over. That data structure is part of the Mappings. The realloc will copy+remove that data structure. I think I'm starting to see the issue here....

I was thinking of ways to avoid the shared_ptr, but this may actually be the simplest way to fix the issue here.

I think this is not deleting the operation itself, but the data structure of Operation * that getPayloadOps is iterating over.

I see, yeah, that makes sense. Here's the full ASAN output: https://pastebin.com/raw/SZautzim.

Actually can you try if a unique ptr works? Or is that what you meant when you said you couldn’t make the move semantics work?

Using unique pointer worked! Earlier, I was just trying to add a move
constructor to the Mappings struct, but since I ran into more problems, I had
used a shared pointer instead.

springerm accepted this revision.Jul 7 2023, 1:55 AM

Thanks for fixing this. This would be huge pain for the next person who runs into it.

mlir/include/mlir/Dialect/Transform/IR/TransformInterfaces.h
441

You can return *it->second.get() and return a Mappings & from this function (and all the other functions), then this change will be really small.

687–690

Let's explain in the comment why a pointer indirection is necessary: The mappings are being iterated over while nested transform ops are applied and a reallocation inside the MapVector must not invalidate the iterators.

This revision is now accepted and ready to land.Jul 7 2023, 1:55 AM

Updated to make getMapping() return a reference instead of a pointer.

Thanks for your help! The next time something like this happens, I should be
able to use ASAN to submit better bug reports.

ashay-github marked 2 inline comments as done.Jul 7 2023, 7:59 AM