Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[MemCpyOpt] implement multi BB stack-move optimization
ClosedPublic

Authored by khei4 on Jul 16 2023, 8:36 AM.

Details

Summary

This patch extends https://reviews.llvm.org/D153453 to muti-BB case, aims to remove memcpy like https://internals.rust-lang.org/t/the-way-to-see-how-memcpy-and-alloca-introduced-by-move-is-optimized/19148/4

Single-BB version was like the following

(1) The src is fully copied to dest by memcpy, memmove or store&load.
(2) src and dest are 1) static alloca, which is to avoid stacksave/stackrestore handling, 2) unescaped allocas.
(3) The dest has no Mod Ref except full-sized lifetime intrinsic and copy itself, from the alloca to the store.
(4) If the dest has no Mod => src has no Ref, dest has no Ref => src has no Mod from the store to the end of BB.

This patch changes the original single-BB patch's ModRef checking conditions (3) and (4) by using reachability and dominator tree query as the following

(3) Mod/Ref for dest doesn't exist if it's reachable to full-size copy
(4) All uses not post dominated by the full-size copy satisfies (dest-Mod NAND src-Ref) AND (dest-Ref NAND src-Mod),

NOTE: For above conditions, this is more conservative than pcwalton's live analysis approach, and we can transform if all execution passes have no conflict, like cache coherency for the src and dest. It's safe for unescaped static contexts, if any ref, after the other one's mod and before the self-full-size Mod. This Mod could be called Def like the original pcwalton's patch. To do this, need a liveness analysis like the original patch or similar things. Also, unescaped alloca and byval args have almost the same semantics, so we can extend the way also.

pre-commit test: https://reviews.llvm.org/D155422
compile-time regression: https://llvm-compile-time-tracker.com/compare.php?from=2975ccb4b06b3d3aedd86ab21729146e441521d7&to=83fcffd325e79dd89e2b932b053945f868659f56&stat=instructions:u

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
nikic added inline comments.Aug 20 2023, 3:04 AM
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1625

user -> a user

1633

This should use PDomB->getFirstInsertionPoint(). Can you add a test where we need to insert in a block with phi nodes? In that case, the insertion has to be after the phi nodes, not before.

khei4 updated this revision to Diff 551904.Aug 20 2023, 10:43 PM
khei4 marked 5 inline comments as done.

apply feedbacks

  • refine comments
  • use getFirstInsertionPt
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1434–1437

Thanks! Sorry, I didn't notice the reverted change...

1542

This meant adding the Instruction version isPotentiallyReachableFromMany (currently only for BasicBlocks) might be helpful. I copied the code from CFG.h. Although even for the BasicBlocks version, a user is only a few cases.

1633

Thanks! It seems right, I added a test.

This revision is now accepted and ready to land.Aug 21 2023, 7:22 AM
khei4 updated this revision to Diff 552150.Aug 21 2023, 3:02 PM

rebased for test update

nikic added a comment.Aug 22 2023, 3:50 AM

I think the patch description could use an update to focus more on what this patch changes (e.g. the static alloca condition is not new, and just one of many old conditions).

khei4 reopened this revision.Aug 24 2023, 6:50 AM
This revision is now accepted and ready to land.Aug 24 2023, 6:50 AM
khei4 added a comment.EditedAug 24 2023, 6:52 AM

@nikic Thanks! I'll reduce the descriptions!

It seems like a sanitizer crash happened on the createMemoryAccessBefore again. https://lab.llvm.org/buildbot/#/builders/37/builds/24650 I'll revert first!

#4 0x000055b2d243cc22 llvm::MemorySSAUpdater::createMemoryAccessBefore(llvm::Instruction*, llvm::MemoryAccess*, llvm::MemoryUseOrDef*) (/b/sanitizer-x86_64-linux-autoconf/build/tsan_debug_build/bin/clang+++0x5d80c22)
#5 0x000055b2d4bb8be6 llvm::MemCpyOptPass::performStackMoveOptzn(llvm::Instruction*, llvm::Instruction*, llvm::AllocaInst*, llvm::AllocaInst*, unsigned long, llvm::BatchAAResults&) (/b/sanitizer-x86_64-linux-autoconf/build/tsan_debug_build/bin/clang+++0x84fcbe6)
#6 0x000055b2d4bb9714 llvm::MemCpyOptPass::processMemCpy(llvm::MemCpyInst*, llvm::ilist_iterator<llvm::ilist_detail::node_options<llvm::Instruction, false, false, void>, false, false>&) (.part.0) MemCpyOptimizer.cpp:0:0
#7 0x000055b2d4bbe301 llvm::MemCpyOptPass::iterateOnFunction(llvm::Function&) (/b/sanitizer-x86_64-linux-autoconf/build/tsan_debug_build/bin/clang+++0x8502301)

Edit) I'm now searching the failure, I also believe this is similar to the last user insertion patterns.

The reproducer is the case dominator doesn't have any memory access for src. As the following

define void @multi_bb_dom_test0(i1 %b) {
  %src = alloca %struct.Foo, align 4
  %dest = alloca %struct.Foo, align 4
  br i1 %b, label %bb0, label %bb1

bb0:
  store %struct.Foo { i32 10, i32 20, i32 30 }, ptr %src
  br label %bb2

bb1:
  store %struct.Foo { i32 40, i32 50, i32 60 }, ptr %src
  br label %bb2

bb2:
  call void @llvm.lifetime.start.p0(i64 12, ptr nocapture %dest)
  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %dest, ptr align 4 %src, i64 12, i1 false); 1
  %1 = call i32 @use_nocapture(ptr noundef nocapture %dest)

  ret void
}

I'll add a few tests and fix it.

khei4 edited the summary of this revision. (Show Details)Aug 24 2023, 7:16 AM
khei4 updated this revision to Diff 553150.Aug 24 2023, 9:01 AM

implement and use findNearestCommonDominator for InsertionPt

nikic added inline comments.Aug 26 2023, 12:13 PM
llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1434

Shouldn't this be I1?

khei4 updated this revision to Diff 553757.Aug 26 2023, 1:26 PM

tweak comment

llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1434

In this case, I use BasicBlock to represent the terminator of the block, while it's first non-phi in the PDom. This might be confusing, so I commented on it just in time.

nikic accepted this revision.Aug 26 2023, 1:33 PM

LGTM

llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1434

Oh yeah, that makes sense.

This revision was landed with ongoing or failed builds.Aug 26 2023, 2:50 PM
This revision was automatically updated to reflect the committed changes.
khei4 added a comment.EditedAug 26 2023, 3:48 PM

https://lab.llvm.org/buildbot/#/builders/36/builds/37042

Hmm, seems like the assertion failed somehow. (Is this due to the DominatorTree update being forgotten somewhere?)
https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/GenericDomTree.h#L500

 gh -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -std=c++17 -MD -MT lib/Object/CMakeFiles/LLVMObject.dir/MachOObjectFile.cpp.o -MF lib/Object/CMakeFiles/LLVMObject.dir/MachOObjectFile.cpp.o.d -o lib/Object/CMakeFiles/LLVMObject.dir/MachOObjectFile.cpp.o -c /home/buildbots/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/lib/Object/MachOObjectFile.cpp
clang++: /home/buildbots/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/llvm-project/llvm/include/llvm/Support/GenericDomTree.h:500: NodeT* llvm::DominatorTreeBase<NodeT, IsPostDom>::findNearestCommonDominator(NodeT*, NodeT*) const [with NodeT = llvm::BasicBlock; bool IsPostDom = false]: Assertion `NodeA && "A must be in the tree"' failed.
...
 #6 0x00007fffbaeddc90 __assert_fail_base (/lib64/libc.so.6+0x3dc90)
 #7 0x00007fffbaeddd34 __assert_fail (/lib64/libc.so.6+0x3dd34)
 #8 0x00000000131ae08c llvm::DominatorTreeBase<llvm::BasicBlock, false>::findNearestCommonDominator(llvm::BasicBlock*, llvm::BasicBlock*) const (/home/buildbots/ppc64le-lld-multistage-test/ppc64le-lld-multistage-test/install/stage1/bin/clang+++0x131ae08c)
 #9 0x0000000017b08920 llvm::MemCpyOptPass::performStackMoveOptzn(llvm::Instruction*, llvm::Instruction*, llvm::AllocaInst*, llvm::AllocaInst*, unsigned long, llvm::BatchAAResults&)::'lambda0'(llvm::Instruction*, llvm::function_ref<bool (llvm::Instruction*)>)::operator()(llvm::Instruction*, llvm::function_ref<bool (llvm::Instruction*)>) const MemCpyOptimizer.cpp:0:0
#10 0x0000000017b09308 llvm::MemCpyOptPass::performStackMoveOptzn(llvm::Instruction*, llvm::Instruction*, llvm::AllocaInst*, llvm::AllocaInst*, unsigned long,
vitalybuka reopened this revision.Aug 26 2023, 7:26 PM
This revision is now accepted and ready to land.Aug 26 2023, 7:26 PM

@khei4 This happens when one of the instructions is in an unreachable block (disconnected from the CFG and not part of the DT). Missing the special cases here: https://github.com/llvm/llvm-project/blob/e41d383440883fa2f87012b765fcc2d450f6fd55/llvm/lib/IR/Dominators.cpp#L350-L353

khei4 updated this revision to Diff 553843.Aug 27 2023, 10:56 PM

@nikic Thanks! fixed.

nikic added a comment.Mon, Aug 28, 3:33 AM

Is it necessary to handle this for the PDT case as well, or can it not happen there? (Maybe all blocks get connected to the virtual exit block there?)

Is it necessary to handle this for the PDT case, or can it not happen there? (Maybe all blocks get connected to the virtual exit block there?)

@nikic Thanks! It sounds like a good point! I couldn't follow completely, but I do believe so; DT's entry is PDT's leaves, so I guess we can calculate PDom if it's unreachable from entry, but I'll add the test for the case that contains a cycle unreachable from entry.

khei4 updated this revision to Diff 554019.Mon, Aug 28, 9:50 PM

rebase for the test

nikic accepted this revision.Tue, Aug 29, 12:07 AM

LGTM

llvm/test/Transforms/MemCpyOpt/stack-move.ll
631

There are these weird ; 1 comments in some of the tests.

khei4 updated this revision to Diff 554212.Tue, Aug 29, 1:27 AM

remove test noises

This revision was landed with ongoing or failed builds.Tue, Aug 29, 3:40 AM
This revision was automatically updated to reflect the committed changes.
bgraur added a subscriber: bgraur.Thu, Sep 7, 7:09 AM
bgraur added a comment.Thu, Sep 7, 7:39 AM

@khei4 the last commit is causing again lots of false positive for tests executed under asan.

We (the google compilers team) got several tests with no asan findings when built before rG3a1409f93da32bf626f76257e0aac71716f2f67e that trigger stack-use-after-scope when built with this commit.

Coming up with a reproducer is pretty time consuming, could you please revert this until one is available?

vitalybuka reopened this revision.Thu, Sep 7, 11:19 AM
vitalybuka added a subscriber: vitalybuka.

@khei4 the last commit is causing again lots of false positive for tests executed under asan.

We (the google compilers team) got several tests with no asan findings when built before rG3a1409f93da32bf626f76257e0aac71716f2f67e that trigger stack-use-after-scope when built with this commit.

Coming up with a reproducer is pretty time consuming, could you please revert this until one is available?

reverted in efe8aa2e618122e8050af10cc5d6ad83f24ef557

If this optimization is critical, maybe you can split this patch in two?

  1. BB stack-move optimization which strips all lifetime markers from src alloca
  2. insert alloca on dst

Then we can continue related/revert patch 2 as needed.

This revision is now accepted and ready to land.Thu, Sep 7, 11:19 AM
nikic added a comment.Fri, Sep 8, 3:56 AM

Conceptually, this optimization can only cause false negatives with asan, not false positives. We'll need a reproducer to understand what's going wrong.

Conceptually, this optimization can only cause false negatives with asan, not false positives. We'll need a reproducer to understand what's going wrong.

Yes, that's why I think that something is definitely wrong with lifetimes, so better to revert ASAP.

And I believe @joanahalili is working on reproducer. Let me know if I can help.

khei4 added a comment.Mon, Sep 11, 3:30 AM

@vitalybuka @bgraur @joanahalili
Thank you for reporting this!

FWIW, this transformation itself ensures 

  1. allocas are never captured on IR level. (This is the requirements for this transformation.)
  2. newly inserted lifetime.start dominates all use of memory location, and inserted end post-dominates those.

So, at least on the IR level, I believe this doesn't introduce any problematic use for the memory location, but certainly, this changes the memory use of the original source code, and it may introduce the unexpected use of the address, out of the scope that original source code had, by inserting (post)dominating lifetime intrinsics. I assumed lifetime markers are just overwriting the memory location by undef values, and the original lifetime ranges (scope?) for the memory location could be changed by the optimization if we correctly attach lifetime intrinsics, but I think it was wrong for the asan, right? (I'm not sure about the actual false positive case, any concrete counter-example will be appreciated.)
I should have cared about the original lifetime to kindly introduce this optimization, as vitaly says https://reviews.llvm.org/D153453#4556386. I believe this optimization is effective practically e.g. for the Rust programs that introduce memcpy for move, so splitting the patch sounds reasonable.

Assuming that inserting the new lifetime marker(or use) for an address out of the original scope would be problematic for asan, I want to try to use the single range lifetime completely covered other one, as vitaly suggested, for the first step!

khei4 updated this revision to Diff 556518.Mon, Sep 11, 8:29 PM

strip the src lifetime if the transform suceed

khei4 added a comment.Mon, Sep 11, 8:29 PM

@vitalybuka Ah, sorry, I was confused and not sure what you said. Did you mean just stripping the src lifetime work?

@khei4 the last commit is causing again lots of false positive for tests executed under asan.

We (the google compilers team) got several tests with no asan findings when built before rG3a1409f93da32bf626f76257e0aac71716f2f67e that trigger stack-use-after-scope when built with this commit.

Coming up with a reproducer is pretty time consuming, could you please revert this until one is available?

reverted in efe8aa2e618122e8050af10cc5d6ad83f24ef557

If this optimization is critical, maybe you can split this patch in two?

  1. BB stack-move optimization which strips all lifetime markers from src alloca
  2. insert alloca on dst

Then we can continue related/revert patch 2 as needed.

vitalybuka added a comment.EditedWed, Sep 13, 1:32 PM

@vitalybuka Ah, sorry, I was confused and not sure what you said. Did you mean just stripping the src lifetime work?

Yes.

  1. First you land the MemCpyOpt patch which remove lifetime from the preserved alloca. (assuming perf improvement from lifetime markers is smaller then from MemCpyOpt)
  2. Then you land another patch with "correct" lifetimes. Looks like this is most tricky part so we can iterate here.

We have preliminary repro, which I will share by tomorrow.

But I still recommend to split the patch as above. Even with repro, I would not be surprised we hit another case and have to revert.

@khei4 the last commit is causing again lots of false positive for tests executed under asan.

We (the google compilers team) got several tests with no asan findings when built before rG3a1409f93da32bf626f76257e0aac71716f2f67e that trigger stack-use-after-scope when built with this commit.

Coming up with a reproducer is pretty time consuming, could you please revert this until one is available?

reverted in efe8aa2e618122e8050af10cc5d6ad83f24ef557

If this optimization is critical, maybe you can split this patch in two?

  1. BB stack-move optimization which strips all lifetime markers from src alloca
  2. insert alloca on dst

Then we can continue related/revert patch 2 as needed.

vitalybuka accepted this revision.Wed, Sep 13, 1:39 PM

Yes, like this.

LGTM,

If you can wait a few hours I try the patch on our internal code.

llvm/lib/Transforms/Scalar/MemCpyOptimizer.cpp
1612–1613

Please drop TODO: Reconstruct merged lifetime markers.

Please update llvm/test/Transforms/MemCpyOpt/lifetime-missing.ll

khei4 updated this revision to Diff 556741.Wed, Sep 13, 8:45 PM
  • update lifetime-missing.ll
  • drop TODO comment.
khei4 added a comment.Wed, Sep 13, 8:47 PM

@vitalybuka Thank you a lot! I'll try to land this in a day!

If you can wait a few hours I try the patch on our internal code.

I don't see any regressions.