
Experimental Partial Mem2Reg
Needs Review · Public

Authored by huntergr on Sep 14 2021, 2:32 AM.

Details

Summary

Clang's current lowering for OpenMP parallel worksharing loops with a reduction clause blocks many optimization opportunities: the address of the reduction's stack variable is passed to an OpenMP runtime function after the loop, which causes SROA/mem2reg to skip promoting it to SSA form.

The intent of this work is to partially promote the reduction variable to SSA form before the runtime call takes place, so that optimizations (like vectorization) can be performed on a loop like the following:

int loop(int data[restrict 128U]) {
  int retval = 0;

#pragma omp parallel for simd schedule(simd:static) default(none) shared(data) reduction(+:retval)
  for (int i = 0; i < 128; i++) {
    int n = 0;

    if (data[i]) {
      n = 1;
      retval += n;
    }
  }
  return retval;
}

The code as it is right now was written to avoid clashing too much with other code, in order to reduce maintenance costs downstream; I expect I'll need to refactor it considerably, but I would like to hear from reviewers before undertaking that work.

I have a few questions to resolve first:

  • Is this feature something the community wants, or am I just overcomplicating things? Is there an easier way to get the above loop to vectorize?
  • I've been a bit paranoid about ensuring ordering here and used the PostDominatorTree; I think it may be possible to do this with a modification to the IDF algorithm used in mem2reg, but I haven't worked through it yet. Does anyone with more experience have guidance on that?
  • This is currently a separate pass, but it could be implemented as part of the normal SROA/mem2reg optimization pass. Would this be preferred? Does the outcome of the previous question about PostDom trees affect that?

Diff Detail

Event Timeline

huntergr created this revision. Sep 14 2021, 2:32 AM
huntergr requested review of this revision. Sep 14 2021, 2:32 AM
Herald added a project: Restricted Project. Sep 14 2021, 2:32 AM
Herald added a subscriber: sstefan1.

I have seen cases where this would be beneficial;
some of those are just due to lack of inlining, but not all.

I strongly believe this should be part of SROA:
it should analyze the allocas while ignoring captures,
and if an alloca is otherwise promotable, it should:

  1. duplicate the original alloca (only for simplicity, this is fine since we know the old alloca goes away)
  2. before each capture, load contents of the old alloca, and store it into new alloca
  3. after each capture, load contents of the new alloca, and store it into old alloca
  4. change captures to refer to the new alloca
  5. run AggLoadStoreRewriter on the new alloca, so that all the uses of the old alloca we've just introduced are analyzable by SROA
  6. proceed with normal handling of the old alloca - mem2reg will now succeed

I agree this should be part of mem2reg/SROA unless there is a specific reason against it (e.g. computational complexity high enough that it should not also run with every occurrence of SROA/mem2reg in the default pipeline).

Your motivating code looks like it should be processable by LICM, such that it is promoted to registers while in the loop, then vectorized. Do you know why this doesn't happen?

I scanned the diff for nosync without hits. I doubt any of this reasoning is valid if I can have synchronization between threads.

That said, I think we need to use the fact that we know the value stored in the alloca is not captured. There was an email thread on this problem and email threads on how we could encode that it is not captured.
Given that this occurs in the OpenMP context, nosync is probably not an alternative.

I have seen cases where this would be beneficial;
some of those are just due to lack of inlining, but not all.

I strongly believe this should be part of SROA:
it should analyze the allocas while ignoring captures,
and if an alloca is otherwise promotable, it should:

  1. duplicate the original alloca (only for simplicity, this is fine since we know the old alloca goes away)
  2. before each capture, load contents of the old alloca, and store it into new alloca
  3. after each capture, load contents of the new alloca, and store it into old alloca
  4. change captures to refer to the new alloca
  5. run AggLoadStoreRewriter on the new alloca, so that all the uses of the old alloca we've just introduced are analyzable by SROA
  6. proceed with normal handling of the old alloca - mem2reg will now succeed

Hi, thanks for the suggestion (and sorry for the delay in responding).

I've implemented something similar to what you've suggested, but with a slight difference to make it fit the problem at hand: the OpenMP reduction present in the loop. There's a key difference I didn't state in my initial summary (though it was present in the unit test), which is the way the alloca is captured. It isn't passed directly as an argument to the function; instead, the pointer is first stored into another local memory address, and the pointer to that second memory address is then passed to __kmpc_reduce_nowait. This makes the code somewhat messy, as I have to check that the store of the pointer dominates the call, that there aren't other uses of the second alloca that might interfere with conversion, and so on.

The way that's done makes me wonder whether libomp needs a lighter-weight interface for reductions involving a single scalar value, rather than just a single generic interface which accepts an arbitrary number of reduction variables. (For comparison, I looked into what gcc does -- it passes a pointer to a shared reduction variable into the outlined function and performs the atomic operation directly instead of calling into the runtime.)

So I think that I'll repurpose this patch to only cover the direct case of an alloca being used in a call and separate out the libomp side of things for another patch. I'll update the diff once I've implemented that.

I agree this should be part of mem2reg/SROA unless there is a specific reason against it (e.g. computational complexity high enough that it should not also run with every occurrence of SROA/mem2reg in the default pipeline).

Your motivating code looks like it should be processable by LICM, such that it is promoted to registers while in the loop, then vectorized. Do you know why this doesn't happen?

mem2reg handles promotion to registers, but for LICM specifically there are a couple of things which would stop it:

  1. Although the address is loop invariant, the data isn't.
  2. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

I scanned the diff for nosync without hits. I doubt any of this reasoning is valid if I can have synchronization between threads.

That's part of the reason my original patch only changed uses before a capture (the other being possible aliasing within a thread -- a terrible idea, but someone somewhere has probably written something which relies on it). I could restrict it to avoid converting any allocas which use atomic operations.

That said, I think we need to use the fact that we know the value stored in the alloca is not captured. There was an email thread on this problem and email threads on how we could encode that it is not captured.
Given that this occurs in the OpenMP context, nosync is probably not an alternative.

I think we can use Roman's approach at least when the alloca is passed as a 'nocapture' argument, which will give us some benefit even if it doesn't solve all of my initial problem. Do you agree?

I'm not sure about the best way of marking the store of the first alloca's pointer into the second alloca's memory as nocapture, though. If we have a way of doing it then I can extend the work in a later patch to cover that case; if not, maybe we can change the way clang and libomp handle OpenMP reductions to make it easier to optimize outlined functions.

  1. Although the address is loop invariant, the data isn't.

LICM does scalar promotion (controlled by -disable-licm-promotion), as in "promote memory location to register". It doesn't matter whether the value at the location is invariant. Whether this belongs into a pass called "Loop Invariant Code Motion" is a different question.

  1. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

This patch adds another pass rather than making mem2reg do it. LICM currently does not handle conditional control flow for scalar promotion, but it should require much less code to change that. See the use of isGuaranteedToExecute in llvm::promoteLoopAccessesToScalars.

PartialMemToReg uses isAllocaPromotable to ensure that the target is write-accessible and no bit is needed; why not do the same for LICM?

llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
77–78

This is not a sufficient condition for captures. I doubt that we can detect that something has been generated from a CapturedStmt just by looking at the IR.

llvm/test/Transforms/Mem2Reg/partial-mem2reg.ll
3

This tests too many passes at once

jdoerfert added inline comments. Thu, Oct 21, 8:32 AM
llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
634

I doubt this logic works in loops.

H: 
   I = use(alloca);
C: store alloca into mem
if (...) goto H;

The capture (C) post-dominates the use (I), but it executes both *after* and *before* the use, just not in the same iteration of the loop defined by H.

Once the alloca is captured you cannot judge anymore without a lot more analysis (incl. nosync). To salvage this, reachability, not post-dominance, is what you are looking for.

All that said, I still believe the problem at hand should be solved by marking the reduction thing as not capturing.

Does this work for you:

diff --git a/llvm/lib/Analysis/CaptureTracking.cpp b/llvm/lib/Analysis/CaptureTracking.cpp
index 8955658cb9e7..41251d2676e6 100644
--- a/llvm/lib/Analysis/CaptureTracking.cpp
+++ b/llvm/lib/Analysis/CaptureTracking.cpp
@@ -373,9 +373,13 @@ void llvm::PointerMayBeCaptured(const Value *V, CaptureTracker *Tracker,
     case Instruction::Store:
       // Stored the pointer - conservatively assume it may be captured.
       // Volatile stores make the address observable.
-      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile())
+      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile()) {
+        if (auto *AI = dyn_cast<AllocaInst>(I->getOperand(1)->stripInBoundsOffsets()))
+          if (AI->hasMetadata("nocapture_storage"))
+            break;
         if (Tracker->captured(U))
           return;
+      }
       break;
     case Instruction::AtomicRMW: {
       // atomicrmw conceptually includes both a load and store from

And then add !nocapture_storage !0 after the alloca in your example, as well as !0 = !{!0} at the end of that file.

  1. Although the address is loop invariant, the data isn't.

LICM does scalar promotion (controlled by -disable-licm-promotion), as in "promote memory location to register". It doesn't matter whether the value at the location is invariant. Whether this belongs into a pass called "Loop Invariant Code Motion" is a different question.

  1. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

This patch adds another pass rather than making mem2reg do it. LICM currently does not handle conditional control flow for scalar promotion, but it should require much less code to change that. See the use of isGuaranteedToExecute in llvm::promoteLoopAccessesToScalars.

Sorry, I should have made it more clear -- I'm dropping the new pass and using Roman's suggestion of improving SROA. I have implemented that but found the code a bit messy due to the store -> call separation.

Does this work for you:

diff --git a/llvm/lib/Analysis/CaptureTracking.cpp b/llvm/lib/Analysis/CaptureTracking.cpp
index 8955658cb9e7..41251d2676e6 100644
--- a/llvm/lib/Analysis/CaptureTracking.cpp
+++ b/llvm/lib/Analysis/CaptureTracking.cpp
@@ -373,9 +373,13 @@ void llvm::PointerMayBeCaptured(const Value *V, CaptureTracker *Tracker,
     case Instruction::Store:
       // Stored the pointer - conservatively assume it may be captured.
       // Volatile stores make the address observable.
-      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile())
+      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile()) {
+        if (auto *AI = dyn_cast<AllocaInst>(I->getOperand(1)->stripInBoundsOffsets()))
+          if (AI->hasMetadata("nocapture_storage"))
+            break;
         if (Tracker->captured(U))
           return;
+      }
       break;
     case Instruction::AtomicRMW: {
       // atomicrmw conceptually includes both a load and store from

And then add !nocapture_storage !0 after the alloca in your example, as well as !0 = !{!0} at the end of that file.

Ah, the 'nocapture_storage' metadata is what I've been missing, thanks. I'll update the diff once I've added that and adjusted the tests.


Technically, this is not yet something we have in the IR. We can reply to the old thread in which different solutions were discussed and
propose this one again. Then modify Clang to emit the metadata for the reduction case and land the diff I posted. All that said, it works
for your case, right?