
Experimental Partial Mem2Reg
Needs Review · Public

Authored by huntergr on Sep 14 2021, 2:32 AM.

Details

Summary

Clang's current lowering for OpenMP parallel worksharing loops with a reduction clause blocks many optimization opportunities: the address of the reduction's stack variable is passed to an OpenMP runtime function after the loop, which causes SROA/mem2reg to skip promoting it to SSA form.

The intent of this work is to partially promote the reduction variable to SSA form before the runtime call takes place, so that optimizations (like vectorization) can be performed on a loop like the following:

int loop(int data[restrict 128U]) {
  int retval = 0;

#pragma omp parallel for simd schedule(simd:static) default(none) shared(data) reduction(+:retval)
  for (int i = 0; i < 128; i++) {
    int n = 0;

    if (data[i]) {
      n = 1;
      retval += n;
    }
  }
  return retval;
}

The code as it is right now was written to avoid clashing too much with other code, in order to reduce maintenance costs downstream; I expect I'll need to refactor it considerably, but I would like to hear from reviewers before undertaking that work.

I have a few questions to resolve first:

  • Is this feature something the community wants, or am I just overcomplicating things? Is there an easier way to get the above loop to vectorize?
  • I've been a bit paranoid about ensuring ordering here and used the PostDominatorTree; I think it may be possible to do this with a modification to the IDF algorithm used in mem2reg, but I haven't worked through it yet. Does anyone with more experience have guidance on that?
  • This is currently a separate pass, but it could be implemented as part of the normal SROA/mem2reg optimization pass. Would this be preferred? Does the outcome of the previous question about PostDom trees affect that?

Diff Detail

Event Timeline

huntergr created this revision. Sep 14 2021, 2:32 AM
huntergr requested review of this revision. Sep 14 2021, 2:32 AM
Herald added a project: Restricted Project. Sep 14 2021, 2:32 AM
Herald added a subscriber: sstefan1.

I have seen cases where this would be beneficial;
some of those are just due to lack of inlining, but not all.

I strongly believe this should be part of SROA:
it should analyze the allocas while ignoring captures,
and if an alloca is otherwise promotable, it should:

  1. duplicate the original alloca (only for simplicity, this is fine since we know the old alloca goes away)
  2. before each capture, load contents of the old alloca, and store it into new alloca
  3. after each capture, load contents of the new alloca, and store it into old alloca
  4. change captures to refer to the new alloca
  5. run AggLoadStoreRewriter on the new alloca, so that all the uses of the old alloca we've just introduced are analyzable by SROA
  6. proceed with normal handling of the old alloca - mem2reg will now succeed

I agree this should be part of mem2reg/SROA unless there is a specific reason against it (e.g. computational complexity high enough that it should not also run with every occurrence of SROA/mem2reg in the default pipeline).

Your motivating code looks like it should be processable by LICM, such that it is promoted to registers while in the loop, then vectorized. Do you know why this doesn't happen?

I scanned the diff for nosync without hits. I doubt any of this reasoning is valid if I can have synchronization between threads.

That said, I think we need to use the fact that we know the value stored in the alloca is not captured. There was an email thread on this problem and email threads on how we could encode that it is not captured.
Given that this occurs in the OpenMP context, nosync is probably not an alternative.

I have seen cases where this would be beneficial;
some of those are just due to lack of inlining, but not all.

I strongly believe this should be part of SROA:
it should analyze the allocas while ignoring captures,
and if an alloca is otherwise promotable, it should:

  1. duplicate the original alloca (only for simplicity, this is fine since we know the old alloca goes away)
  2. before each capture, load contents of the old alloca, and store it into new alloca
  3. after each capture, load contents of the new alloca, and store it into old alloca
  4. change captures to refer to the new alloca
  5. run AggLoadStoreRewriter on the new alloca, so that all the uses of the old alloca we've just introduced are analyzable by SROA
  6. proceed with normal handling of the old alloca - mem2reg will now succeed

Hi, thanks for the suggestion (and sorry for the delay in responding).

I've implemented something similar to what you've suggested, but with a slight difference to make it fit the problem at hand: the OpenMP reduction present in the loop. There's a key difference I didn't state in my initial summary (though it was present in the unit test), which is the way the alloca is captured. It isn't passed directly as an argument to the function; instead, the pointer is first stored into another local memory address, and the pointer to that second memory address is then passed to __kmpc_reduce_nowait. This makes the code somewhat messy, as I have to check that the store of the pointer dominates the call, that there aren't other uses of the second alloca that might interfere with conversion, and so on.

The way that's done makes me wonder whether libomp needs a lighter-weight interface for reductions involving a single scalar value, rather than just a single generic interface which accepts an arbitrary number of reduction variables. (For comparison, I looked into what gcc does -- it passes a pointer to a shared reduction variable into the outlined function and performs the atomic operation directly instead of calling into the runtime.)

So I think that I'll repurpose this patch to only cover the direct case of an alloca being used in a call and separate out the libomp side of things for another patch. I'll update the diff once I've implemented that.

I agree this should be part of mem2reg/SROA unless there is a specific reason against it (e.g. computational complexity high enough that it should not also run with every occurrence of SROA/mem2reg in the default pipeline).

Your motivating code looks like it should be processable by LICM, such that it is promoted to registers while in the loop, then vectorized. Do you know why this doesn't happen?

mem2reg handles promotion to registers, but for LICM specifically there are a couple of things which would stop it:

  1. Although the address is loop invariant, the data isn't.
  2. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

I scanned the diff for nosync without hits. I doubt any of this reasoning is valid if I can have synchronization between threads.

That's part of the reason my original patch only changed uses before a capture (the other being possible aliasing within a thread -- a terrible idea, but someone somewhere has probably written something which relies on it). I could restrict it to avoid converting any allocas which use atomic operations.

That said, I think we need to use the fact that we know the value stored in the alloca is not captured. There was an email thread on this problem and email threads on how we could encode that it is not captured.
Given that this occurs in the OpenMP context, nosync is probably not an alternative.

I think we can use Roman's approach at least when the alloca is passed as a 'nocapture' argument, which will give us some benefit even if it doesn't solve all of my initial problem. Do you agree?

I'm not sure about the best way of marking the store of the first alloca's pointer into the second alloca's memory as nocapture, though. If we have a way of doing it then I can extend the work in a later patch to cover that case; if not, maybe we can change the way clang and libomp handle OpenMP reductions to make it easier to optimize outlined functions.

  1. Although the address is loop invariant, the data isn't.

LICM does scalar promotion (controlled by -disable-licm-promotion), as in "promote memory location to register". It doesn't matter whether the value at the location is invariant. Whether this belongs into a pass called "Loop Invariant Code Motion" is a different question.

  1. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

This patch adds another pass rather than making mem2reg do it. LICM currently does not handle conditional control flow for scalar promotion, but it should require much less code to change that. See the use of isGuaranteedToExecute in llvm::promoteLoopAccessesToScalars.

PartialMemToReg uses isAllocaPromotable to ensure that the target is write-accessible and no bit is needed; why not do the same for LICM?

llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
77–78

This is not a sufficient condition for captures. I doubt that we can detect that something has been generated from a CapturedStmt just by looking at the IR.

llvm/test/Transforms/Mem2Reg/partial-mem2reg.ll
3

This tests too many passes at once

jdoerfert added inline comments. Thu, Oct 21, 8:32 AM
llvm/lib/Transforms/Utils/PromoteMemoryToRegister.cpp
634

I doubt this logic works in loops.

H: 
   I = use(alloca);
C: store alloca into mem
if (...) goto H;

The capture (C) post-dominates the use (I), but it executes both *after* and *before* the use, just not in the same iteration of the loop defined by H.

Once the alloca is captured you cannot judge anymore without a lot more analysis (incl. nosync). To salvage this, reachability, not post-dominance, is what you are looking for.

All that said, I still believe the problem at hand should be solved by marking the reduction thing as not capturing.

Does this work for you:

diff --git a/llvm/lib/Analysis/CaptureTracking.cpp b/llvm/lib/Analysis/CaptureTracking.cpp
index 8955658cb9e7..41251d2676e6 100644
--- a/llvm/lib/Analysis/CaptureTracking.cpp
+++ b/llvm/lib/Analysis/CaptureTracking.cpp
@@ -373,9 +373,13 @@ void llvm::PointerMayBeCaptured(const Value *V, CaptureTracker *Tracker,
     case Instruction::Store:
       // Stored the pointer - conservatively assume it may be captured.
       // Volatile stores make the address observable.
-      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile())
+      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile()) {
+        if (auto *AI = dyn_cast<AllocaInst>(I->getOperand(1)->stripInBoundsOffsets()))
+          if (AI->hasMetadata("nocapture_storage"))
+            break;
         if (Tracker->captured(U))
           return;
+      }
       break;
     case Instruction::AtomicRMW: {
       // atomicrmw conceptually includes both a load and store from

And then add !nocapture_storage !0 after the alloca in your example, as well as !0 = !{!0} at the end of that file.

  1. Although the address is loop invariant, the data isn't.

LICM does scalar promotion (controlled by -disable-licm-promotion), as in "promote memory location to register". It doesn't matter whether the value at the location is invariant. Whether this belongs into a pass called "Loop Invariant Code Motion" is a different question.

  1. For this loop in particular, the store is conditional so might never happen. We *could* add a second boolean reduction to determine whether or not to actually perform a store after the loop, but that's a bit more complicated than just letting mem2reg do what it should.

This patch adds another pass rather than making mem2reg do it. LICM currently does not handle conditional control flow for scalar promotion, but it should require much less code to change that. See the use of isGuaranteedToExecute in llvm::promoteLoopAccessesToScalars.

Sorry, I should have made it more clear -- I'm dropping the new pass and using Roman's suggestion of improving SROA. I have implemented that but found the code a bit messy due to the store -> call separation.

Does this work for you:

diff --git a/llvm/lib/Analysis/CaptureTracking.cpp b/llvm/lib/Analysis/CaptureTracking.cpp
index 8955658cb9e7..41251d2676e6 100644
--- a/llvm/lib/Analysis/CaptureTracking.cpp
+++ b/llvm/lib/Analysis/CaptureTracking.cpp
@@ -373,9 +373,13 @@ void llvm::PointerMayBeCaptured(const Value *V, CaptureTracker *Tracker,
     case Instruction::Store:
       // Stored the pointer - conservatively assume it may be captured.
       // Volatile stores make the address observable.
-      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile())
+      if (U->getOperandNo() == 0 || cast<StoreInst>(I)->isVolatile()) {
+        if (auto *AI = dyn_cast<AllocaInst>(I->getOperand(1)->stripInBoundsOffsets()))
+          if (AI->hasMetadata("nocapture_storage"))
+            break;
         if (Tracker->captured(U))
           return;
+      }
       break;
     case Instruction::AtomicRMW: {
       // atomicrmw conceptually includes both a load and store from

And then add !nocapture_storage !0 after the alloca in your example, as well as !0 = !{!0} at the end of that file.

Ah, the 'nocapture_storage' metadata is what I've been missing, thanks. I'll update the diff once I've added that and adjusted the tests.


Technically, this is not yet something we have in the IR. We can reply to the old thread in which different solutions were discussed and
propose this one again. Then modify Clang to emit the metadata for the reduction case and land the diff I posted. All that said, it works
for your case, right?