Page MenuHomePhabricator

[LoopVectorize] Support reductions that store intermediary result
AcceptedPublic

Authored by igor.kirillov on Sep 22 2021, 5:51 AM.

Details

Summary

Adds ability to vectorize loops containing a store to a loop-invariant address as part of a reduction that isn't converted to SSA form due to lack of aliasing info. Runtime checks are generated to ensure the store does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Tagging some more potential reviewers.

llvm/test/Transforms/LoopVectorize/reduction.ll
539

Given that it is not supported, such a test would demonstrate that your bailout logic for this case is working as intended.

fhahn added a comment.Oct 11 2021, 2:37 AM

to SSA form due to lack of aliasing info

It would be good to mention *why* those particular stores can be sunk during loop vectorization. IIUC the patch relies on the runtime checks generated to ensure the store does not alias other accesses in the loop.

It seems like the main problem is that we potentially bail out too early at the moment when checking for reductions due to the store, but once we generate runtime checks, sinking the store may become legal (see inline comment about loads to the same address)? If that's the case, ideally we'd just sink any such loads/stores before detecting reductions once we know they can be sunk due to runtime checks, but unfortunately I do not think that's possible with the current structure/ordering.

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
642

Are the stores invariant or to a uniform address? InvariantStores implies they are invariant, which may not be the case?

llvm/include/llvm/Transforms/Utils/LoopUtils.h
402

Is LoopUtils.h the right place for this? Is there anything loop related?

llvm/lib/Analysis/IVDescriptors.cpp
301

What about existing users of isReductionPHI which currently may rely on the fact that all instruction in the loop must either be phis or reduction operations? Also, with respect to the store restriction, the important bit is that the final value is also stored, right?

310

You are only checking for loop-invariant addresses, so should this be loop invariant memory location?

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

What about loads to the same address in the loop? At the moment, LAA cannot analyze dependences with invariant addresses. But if this limitation gets removed the code here may become incorrect, because it relies on this limitation IIUC?

david-arm added inline comments.Oct 11 2021, 3:51 AM
llvm/include/llvm/Analysis/LoopAccessAnalysis.h
642

I assume here that InvariantStores refers to stores to an invariant (i.e. uniform?) address, rather than storing an invariant value to variant address. If so, perhaps it could be named StoresToInvariantAddress or something like that?

david-arm added inline comments.Oct 11 2021, 6:53 AM
llvm/test/Transforms/LoopVectorize/reduction.ll
471

Is it possible to add a simple floating point test with "fadd fast"?

It seems like the main problem is that we potentially bail out too early at the moment when checking for reductions due to the store, but once we generate runtime checks, sinking the store may become legal (see inline comment about loads to the same address)? If that's the case, ideally we'd just sink any such loads/stores before detecting reductions once we know they can be sunk due to runtime checks, but unfortunately I do not think that's possible with the current structure/ordering.

Is it worth for me to explore if it possible to do other way around or we should work on with the solution from this merge?

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
642

@fhah Just to make sure we are on the same line - for me uniform and invariant are just synonyms, isn't it so?

llvm/include/llvm/Transforms/Utils/LoopUtils.h
402

What do you think if I make this function a public member of class ScalarEvolution? Or is there a better place for it?

llvm/lib/Analysis/IVDescriptors.cpp
301

I checked and only LoopInterchangeLegality is using this function and it is not affected by stores. Anyway, I can add a parameter to RecurrenceDescriptor::isReductionPHI or a member to RecurrenceDescriptor allowing or not to handle stores. What do you think about it?

The comment is to be updated, yes.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

Loads from or stores to the same address in the loop? I'm sorry could you clarify what the problem is. As it is I don't understand the message.

Hi @igor.kirillov, is it also possible to get this working for ordered reductions, i.e.

float sum = 0;
for(i=0..N) {
  sum += src[i];
  dst[42] = sum;
}

when building with -O3? I think it might mean updating checkOrderedReductions to look through the store. If it looks too difficult to do as part of this patch we can always follow-up with a patch later.

igor.kirillov marked 4 inline comments as done.

Update commit message and comments
Move storeToSameAddress to ScalarEvolution
Add fadd fast test
Do not apply patch for ordered fadd reductions

igor.kirillov marked an inline comment as done.Sun, Nov 7, 5:01 AM

Hi @igor.kirillov, is it also possible to get this working for ordered reductions, i.e.

float sum = 0;
for(i=0..N) {
  sum += src[i];
  dst[42] = sum;
}

when building with -O3? I think it might mean updating checkOrderedReductions to look through the store. If it looks too difficult to do as part of this patch we can always follow-up with a patch later.

Yes, I added a check to LoopVectorizationLegality::canVectorizeFPMath so as not to allow this reduction when math is strict. Enabling it requires some work and it is better to do it separately.

llvm/test/Transforms/LoopVectorize/reduction.ll
471

Added! see reduc_store_fadd_fast function

igor.kirillov edited the summary of this revision. (Show Details)Sun, Nov 7, 5:02 AM
igor.kirillov marked an inline comment as done.
david-arm added inline comments.Mon, Nov 8, 6:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
964

Maybe this can be folded into the all_of case below, i.e.

return (all_of(getReductionVars(), [&](auto &Reduction) -> bool {
  const RecurrenceDescriptor &RdxDesc = Reduction.second;
  return !RdxDesc.hasExactFPMath() ||
         (RdxDesc.isOrdered() && !RdxDesc.IntermediateStore);
}));

Also, the problem with this code at the moment is that you could have a mixture of fast and ordered reductions in the same loop. There could be an intermediate store for one of the fast reductions, but not for the ordered ones. At the moment with your code we will just bail out in this case.

Update intermediate store check for ordered fadd vectorization

igor.kirillov added inline comments.Tue, Nov 9, 7:24 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
964

You are right! It can be done much simpler and the check is needed only when ordered reduction is present.

david-arm added inline comments.Wed, Nov 10, 5:40 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
902

Hi @igor.kirillov, I think you're missing a test case here. Suppose we have two stores in the loop to the same invariant address. If the first store is predicated, but the second isn't we should still vectorise because the second store wins and the first one can be removed. At least the code here suggests that. I couldn't find any test that exercised this code path.

910

I think you might be missing a test case for this. Can you make sure this code path is exercised please by your existing tests please?

llvm/test/Transforms/LoopVectorize/reduction.ll
477

Can you put all the CHECK lines in the same place near the top of the function please to be consistent with the existing tests?

517

Hi @igor.kirillov, this doesn't look right. The test is called @reduc_store_fadd_fast, but there is not fast keyword on the fadd instruction. I don't think we should even be vectorising this because it requires ordered reductions, which you haven't enabled for this test. Can you change the name of this to @reduc_store_fadd_ordered and investigate why we are vectorising this?

Can you also add a separate test called @reduc_store_fadd_fast that actually has the fast keyword too?

igor.kirillov marked 2 inline comments as done and an inline comment as not done.Thu, Nov 18, 3:06 AM
igor.kirillov added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
902

Added reduc_store_final_store_predicated test that address both requests

910

It is now covered by new reduc_store_final_store_predicated test and an old test reduc_store_inside_unrolled also executes this path

llvm/test/Transforms/LoopVectorize/reduction.ll
517

I missed the fast keyword there. As for why this code gets vectorized - it happens because of Hints->allowReordering() returning true in LoopVectorizationLegality::canVectorizeFPMath. As you can see the test specifies vector width (-force-vector-width=4) and llvm allows to process ordered instructions in unordered manner (see also LoopVectorizeHints::allowReordering function) in that case.

I added a new test with fast keyword to AArch64/strict-fadd.ll and the loop is not vectorized there.

david-arm added inline comments.Thu, Nov 18, 3:10 AM
llvm/test/Transforms/LoopVectorize/reduction.ll
656

Can you also add a test where the first store is predicated and the second one isn't? According to the code changes in this patch we should vectorise this case because the second one overrides the first.

Add test

llvm/test/Transforms/LoopVectorize/reduction.ll
656

Added! See reduc_store_middle_store_predicated

fhahn added inline comments.Thu, Nov 18, 3:44 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

The case I was thinking about was something like the snippet below, where we have a load of the invariant address in the loop (%lv = load... in the example below).

define void @reduc_store(i32* %dst, i32* readonly %src, i32* noalias %dst.2) {
entry:
  %arrayidx = getelementptr inbounds i32, i32* %dst, i64 42
  store i32 0, i32* %arrayidx, align 4
  br label %for.body
for.body:
  %0 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx1 = getelementptr inbounds i32, i32* %src, i64 %indvars.iv
  %1 = load i32, i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %lv = load i32, i32* %arrayidx
  store i32 %add, i32* %arrayidx, align 4
  
  %gep.dst.2  = getelementptr inbounds i32, i32* %dst.2, i64 %indvars.iv
  store i32 %lv, i32* %gep.dst.2,

  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup:
  ret void
}
llvm/test/Transforms/LoopVectorize/reduction.ll
468

Could you add a brief textual explanation of what the test covers?

482

perhaps rename to %sum or something like that, to make it a bit easier to read the test?

487

The names for the 2 different GEPs in the tests are very similar. Could you rename them to make it easier to distinguish them (e.g. something like %gep.src/%gep.dst)..

Hi @igor.kirillov, thanks for all the changes and the patch looks good now! I just had a couple of minor comments.

llvm/lib/Analysis/IVDescriptors.cpp
454

nit: Could you remove an extra level of indentation here before merging the patch?

i.e. this can be written like this:

if (isa<PHINode>(UI))
  PHIs.push_back(UI);
else if (auto *SI = dyn_cast<StoreInst>(UI)) {
  ...
} else
  NonPHIs.push_back(UI);
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
973

nit: I think you can you might be able to simplify the code here by removing FoundMatchingRecurrence . Then, lower down instead of the break you can just do

if (DSI && (DSI == SI)) {
  *IsPredicated = blockNeedsPredication(DSI->getParent());
  return true;
}

and at the bottom of the function just do:

return false;
igor.kirillov marked an inline comment as done.

Refactoring a bit

igor.kirillov marked 8 inline comments as done.Thu, Nov 18, 7:22 AM
igor.kirillov added inline comments.
llvm/lib/Analysis/IVDescriptors.cpp
454

I made an extra step and simplified it even more

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

In that case we do not vectorize, the rejection happens in LoopAccessInfo::analyzeLoop when loads ape processed:

  for (LoadInst *LD : Loads) {
    Value *Ptr = LD->getPointerOperand();
...
    // See if there is an unsafe dependency between a load to a uniform address and
    // store to the same uniform address.
    if (UniformStores.count(Ptr)) {
      LLVM_DEBUG(dbgs() << "LAA: Found an unsafe dependency between a uniform "
                           "load and uniform store to the same address!\n");
      HasDependenceInvolvingLoopInvariantAddress = true;
    }
...

I added reduc_store_load test anyway

igor.kirillov marked an inline comment as done.

Add scalable reduction test

Little extension in a new test

david-arm accepted this revision.Tue, Nov 23, 2:27 AM

LGTM! Thanks for making all the changes @igor.kirillov.

llvm/test/Transforms/LoopVectorize/reduction.ll
466

nit: I think this should be invariant

This revision is now accepted and ready to land.Tue, Nov 23, 2:27 AM
igor.kirillov marked an inline comment as done.Tue, Nov 23, 2:48 AM
fhahn added inline comments.Tue, Nov 23, 2:48 AM
llvm/include/llvm/Analysis/IVDescriptors.h
245

nit: should be a doc-comment?

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
583

nit: could return ArrayRef, not leaking any details about the underlying container.

642

nit: Is there a reason for choosing 5 for the size? if not, nowadays SmallVector can pick a good size automatically.

llvm/include/llvm/Analysis/ScalarEvolution.h
1120 ↗(On Diff #388906)

Do you anticipate this to be used outside Transforms/Vectorize? If not I'm not sure if it makes sense to live here, and extending the API interface of ScalarEvolution. Can this instead be defined somewhere in Transforms/Vectorize?

llvm/lib/Analysis/IVDescriptors.cpp
238

needs documentation?

301

I guess it would be good to have at least a test case for loop-interchange to make sure it can deal with the change properly?

488

nit: for consistency with other code avoid using != nullptr. This is more in line with the style use in LLVM in general (and you use ==/!= nullptr in the adjacent code here).

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

Yeah but this is only due to some limitations that currently existing in LAA, right? I think we should at least make this clear somewhere, e.g. in a comment.

906

nit: no need to use llvm::

976

nit: redundant ().

977

could we instead just do the check at the call site or is there a benefit of doing it here?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3052

We effectively sink the store outside the loop. In that case, I don't think we should create a recipe and also we should not consider its cost.

mgabka added a subscriber: mgabka.Mon, Nov 29, 1:56 AM
igor.kirillov marked 9 inline comments as done.

Lots of updates related to the recent comments

igor.kirillov marked 5 inline comments as done.Mon, Nov 29, 11:09 AM
igor.kirillov added inline comments.
llvm/lib/Analysis/IVDescriptors.cpp
301

You are right. We actually should not do loop interchange is an invariant address is present. Otherwise it may introduce incorrect result.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
900

Added comment

igor.kirillov marked 2 inline comments as done.Mon, Nov 29, 11:09 AM