Page MenuHomePhabricator

[LoopVectorize] Support reductions that store intermediary result

Authored by igor.kirillov on Sep 22 2021, 5:51 AM.



Adds ability to vectorize loops containing a store to a loop-invariant address as part of a reduction that isn't converted to SSA form due to lack of aliasing info. Runtime checks are generated to ensure the store does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Update intermediate store check for ordered fadd vectorization

igor.kirillov added inline comments.Nov 9 2021, 7:24 AM

You are right! It can be done much simpler and the check is needed only when ordered reduction is present.

david-arm added inline comments.Nov 10 2021, 5:40 AM

Hi @igor.kirillov, I think you're missing a test case here. Suppose we have two stores in the loop to the same invariant address. If the first store is predicated, but the second isn't we should still vectorise because the second store wins and the first one can be removed. At least the code here suggests that. I couldn't find any test that exercised this code path.


I think you might be missing a test case for this. Can you make sure this code path is exercised please by your existing tests please?


Can you put all the CHECK lines in the same place near the top of the function please to be consistent with the existing tests?


Hi @igor.kirillov, this doesn't look right. The test is called @reduc_store_fadd_fast, but there is not fast keyword on the fadd instruction. I don't think we should even be vectorising this because it requires ordered reductions, which you haven't enabled for this test. Can you change the name of this to @reduc_store_fadd_ordered and investigate why we are vectorising this?

Can you also add a separate test called @reduc_store_fadd_fast that actually has the fast keyword too?

igor.kirillov marked 2 inline comments as done and an inline comment as not done.Nov 18 2021, 3:06 AM
igor.kirillov added inline comments.

Added reduc_store_final_store_predicated test that address both requests


It is now covered by new reduc_store_final_store_predicated test and an old test reduc_store_inside_unrolled also executes this path


I missed the fast keyword there. As for why this code gets vectorized - it happens because of Hints->allowReordering() returning true in LoopVectorizationLegality::canVectorizeFPMath. As you can see the test specifies vector width (-force-vector-width=4) and llvm allows to process ordered instructions in unordered manner (see also LoopVectorizeHints::allowReordering function) in that case.

I added a new test with fast keyword to AArch64/strict-fadd.ll and the loop is not vectorized there.

david-arm added inline comments.Nov 18 2021, 3:10 AM

Can you also add a test where the first store is predicated and the second one isn't? According to the code changes in this patch we should vectorise this case because the second one overrides the first.

Add test


Added! See reduc_store_middle_store_predicated

fhahn added inline comments.Nov 18 2021, 3:44 AM

The case I was thinking about was something like the snippet below, where we have a load of the invariant address in the loop (%lv = load... in the example below).

define void @reduc_store(i32* %dst, i32* readonly %src, i32* noalias %dst.2) {
  %arrayidx = getelementptr inbounds i32, i32* %dst, i64 42
  store i32 0, i32* %arrayidx, align 4
  br label %for.body
  %0 = phi i32 [ 0, %entry ], [ %add, %for.body ]
  %indvars.iv = phi i64 [ 0, %entry ], [, %for.body ]
  %arrayidx1 = getelementptr inbounds i32, i32* %src, i64 %indvars.iv
  %1 = load i32, i32* %arrayidx1, align 4
  %add = add nsw i32 %0, %1
  %lv = load i32, i32* %arrayidx
  store i32 %add, i32* %arrayidx, align 4
  %gep.dst.2  = getelementptr inbounds i32, i32* %dst.2, i64 %indvars.iv
  store i32 %lv, i32* %gep.dst.2, = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64, 1000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body

  ret void

Could you add a brief textual explanation of what the test covers?


perhaps rename to %sum or something like that, to make it a bit easier to read the test?


The names for the 2 different GEPs in the tests are very similar. Could you rename them to make it easier to distinguish them (e.g. something like %gep.src/%gep.dst)..

Hi @igor.kirillov, thanks for all the changes and the patch looks good now! I just had a couple of minor comments.


nit: Could you remove an extra level of indentation here before merging the patch?

i.e. this can be written like this:

if (isa<PHINode>(UI))
else if (auto *SI = dyn_cast<StoreInst>(UI)) {
} else

nit: I think you can you might be able to simplify the code here by removing FoundMatchingRecurrence . Then, lower down instead of the break you can just do

if (DSI && (DSI == SI)) {
  *IsPredicated = blockNeedsPredication(DSI->getParent());
  return true;

and at the bottom of the function just do:

return false;
igor.kirillov marked an inline comment as done.

Refactoring a bit

igor.kirillov marked 8 inline comments as done.Nov 18 2021, 7:22 AM
igor.kirillov added inline comments.

I made an extra step and simplified it even more


In that case we do not vectorize, the rejection happens in LoopAccessInfo::analyzeLoop when loads ape processed:

  for (LoadInst *LD : Loads) {
    Value *Ptr = LD->getPointerOperand();
    // See if there is an unsafe dependency between a load to a uniform address and
    // store to the same uniform address.
    if (UniformStores.count(Ptr)) {
      LLVM_DEBUG(dbgs() << "LAA: Found an unsafe dependency between a uniform "
                           "load and uniform store to the same address!\n");
      HasDependenceInvolvingLoopInvariantAddress = true;

I added reduc_store_load test anyway

igor.kirillov marked an inline comment as done.

Add scalable reduction test

Little extension in a new test

david-arm accepted this revision.Nov 23 2021, 2:27 AM

LGTM! Thanks for making all the changes @igor.kirillov.


nit: I think this should be invariant

This revision is now accepted and ready to land.Nov 23 2021, 2:27 AM
igor.kirillov marked an inline comment as done.Nov 23 2021, 2:48 AM
fhahn added inline comments.Nov 23 2021, 2:48 AM

nit: should be a doc-comment?


nit: could return ArrayRef, not leaking any details about the underlying container.


nit: Is there a reason for choosing 5 for the size? if not, nowadays SmallVector can pick a good size automatically.


Do you anticipate this to be used outside Transforms/Vectorize? If not I'm not sure if it makes sense to live here, and extending the API interface of ScalarEvolution. Can this instead be defined somewhere in Transforms/Vectorize?


needs documentation?


I guess it would be good to have at least a test case for loop-interchange to make sure it can deal with the change properly?


nit: for consistency with other code avoid using != nullptr. This is more in line with the style use in LLVM in general (and you use ==/!= nullptr in the adjacent code here).


Yeah but this is only due to some limitations that currently existing in LAA, right? I think we should at least make this clear somewhere, e.g. in a comment.


nit: no need to use llvm::


nit: redundant ().


could we instead just do the check at the call site or is there a benefit of doing it here?


We effectively sink the store outside the loop. In that case, I don't think we should create a recipe and also we should not consider its cost.

mgabka added a subscriber: mgabka.Nov 29 2021, 1:56 AM
igor.kirillov marked 9 inline comments as done.

Lots of updates related to the recent comments

igor.kirillov marked 5 inline comments as done.Nov 29 2021, 11:09 AM
igor.kirillov added inline comments.

You are right. We actually should not do loop interchange is an invariant address is present. Otherwise it may introduce incorrect result.


Added comment

igor.kirillov marked 2 inline comments as done.Nov 29 2021, 11:09 AM

Hi, @fhahn! Since there have been several quite serious changes in the patch after your last review, I would be happy to receive you approval before merge (even though status is accepted now).

fhahn added a comment.Dec 10 2021, 1:34 AM

Hi, @fhahn! Since there have been several quite serious changes in the patch after your last review, I would be happy to receive you approval before merge (even though status is accepted now).

sure, I'll try to take another look by end-of-day Monday

fhahn added a comment.Wed, Jan 5, 1:53 AM

Thanks for the latest update! I left some more additional comments. It might be good to pre-commit the tests and only include the diff in the patch. Another thing to consider is moving the reduction-store tests to a separate file, as reduction.ll is already quite large.

One thing to note on the overall approach is that it's a bit of a workaround some current limitations in our modeling, but I don't think there's an alternative in the short term and this is clearly a desirable case to support, so that seems fine to me in general. There's work in progress to model the pre-header and exit block in VPlan, which allows us to do the sinking of the store more easily, so we can improve things further then :)


From an interface perspective, I think it would be better to make SE an optional argument and only perform the the store analysis when it is passed. This would also require users to explicitly opt-in, which would at least partially guarding against people ignoring IntermediateStore.


It would probably be good to make clear what is handled here exactly and why we can handle those stores. IIUC this applies only to invariant stores that store reduction results and is safe because runtime checks guarantee that it won't alias with other objects. The store won't get vectorized, but sunk to the exit block during codegen.


either deleted or go outside the loop sounds a bit unclear. Aren't they moved to the vector exit block and store the final reduction value?


I think it would also be good to include at least some information in the VPlan that the store is handled as part of the reduction. Perhaps VPReductionRecipe should print if the result is stored after the loop? Please add a test case to vplan-printing.ll


FWIW for such compact IR test cases, the pseudo code doesn't add much value in my personal opinion. Better to strive to make the tests as readable/compact as possible and have a comment explaining what it tests when needed.


nit: newline.


not needed?


nit: is this needed? Can just pass %n as i64


Here (and for the other tests it would be good to check at least the vector reduction sequence and that the correct value is stored.


move exit block to end of function?


move exit block to end of function?


move exit block to end of function?


move exit block to end of function?

Not sure if you saw, there's a somewhat related bugreport:
Not sure if this already supports that pattern though.

igor.kirillov marked 11 inline comments as done.

Update tests, move invariant store tests into separate file, add more comments


I'm not sure how to display this information properly. This is how output of VReductionRecipe::print looks like now:

REDUCE ir<> = ir<%red> + fast reduce.fadd (ir<%lv>)

I could add something to the end but it doesn't seem to fit there. Also I have not found any proper V.*Recipe where I could place something like DELETE store .*.


When I look at pseudo-code I immediately understand what it is about, whereas looking at IR takes at least 30 seconds of cognitive exertions. And it is also easier to see the difference between and purpose of all those quite similar tests. But I delete it, of course, if you insist :)

@fhahn I created the review with tests only here - Once it is merged I'll update this one.

@lebedev.ri This patch addresses a different problem. Here we try to handle when a reduction value is stored in some address and if this address is invariant plus other lucky conditions are satisfied we manage to vectorize. In your example address where value is stored is not invariant at all. Nevertheless, the case is interesting.