Adds ability to vectorize loops containing a store to a loop-invariant address as part of a reduction that isn't converted to SSA form due to lack of aliasing info. Runtime checks are generated to ensure the store does not alias any other accesses in the loop.
Ordered fadd reductions are not yet supported.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
941 | What about loads to the same address in the loop? At the moment, LAA cannot analyze dependences with invariant addresses. But if this limitation gets removed the code here may become incorrect, because it relies on this limitation IIUC? |
llvm/include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
643 | I assume here that InvariantStores refers to stores to an invariant (i.e. uniform?) address, rather than storing an invariant value to variant address. If so, perhaps it could be named StoresToInvariantAddress or something like that? |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
---|---|---|
471 ↗ | (On Diff #378503) | Is it possible to add a simple floating point test with "fadd fast"? |
It seems like the main problem is that we potentially bail out too early at the moment when checking for reductions due to the store, but once we generate runtime checks, sinking the store may become legal (see inline comment about loads to the same address)? If that's the case, ideally we'd just sink any such loads/stores before detecting reductions once we know they can be sunk due to runtime checks, but unfortunately I do not think that's possible with the current structure/ordering.
Is it worth for me to explore if it possible to do other way around or we should work on with the solution from this merge?
llvm/include/llvm/Analysis/LoopAccessAnalysis.h | ||
---|---|---|
643 | @fhah Just to make sure we are on the same line - for me uniform and invariant are just synonyms, isn't it so? | |
llvm/include/llvm/Transforms/Utils/LoopUtils.h | ||
402 ↗ | (On Diff #378503) | What do you think if I make this function a public member of class ScalarEvolution? Or is there a better place for it? |
llvm/lib/Analysis/IVDescriptors.cpp | ||
322 | I checked and only LoopInterchangeLegality is using this function and it is not affected by stores. Anyway, I can add a parameter to RecurrenceDescriptor::isReductionPHI or a member to RecurrenceDescriptor allowing or not to handle stores. What do you think about it? The comment is to be updated, yes. | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
941 | Loads from or stores to the same address in the loop? I'm sorry could you clarify what the problem is. As it is I don't understand the message. |
Hi @igor.kirillov, is it also possible to get this working for ordered reductions, i.e.
float sum = 0; for(i=0..N) { sum += src[i]; dst[42] = sum; }
when building with -O3? I think it might mean updating checkOrderedReductions to look through the store. If it looks too difficult to do as part of this patch we can always follow-up with a patch later.
Update commit message and comments
Move storeToSameAddress to ScalarEvolution
Add fadd fast test
Do not apply patch for ordered fadd reductions
Yes, I added a check to LoopVectorizationLegality::canVectorizeFPMath so as not to allow this reduction when math is strict. Enabling it requires some work and it is better to do it separately.
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
---|---|---|
471 ↗ | (On Diff #378503) | Added! see reduc_store_fadd_fast function |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1011 | Maybe this can be folded into the all_of case below, i.e. return (all_of(getReductionVars(), [&](auto &Reduction) -> bool { const RecurrenceDescriptor &RdxDesc = Reduction.second; return !RdxDesc.hasExactFPMath() || (RdxDesc.isOrdered() && !RdxDesc.IntermediateStore); })); Also, the problem with this code at the moment is that you could have a mixture of fast and ordered reductions in the same loop. There could be an intermediate store for one of the fast reductions, but not for the ordered ones. At the moment with your code we will just bail out in this case. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
1011 | You are right! It can be done much simpler and the check is needed only when ordered reduction is present. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
943 | Hi @igor.kirillov, I think you're missing a test case here. Suppose we have two stores in the loop to the same invariant address. If the first store is predicated, but the second isn't we should still vectorise because the second store wins and the first one can be removed. At least the code here suggests that. I couldn't find any test that exercised this code path. | |
951 | I think you might be missing a test case for this. Can you make sure this code path is exercised please by your existing tests please? | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
477 ↗ | (On Diff #385810) | Can you put all the CHECK lines in the same place near the top of the function please to be consistent with the existing tests? |
517 ↗ | (On Diff #385810) | Hi @igor.kirillov, this doesn't look right. The test is called @reduc_store_fadd_fast, but there is not fast keyword on the fadd instruction. I don't think we should even be vectorising this because it requires ordered reductions, which you haven't enabled for this test. Can you change the name of this to @reduc_store_fadd_ordered and investigate why we are vectorising this? Can you also add a separate test called @reduc_store_fadd_fast that actually has the fast keyword too? |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
943 | Added reduc_store_final_store_predicated test that address both requests | |
951 | It is now covered by new reduc_store_final_store_predicated test and an old test reduc_store_inside_unrolled also executes this path | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
517 ↗ | (On Diff #385810) | I missed the fast keyword there. As for why this code gets vectorized - it happens because of Hints->allowReordering() returning true in LoopVectorizationLegality::canVectorizeFPMath. As you can see the test specifies vector width (-force-vector-width=4) and llvm allows to process ordered instructions in unordered manner (see also LoopVectorizeHints::allowReordering function) in that case. I added a new test with fast keyword to AArch64/strict-fadd.ll and the loop is not vectorized there. |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
---|---|---|
656 ↗ | (On Diff #388126) | Can you also add a test where the first store is predicated and the second one isn't? According to the code changes in this patch we should vectorise this case because the second one overrides the first. |
Add test
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
---|---|---|
656 ↗ | (On Diff #388126) | Added! See reduc_store_middle_store_predicated |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
941 | The case I was thinking about was something like the snippet below, where we have a load of the invariant address in the loop (%lv = load... in the example below). define void @reduc_store(i32* %dst, i32* readonly %src, i32* noalias %dst.2) { entry: %arrayidx = getelementptr inbounds i32, i32* %dst, i64 42 store i32 0, i32* %arrayidx, align 4 br label %for.body for.body: %0 = phi i32 [ 0, %entry ], [ %add, %for.body ] %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ] %arrayidx1 = getelementptr inbounds i32, i32* %src, i64 %indvars.iv %1 = load i32, i32* %arrayidx1, align 4 %add = add nsw i32 %0, %1 %lv = load i32, i32* %arrayidx store i32 %add, i32* %arrayidx, align 4 %gep.dst.2 = getelementptr inbounds i32, i32* %dst.2, i64 %indvars.iv store i32 %lv, i32* %gep.dst.2, %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 %exitcond = icmp eq i64 %indvars.iv.next, 1000 br i1 %exitcond, label %for.cond.cleanup, label %for.body for.cond.cleanup: ret void } | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
468 ↗ | (On Diff #388148) | Could you add a brief textual explanation of what the test covers? |
482 ↗ | (On Diff #388148) | perhaps rename to %sum or something like that, to make it a bit easier to read the test? |
487 ↗ | (On Diff #388148) | The names for the 2 different GEPs in the tests are very similar. Could you rename them to make it easier to distinguish them (e.g. something like %gep.src/%gep.dst).. |
Hi @igor.kirillov, thanks for all the changes and the patch looks good now! I just had a couple of minor comments.
llvm/lib/Analysis/IVDescriptors.cpp | ||
---|---|---|
501 | nit: Could you remove an extra level of indentation here before merging the patch? i.e. this can be written like this: if (isa<PHINode>(UI)) PHIs.push_back(UI); else if (auto *SI = dyn_cast<StoreInst>(UI)) { ... } else NonPHIs.push_back(UI); | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
1021 | nit: I think you can you might be able to simplify the code here by removing FoundMatchingRecurrence . Then, lower down instead of the break you can just do if (DSI && (DSI == SI)) { *IsPredicated = blockNeedsPredication(DSI->getParent()); return true; } and at the bottom of the function just do: return false; |
llvm/lib/Analysis/IVDescriptors.cpp | ||
---|---|---|
501 | I made an extra step and simplified it even more | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
941 | In that case we do not vectorize, the rejection happens in LoopAccessInfo::analyzeLoop when loads ape processed: for (LoadInst *LD : Loads) { Value *Ptr = LD->getPointerOperand(); ... // See if there is an unsafe dependency between a load to a uniform address and // store to the same uniform address. if (UniformStores.count(Ptr)) { LLVM_DEBUG(dbgs() << "LAA: Found an unsafe dependency between a uniform " "load and uniform store to the same address!\n"); HasDependenceInvolvingLoopInvariantAddress = true; } ... I added reduc_store_load test anyway |
LGTM! Thanks for making all the changes @igor.kirillov.
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
---|---|---|
466 ↗ | (On Diff #388188) | nit: I think this should be invariant |
llvm/include/llvm/Analysis/IVDescriptors.h | ||
---|---|---|
268 | nit: should be a doc-comment? | |
llvm/include/llvm/Analysis/LoopAccessAnalysis.h | ||
579 | nit: could return ArrayRef, not leaking any details about the underlying container. | |
643 | nit: Is there a reason for choosing 5 for the size? if not, nowadays SmallVector can pick a good size automatically. | |
llvm/include/llvm/Analysis/ScalarEvolution.h | ||
1120 ↗ | (On Diff #388906) | Do you anticipate this to be used outside Transforms/Vectorize? If not I'm not sure if it makes sense to live here, and extending the API interface of ScalarEvolution. Can this instead be defined somewhere in Transforms/Vectorize? |
llvm/lib/Analysis/IVDescriptors.cpp | ||
250 | needs documentation? | |
322 | I guess it would be good to have at least a test case for loop-interchange to make sure it can deal with the change properly? | |
530 | nit: for consistency with other code avoid using != nullptr. This is more in line with the style use in LLVM in general (and you use ==/!= nullptr in the adjacent code here). | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
941 | Yeah but this is only due to some limitations that currently existing in LAA, right? I think we should at least make this clear somewhere, e.g. in a comment. | |
947 | nit: no need to use llvm:: | |
1024 | nit: redundant (). | |
1025 | could we instead just do the check at the call site or is there a benefit of doing it here? | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2788 | We effectively sink the store outside the loop. In that case, I don't think we should create a recipe and also we should not consider its cost. |
Hi, @fhahn! Since there have been several quite serious changes in the patch after your last review, I would be happy to receive you approval before merge (even though status is accepted now).
Thanks for the latest update! I left some more additional comments. It might be good to pre-commit the tests and only include the diff in the patch. Another thing to consider is moving the reduction-store tests to a separate file, as reduction.ll is already quite large.
One thing to note on the overall approach is that it's a bit of a workaround some current limitations in our modeling, but I don't think there's an alternative in the short term and this is clearly a desirable case to support, so that seems fine to me in general. There's work in progress to model the pre-header and exit block in VPlan, which allows us to do the sinking of the store more easily, so we can improve things further then :)
llvm/include/llvm/Analysis/IVDescriptors.h | ||
---|---|---|
175–176 | From an interface perspective, I think it would be better to make SE an optional argument and only perform the the store analysis when it is passed. This would also require users to explicitly opt-in, which would at least partially guarding against people ignoring IntermediateStore. | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
8978 | either deleted or go outside the loop sounds a bit unclear. Aren't they moved to the vector exit block and store the final reduction value? | |
8980 | I think it would also be good to include at least some information in the VPlan that the store is handled as part of the reduction. Perhaps VPReductionRecipe should print if the result is stored after the loop? Please add a test case to vplan-printing.ll | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
469 ↗ | (On Diff #390414) | FWIW for such compact IR test cases, the pseudo code doesn't add much value in my personal opinion. Better to strive to make the tests as readable/compact as possible and have a comment explaining what it tests when needed. |
546 ↗ | (On Diff #390414) | nit: newline. |
628 ↗ | (On Diff #390414) | Here (and for the other tests it would be good to check at least the vector reduction sequence and that the correct value is stored. |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
936 | It would probably be good to make clear what is handled here exactly and why we can handle those stores. IIUC this applies only to invariant stores that store reduction results and is safe because runtime checks guarantee that it won't alias with other objects. The store won't get vectorized, but sunk to the exit block during codegen. | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
582 ↗ | (On Diff #390414) | not needed? |
585 ↗ | (On Diff #390414) | nit: is this needed? Can just pass %n as i64 |
672 ↗ | (On Diff #390414) | move exit block to end of function? |
711 ↗ | (On Diff #390414) | move exit block to end of function? |
753 ↗ | (On Diff #390414) | move exit block to end of function? |
803 ↗ | (On Diff #390414) | move exit block to end of function? |
Not sure if you saw, there's a somewhat related bugreport: https://github.com/llvm/llvm-project/issues/50286
Not sure if this already supports that pattern though.
Update tests, move invariant store tests into separate file, add more comments
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
8980 | I'm not sure how to display this information properly. This is how output of VReductionRecipe::print looks like now: REDUCE ir<%red.next> = ir<%red> + fast reduce.fadd (ir<%lv>) I could add something to the end but it doesn't seem to fit there. Also I have not found any proper V.*Recipe where I could place something like DELETE store .*. | |
llvm/test/Transforms/LoopVectorize/reduction.ll | ||
469 ↗ | (On Diff #390414) | When I look at pseudo-code I immediately understand what it is about, whereas looking at IR takes at least 30 seconds of cognitive exertions. And it is also easier to see the difference between and purpose of all those quite similar tests. But I delete it, of course, if you insist :) |
@fhahn I created the review with tests only here - https://reviews.llvm.org/D117213. Once it is merged I'll update this one.
@lebedev.ri This patch addresses a different problem. Here we try to handle when a reduction value is stored in some address and if this address is invariant plus other lucky conditions are satisfied we manage to vectorize. In your example address where value is stored is not invariant at all. Nevertheless, the case is interesting.
Hi @igor.kirillov, the patch looks good now! I just had one question about an odd-looking CHECK line in one of the test files.
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
---|---|---|
455 | nit: I think it's fine (and cleaner) to just do %[[PHI:%.*]] = %[[ADDR:%.*]] = here. Also, can you add the incoming value to the phi just to make sure it's the correct value? I think it should just be phi i32 {{ [[TMP]], %middle.block }} | |
483 | This looks a bit odd. I'm not sure why it's needed? If there is a CHECK-LABEL for every function I don't think this should happen. |
Thanks for the update!
llvm/lib/Analysis/IVDescriptors.cpp | ||
---|---|---|
365 | Do we have to make sure that the stored value is feeding the phi again and not an earlier value in the cycle? | |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
179 | I think it would be good to check the full reduction cycle & store here. I don't think this test case is handled correctly at the moment as the IR seems out of sync with the pseudo code (which is why personally I think the C pseudo code is a bit distracting) Note that in the IR the reduction cycle is %sum -> sum.1 -> %sum and not %sum -> %sum.1 -> %sum.2 -> %sum. When this gets vectorized, the final value written to dst[42] is the final value of %sum.1, not %sum.2 as it should be I think. |
Add fix that prevents vectorization for the case when not a final reduction value stored in a loop invariant address.
Add test for this case.
Remove some unused code and rename a couple of IR variables in the reduction-with-invariant-store.ll test.
Fix incorrect phi node in reduc_store_inside_unrolled and reduc_double_invariant_store tests.
llvm/lib/Analysis/IVDescriptors.cpp | ||
---|---|---|
365 | Yes! I added this check and also a test case (reduc_store_not_final_value) | |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
179 | The IR was incorrect actually because of the error I made during renaming and the pseudo-code was showing my original plot for the test-case. Luckily it helped to expose a bug, so I added an extra check to IVDescriptors.cpp and now reduc_store_not_final_value function is testing this case (when we have IntermediateStore but no ExitInstruction and value stored is not actually a final value). |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
---|---|---|
179 | I think it would still be good to test that the full reduction cycle is generated correctly (probably worth checking the full vector body + middle.block), at least for some of the tests in the file, to make sure the full sequence is generated correctly. |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
---|---|---|
179 |
This seems to indicate that the pseudo code may be distracting (: |
Add full vector.body and middle.block checks for reduc_store and reduc_store_inside_unrolled tests
Thanks for the update with the tests! I think there might still be a correctness issue. More details inline.
llvm/include/llvm/Analysis/IVDescriptors.h | ||
---|---|---|
177 | nit: addresses ? | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
965 | I'm not sure if the comment matches the code. There's no guarantee that all stores to the same address come before the store of the reduction AFAICT. E.g. the test below has a store to the same address after the store of the reduction. I think this may get mis-compiled, as the store to 0 gets replaced with the store of the final value of the reduction. define void @reduc_store(i32* %dst, i32* readonly %src) { entry: %gep.dst = getelementptr inbounds i32, i32* %dst, i64 42 store i32 1, i32* %gep.dst, align 4 br label %for.body for.body: %sum = phi i32 [ 0, %entry ], [ %add, %for.body ] %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ] %gep.src = getelementptr inbounds i32, i32* %src, i64 %iv %0 = load i32, i32* %gep.src, align 4 %add = add nsw i32 %sum, %0 store i32 %add, i32* %gep.dst, align 4 store i32 0, i32* %gep.dst, align 4 %iv.next = add nuw nsw i64 %iv, 1 %exitcond = icmp eq i64 %iv.next, 1000 br i1 %exitcond, label %exit, label %for.body exit: ret void } | |
1023 | nit: can drop DSI, as SI is guaranteed to be non-null? |
Fix incorrectly processed case when reduction value stored in invariant could be overwritten inside loop
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
965 | Yeah, it was processed incorrectly. I added this test along with the fix. |
Thanks for the update! I'll do a bit more testing and will take another look soon, but I *think* it should be good now.
@fhahn, @david-arm ping. Also I run lnt test-suite and everything is fine and 17 more loops were vectorized according to -stats
LGTM! Thanks for addressing all the comments and fixing bugs, adding new tests, etc. I have a couple of nits you can address before merging. I think given that you've run the LLVM test suite without seeing failures and we know more loops are vectorising then that gives us a good level of confidence. :)
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
---|---|---|
256 | nit: Should be i++, not i+=2. | |
303 | nit: Looks like an unnecessary change from %sum.1 -> %sum.2. |
llvm/include/llvm/Analysis/IVDescriptors.h | ||
---|---|---|
273 | nit: would be good to also document what this means, what the properties of such stores are. | |
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h | ||
311 | nit: could be more specific, the increment of the reduction is used as stored operand, right? Also, they only handle reductions right? Then using Reduction instead of Recurrence seems more descriptive, .e.g. isInvariantStoreOfReduction? (same for isRecurringInvariantAddress | |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
453 | With opaque pointers, this code may behave differently. It's possible to have store i32 0, ptr %x store i8 0, ptr %x In that case the pointer operands will be the same, but different store widths. I *think* we should also check that the types of the stored values match for now, as we use this to remove earlier stores. This should only be correct if the later store writes at least as many bits as the earlier stores. | |
1020 | nit: could be any_of | |
1030 | nit: could be any_of. | |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
412 | Would it be possible to add runtime tests for the mis-compiles fixed to https://github.com/llvm/llvm-test-suite/tree/main/SingleSource/UnitTests/Vectorizer? |
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | ||
---|---|---|
453 | I think if this check is added here, then the purpose of the function would be different (address is still the same even if values have different size). So, instead of that I added a check to a place where those pointers are really processed - see LoopVectorizationLegality.cpp:975. There we now make sure that values stored in an invariant address are of the same type. | |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
303 | sum.1 is sum + src[i] and sum.2 is sum + src[i] + src[i+1]. Looks like everything is fine for me. |
llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll | ||
---|---|---|
412 | Added a simple test here - https://reviews.llvm.org/D124609 |
Hello,
I wrote
https://github.com/llvm/llvm-project/issues/57572
about a verifier complaint/crash that started happening with this patch.
Drive by note: This new include does not look necessary.