Introduce a new VPWidenPrimaryInductionRecipe to generate a vector primary
induction for use in fold-tail-with-masking when a scalar primary induction is
missing.
Follows approach (1) discussed in D76838.
Differential D77635
[LV] Vectorize with FoldTail when Primary Induction is absent Ayal on Apr 7 2020, 2:27 AM. Authored by
Details Introduce a new VPWidenPrimaryInductionRecipe to generate a vector primary Follows approach (1) discussed in D76838.
Diff Detail
Unit Tests Event TimelineComment Actions @SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed? Comment Actions Thanks for the patch, Ayal! I've put up D77577 yesterday to allow mapping the primary IV used to a different IR value during codegen as alternative, but I think adding the recipe is more straight forward in the end, as there is a single place we need to use the primary IV.
Comment Actions I was also drafting a patch to implement this yesterday, and it was pretty much this! So I guess that's a good sign. :-)
The purpose of this test was to catch a regression that we were seeing when tail-predication was rejected, but then incorrectly vectorisation also wasn't happening. Comment Actions I've applied the patch locally, and I'm a bit confused that test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll doesn't fail for me, I will double check to see what's going on.
Comment Actions Sorry for the confusion, was referring to the test under test/Transforms/LoopVectorize/ rather than the one under test/Transforms/LoopVectorize/ARM OK, modified test to force vectorization with VF=4. (Relevant for picking up later:) Note that llvm/test/Transforms/LoopVectorize/X86/small-size.ll also checks that a loop with reverse iv gets vectorized with fold-tail.
Comment Actions Address comments, use CanonicalIV instead of overloading the original PrimaryInduction term, update some comments. Comment Actions Thanks for the patch, and I think CanonicalIV is an improvement. This LGTM too.
Comment Actions Hello @Ayal, unfortunately this patch causes the functional regression. The reproducer: ; ModuleID = './repro.ll' source_filename = "./repro.ll" target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2" target triple = "x86_64-unknown-linux-gnu" @global = external global i8* define void @hoge(i8* nonnull align 8 dereferenceable_or_null(8) %arg, i8* align 8 dereferenceable_or_null(16) %arg1) { bb: %tmp = load atomic i8*, i8** @global unordered, align 8 %tmp2 = getelementptr inbounds i8, i8* %tmp, i64 852 br label %bb3 bb3: ; preds = %bb12, %bb %tmp4 = phi i32 [ 1, %bb ], [ %tmp15, %bb12 ] %tmp5 = phi i32 [ 0, %bb ], [ %tmp8, %bb12 ] br label %bb7 bb6: ; preds = %bb12 ret void bb7: ; preds = %bb7, %bb3 %tmp8 = phi i32 [ %tmp5, %bb3 ], [ %tmp10, %bb7 ] %tmp9 = phi i32 [ 1, %bb3 ], [ %tmp10, %bb7 ] %tmp10 = add nuw nsw i32 %tmp9, 1 %tmp11 = icmp ugt i32 %tmp9, 5 br i1 %tmp11, label %bb12, label %bb7 bb12: ; preds = %bb7 %tmp13 = mul i32 %tmp8, %tmp4 %tmp14 = trunc i32 %tmp13 to i8 fence release store atomic i8 %tmp14, i8* %tmp2 unordered, align 1 fence seq_cst %tmp15 = add nuw nsw i32 %tmp4, 1 %tmp16 = icmp ult i32 %tmp4, 240 br i1 %tmp16, label %bb3, label %bb6 } ran as > opt -passes=loop-vectorize -S -o res.ll ./repro.ll Comment Actions Thanks @skatkov. The test compiles for me, and the part that this patch introduces looks correct, but there seems to be a problem with how %tmp8 is handled - as a live-out first-order-recurrence which fold-tail does not handle (the compare it introduces is not used by anyone). To reproduce the bug w/o this patch, transform the loop iv %tmp9 to start at 0 and exit the loop when equal to 4 (instead of starting at 1 and exiting at 5), and add 1 to %tmp8. Would be good to open a PR. Comment Actions
Hi Ayal, thank you for information. Comment Actions FYI: I have extracted the ARM test changes from D76838 and have committed that in 9633fc14aef7. Just saying also just in case you do end up reverting this, then that would need some changes. Probably the easiest is to change the check-prefixes to --check-prefixes=COMMON, because then it won't be checking much which is fine in that case; I will pick this up later, also because they will need some work later. I have a suspicion sgt_no_loopguard is miscompiled, but I need to look closer. Anyway, these were the tests that I thought were good to have, and with them in tree it is easier to talk about them. Some of these tests also show that this change does not support some of the cases that was supported by D76838, and I will be looking at that, hence I will be touching test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll again sooner or later. Comment Actions I wrote a PR about a crash that started happening with this patch: |
nit: perhaps this comments now needs to be moved to line 6782, and we need to say something new/extra about the primary IV?