This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorize with FoldTail when Primary Induction is absent
ClosedPublic

Authored by Ayal on Apr 7 2020, 2:27 AM.

Details

Summary

Introduce a new VPWidenPrimaryInductionRecipe to generate a vector primary
induction for use in fold-tail-with-masking when a scalar primary induction is
missing.

Follows approach (1) discussed in D76838.

Diff Detail

Event Timeline

Ayal created this revision.Apr 7 2020, 2:27 AM
Ayal added a comment.Apr 7 2020, 2:31 AM

@SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed?

fhahn added a comment.Apr 7 2020, 2:58 AM

Thanks for the patch, Ayal!

I've put up D77577 yesterday to allow mapping the primary IV used to a different IR value during codegen as alternative, but I think adding the recipe is more straight forward in the end, as there is a single place we need to use the primary IV.

llvm/lib/Transforms/Vectorize/VPlan.h
333

Maybe include Vector in the name, e.g. VectorInduction, to avoid confusion with Legal's PrimaryInduction

1171

Comment needs updating.

I was also drafting a patch to implement this yesterday, and it was pretty much this! So I guess that's a good sign. :-)

@SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed?

The purpose of this test was to catch a regression that we were seeing when tail-predication was rejected, but then incorrectly vectorisation also wasn't happening.
In this case, I think it is good to force vectorisation with a vectorisation vector of 4 or something along those lines. I was also modifying this test, but that will do for now, and then I will pick it up later.

I've applied the patch locally, and I'm a bit confused that test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll doesn't fail for me, I will double check to see what's going on.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6773

nit: perhaps this comments now needs to be moved to line 6782, and we need to say something new/extra about the primary IV?

llvm/lib/Transforms/Vectorize/VPlan.h
1158

nit: just curious, do we actually need this?

Ayal marked 3 inline comments as done.Apr 8 2020, 4:59 AM

I was also drafting a patch to implement this yesterday, and it was pretty much this! So I guess that's a good sign. :-)

@SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed?

The purpose of this test was to catch a regression that we were seeing when tail-predication was rejected, but then incorrectly vectorisation also wasn't happening.
In this case, I think it is good to force vectorisation with a vectorisation vector of 4 or something along those lines. I was also modifying this test, but that will do for now, and then I will pick it up later.

Sorry for the confusion, was referring to the test under test/Transforms/LoopVectorize/ rather than the one under test/Transforms/LoopVectorize/ARM

OK, modified test to force vectorization with VF=4.

(Relevant for picking up later:) Note that llvm/test/Transforms/LoopVectorize/X86/small-size.ll also checks that a loop with reverse iv gets vectorized with fold-tail.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6773

Added a line saying "Start by constructing the desired canonical IV."

llvm/lib/Transforms/Vectorize/VPlan.h
333

Agreed. Updated comment and changed name to VectorLoopScalarIV. Sounds better/ok?

Maybe better to avoid overloading the "PrimaryInduction" name throughout, keeping it with its original meaning only, and use "Canonical" instead.

1158

we need a VPValue; it could alternatively be owned by Plan, but seems better to hold it locally here.

Ayal updated this revision to Diff 255969.Apr 8 2020, 5:05 AM
Ayal marked an inline comment as done.

Addressed comments.

SjoerdMeijer added inline comments.Apr 8 2020, 5:54 AM
llvm/lib/Transforms/Vectorize/VPlan.h
333

We are well into bikeshedding territory now, but just one quick question. VectorLoopScalarIV describes well what this is I think, but was just wondering that if we refer to this as the canonical IV in comments, should this not be named something with CanonicalIV in it?

fhahn accepted this revision.Apr 8 2020, 11:13 AM

LGTM, thanks.

llvm/lib/Transforms/Vectorize/VPlan.h
1153

nit: private not needed here, right? Classes should default to private

1154

Could be a unique_ptr?

This revision is now accepted and ready to land.Apr 8 2020, 11:13 AM
fhahn added inline comments.Apr 8 2020, 11:25 AM
llvm/lib/Transforms/Vectorize/VPlan.h
1154

actually, the lifetime is directly tied to the recipe, right? So maybe no pointer is needed at all and we can just add a VPValue Val member?

Ayal marked 3 inline comments as done.Apr 8 2020, 1:01 PM
Ayal added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.h
333

Yes, I was wondering above if we should use (Scalar/Vector) CanonicalIV throughout, instead of overloading the original PrimaryInduction, which is absent... uploading another version to see how it looks.

1153

right, trying to be consistent...
They should all be dropped in a separate NFC patch.

1154

Right! Lifetimes are indeed tied; when the recipe turns into a VPInstruction, the VPValue will coincide with 'this'. Good catch.

Ayal updated this revision to Diff 256097.Apr 8 2020, 1:18 PM

Address comments, use CanonicalIV instead of overloading the original PrimaryInduction term, update some comments.

SjoerdMeijer accepted this revision.Apr 8 2020, 3:17 PM

Thanks for the patch, and I think CanonicalIV is an improvement.

This LGTM too.

llvm/test/Transforms/LoopVectorize/X86/small-size.ll
171–172

typo: it's -> its

This revision was automatically updated to reflect the committed changes.

Hello @Ayal, unfortunately this patch causes the functional regression.
For the test below, vectorizer decided to vectorize inner loop by 32 while it has only a couple of iteration and it causes a miscompile.
Please fix it quickly or revert the patch.

The reproducer:

; ModuleID = './repro.ll'
source_filename = "./repro.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"

@global = external global i8*

define void @hoge(i8* nonnull align 8 dereferenceable_or_null(8) %arg, i8* align 8 dereferenceable_or_null(16) %arg1) {
bb:
  %tmp = load atomic i8*, i8** @global unordered, align 8
  %tmp2 = getelementptr inbounds i8, i8* %tmp, i64 852
  br label %bb3

bb3:                                              ; preds = %bb12, %bb
  %tmp4 = phi i32 [ 1, %bb ], [ %tmp15, %bb12 ]
  %tmp5 = phi i32 [ 0, %bb ], [ %tmp8, %bb12 ]
  br label %bb7

bb6:                                              ; preds = %bb12
  ret void

bb7:                                              ; preds = %bb7, %bb3
  %tmp8 = phi i32 [ %tmp5, %bb3 ], [ %tmp10, %bb7 ]
  %tmp9 = phi i32 [ 1, %bb3 ], [ %tmp10, %bb7 ]
  %tmp10 = add nuw nsw i32 %tmp9, 1
  %tmp11 = icmp ugt i32 %tmp9, 5
  br i1 %tmp11, label %bb12, label %bb7

bb12:                                             ; preds = %bb7
  %tmp13 = mul i32 %tmp8, %tmp4
  %tmp14 = trunc i32 %tmp13 to i8
  fence release
  store atomic i8 %tmp14, i8* %tmp2 unordered, align 1
  fence seq_cst
  %tmp15 = add nuw nsw i32 %tmp4, 1
  %tmp16 = icmp ult i32 %tmp4, 240
  br i1 %tmp16, label %bb3, label %bb6
}

ran as

> opt -passes=loop-vectorize -S -o res.ll ./repro.ll
Ayal added a comment.Apr 14 2020, 1:39 AM

Hello @Ayal, unfortunately this patch causes the functional regression.
For the test below, vectorizer decided to vectorize inner loop by 32 while it has only a couple of iteration and it causes a miscompile.
Please fix it quickly or revert the patch.

The reproducer:

; ModuleID = './repro.ll'
source_filename = "./repro.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"

@global = external global i8*

define void @hoge(i8* nonnull align 8 dereferenceable_or_null(8) %arg, i8* align 8 dereferenceable_or_null(16) %arg1) {
bb:
  %tmp = load atomic i8*, i8** @global unordered, align 8
  %tmp2 = getelementptr inbounds i8, i8* %tmp, i64 852
  br label %bb3

bb3:                                              ; preds = %bb12, %bb
  %tmp4 = phi i32 [ 1, %bb ], [ %tmp15, %bb12 ]
  %tmp5 = phi i32 [ 0, %bb ], [ %tmp8, %bb12 ]
  br label %bb7

bb6:                                              ; preds = %bb12
  ret void

bb7:                                              ; preds = %bb7, %bb3
  %tmp8 = phi i32 [ %tmp5, %bb3 ], [ %tmp10, %bb7 ]
  %tmp9 = phi i32 [ 1, %bb3 ], [ %tmp10, %bb7 ]
  %tmp10 = add nuw nsw i32 %tmp9, 1
  %tmp11 = icmp ugt i32 %tmp9, 5
  br i1 %tmp11, label %bb12, label %bb7

bb12:                                             ; preds = %bb7
  %tmp13 = mul i32 %tmp8, %tmp4
  %tmp14 = trunc i32 %tmp13 to i8
  fence release
  store atomic i8 %tmp14, i8* %tmp2 unordered, align 1
  fence seq_cst
  %tmp15 = add nuw nsw i32 %tmp4, 1
  %tmp16 = icmp ult i32 %tmp4, 240
  br i1 %tmp16, label %bb3, label %bb6
}

ran as

> opt -passes=loop-vectorize -S -o res.ll ./repro.ll

Thanks @skatkov. The test compiles for me, and the part that this patch introduces looks correct, but there seems to be a problem with how %tmp8 is handled - as a live-out first-order-recurrence which fold-tail does not handle (the compare it introduces is not used by anyone). To reproduce the bug w/o this patch, transform the loop iv %tmp9 to start at 0 and exit the loop when equal to 4 (instead of starting at 1 and exiting at 5), and add 1 to %tmp8. Would be good to open a PR.
Continuing to investigate.

Thanks @skatkov. The test compiles for me, and the part that this patch introduces looks correct, but there seems to be a problem with how %tmp8 is handled - as a live-out first-order-recurrence which fold-tail does not handle (the compare it introduces is not used by anyone). To reproduce the bug w/o this patch, transform the loop iv %tmp9 to start at 0 and exit the loop when equal to 4 (instead of starting at 1 and exiting at 5), and add 1 to %tmp8. Would be good to open a PR.
Continuing to investigate.

Hi Ayal, thank you for information.
Indeed, I've reduced an original reproducer and found this patch as a first commit exposing the bug.
Would very appreciate if you could fix the real problem!

FYI: I have extracted the ARM test changes from D76838 and have committed that in 9633fc14aef7. Just saying also just in case you do end up reverting this, then that would need some changes. Probably the easiest is to change the check-prefixes to --check-prefixes=COMMON, because then it won't be checking much which is fine in that case; I will pick this up later, also because they will need some work later. I have a suspicion sgt_no_loopguard is miscompiled, but I need to look closer. Anyway, these were the tests that I thought were good to have, and with them in tree it is easier to talk about them.

Some of these tests also show that this change does not support some of the cases that was supported by D76838, and I will be looking at that, hence I will be touching test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll again sooner or later.

I wrote a PR about a crash that started happening with this patch:
https://bugs.llvm.org/show_bug.cgi?id=51614

rkruppe removed a subscriber: rkruppe.Aug 25 2021, 8:05 AM