This is an archive of the discontinued LLVM Phabricator instance.

Differential D77635

[LV] Vectorize with FoldTail when Primary Induction is absent
ClosedPublic

Authored by Ayal on Apr 7 2020, 2:27 AM.

Details

Reviewers

SjoerdMeijer
fhahn
gilr
samparker
dmgreen

Commits

rG16784892347b: [LV] FoldTail w/o Primary Induction

Summary

Introduce a new VPWidenPrimaryInductionRecipe to generate a vector primary
induction for use in fold-tail-with-masking when a scalar primary induction is
missing.

Follows approach (1) discussed in D76838.
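
To make the summary concrete, here is a minimal scalar C++ model of fold-tail-with-masking. It is only a sketch under stated assumptions: a vectorization factor of 4, no unrolling, and a plain inner loop standing in for vector lanes. The names `sumWithFoldedTail` and `VF` are illustrative and are not LLVM APIs. The model widens a canonical IV (0, VF, 2*VF, ...) into lanes and enables a lane only when its index is at most the backedge-taken count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, not LLVM code: sums Data in "vector" iterations of VF
// lanes, masking off lanes past the end instead of running a scalar tail.
constexpr std::size_t VF = 4;

std::int64_t sumWithFoldedTail(const std::vector<std::int32_t> &Data) {
  const std::uint64_t TripCount = Data.size();
  if (TripCount == 0)
    return 0; // Guard before the vector loop; there is no BTC to compare with.
  const std::uint64_t BTC = TripCount - 1; // backedge-taken count

  std::int64_t Sum = 0;
  // Canonical scalar IV of the vector loop: starts at 0, steps by VF.
  for (std::uint64_t CanonicalIV = 0; CanonicalIV < TripCount;
       CanonicalIV += VF) {
    for (std::uint64_t Lane = 0; Lane < VF; ++Lane) {
      // Widened canonical IV: splat(CanonicalIV) + <0, 1, ..., VF-1>.
      const std::uint64_t VecIV = CanonicalIV + Lane;
      // Header block mask: the lane is active iff VecIV <= BTC.
      const bool Active = VecIV <= BTC;
      if (Active)
        Sum += Data[VecIV];
    }
  }
  return Sum;
}
```

Calling `sumWithFoldedTail` on six elements runs two masked iterations; the last two lanes of the second iteration are disabled by the mask, so no scalar remainder loop is needed.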

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ayal created this revision. Apr 7 2020, 2:27 AM

Herald added a project: Restricted Project. Apr 7 2020, 2:27 AM

Herald added subscribers: llvm-commits, rogfer01, rkruppe, hiraditya.

@SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed?

Thanks for the patch, Ayal!

I put up D77577 yesterday to allow mapping the primary IV used to a different IR value during codegen as an alternative, but I think adding the recipe is more straightforward in the end, as there is a single place we need to use the primary IV.

llvm/lib/Transforms/Vectorize/VPlan.h
333	Maybe include Vector in the name, e.g. VectorInduction, to avoid confusion with Legal's PrimaryInduction
1171	Comment needs updating.

Ayal mentioned this in D76838: [LV][LoopInfo] Transform counting-down loops to counting-up loop. Apr 7 2020, 3:30 AM

Harbormaster failed remote builds in B52129: Diff 255617! Apr 7 2020, 3:45 AM

I was also drafting a patch to implement this yesterday, and it was pretty much this! So I guess that's a good sign. :-)

@SjoerdMeijer , test tail-folding-counting-down.ll introduced in D72324 now fails, as it can be vectorized with fold-tail, but is not vectorized due to cost. What's the intention of this test and how should it be changed?

The purpose of this test was to catch a regression that we were seeing when tail-predication was rejected, but then, incorrectly, vectorisation also wasn't happening.
In this case, I think it is good to force vectorisation with a vectorisation factor of 4 or something along those lines. I was also modifying this test, but that will do for now, and then I will pick it up later.

I've applied the patch locally, and I'm a bit confused that test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll doesn't fail for me, I will double check to see what's going on.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6773	nit: perhaps this comment now needs to be moved to line 6782, and we need to say something new/extra about the primary IV?
llvm/lib/Transforms/Vectorize/VPlan.h
1158	nit: just curious, do we actually need this?

Sorry for the confusion, was referring to the test under test/Transforms/LoopVectorize/ rather than the one under test/Transforms/LoopVectorize/ARM

OK, modified test to force vectorization with VF=4.

(Relevant for picking up later:) Note that llvm/test/Transforms/LoopVectorize/X86/small-size.ll also checks that a loop with reverse iv gets vectorized with fold-tail.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6773	Added a line saying "Start by constructing the desired canonical IV."
llvm/lib/Transforms/Vectorize/VPlan.h
333	Agreed. Updated comment and changed name to VectorLoopScalarIV. Sounds better/ok? Maybe better to avoid overloading the "PrimaryInduction" name throughout, keeping it with its original meaning only, and use "Canonical" instead.
1158	we need a VPValue; it could alternatively be owned by Plan, but seems better to hold it locally here.

Addressed comments.

SjoerdMeijer added inline comments. Apr 8 2020, 5:54 AM

llvm/lib/Transforms/Vectorize/VPlan.h
333	We are well into bikeshedding territory now, but just one quick question. VectorLoopScalarIV describes well what this is, I think, but I was just wondering: if we refer to this as the canonical IV in comments, should it not be named something with CanonicalIV in it?

LGTM, thanks.

llvm/lib/Transforms/Vectorize/VPlan.h
1153	nit: private not needed here, right? Classes should default to private
1154	Could be a unique_ptr?

This revision is now accepted and ready to land. Apr 8 2020, 11:13 AM

fhahn added inline comments. Apr 8 2020, 11:25 AM

llvm/lib/Transforms/Vectorize/VPlan.h
1154	actually, the lifetime is directly tied to the recipe, right? So maybe no pointer is needed at all and we can just add a `VPValue Val` member?

Ayal marked 3 inline comments as done. Apr 8 2020, 1:01 PM

Ayal added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.h
333	Yes, I was wondering above if we should use (Scalar/Vector) `CanonicalIV` throughout, instead of overloading the original `PrimaryInduction`, which is absent... uploading another version to see how it looks.
1153	right, trying to be consistent... They should all be dropped in a separate NFC patch.
1154	Right! Lifetimes are indeed tied; when the recipe turns into a VPInstruction, the VPValue will coincide with 'this'. Good catch.

Address comments, use CanonicalIV instead of overloading the original PrimaryInduction term, update some comments.

Thanks for the patch, and I think CanonicalIV is an improvement.

This LGTM too.

llvm/test/Transforms/LoopVectorize/X86/small-size.ll
171–172	typo: it's -> its

Closed by commit rG16784892347b: [LV] FoldTail w/o Primary Induction (authored by Ayal). Apr 9 2020, 8:08 AM

This revision was automatically updated to reflect the committed changes.

Hello @Ayal, unfortunately this patch causes a functional regression.
For the test below, the vectorizer decided to vectorize the inner loop by 32 although it has only a few iterations, and this causes a miscompile.
Please fix it quickly or revert the patch.

The reproducer:

; ModuleID = './repro.ll'
source_filename = "./repro.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"

@global = external global i8*

define void @hoge(i8* nonnull align 8 dereferenceable_or_null(8) %arg, i8* align 8 dereferenceable_or_null(16) %arg1) {
bb:
  %tmp = load atomic i8*, i8** @global unordered, align 8
  %tmp2 = getelementptr inbounds i8, i8* %tmp, i64 852
  br label %bb3

bb3:                                              ; preds = %bb12, %bb
  %tmp4 = phi i32 [ 1, %bb ], [ %tmp15, %bb12 ]
  %tmp5 = phi i32 [ 0, %bb ], [ %tmp8, %bb12 ]
  br label %bb7

bb6:                                              ; preds = %bb12
  ret void

bb7:                                              ; preds = %bb7, %bb3
  %tmp8 = phi i32 [ %tmp5, %bb3 ], [ %tmp10, %bb7 ]
  %tmp9 = phi i32 [ 1, %bb3 ], [ %tmp10, %bb7 ]
  %tmp10 = add nuw nsw i32 %tmp9, 1
  %tmp11 = icmp ugt i32 %tmp9, 5
  br i1 %tmp11, label %bb12, label %bb7

bb12:                                             ; preds = %bb7
  %tmp13 = mul i32 %tmp8, %tmp4
  %tmp14 = trunc i32 %tmp13 to i8
  fence release
  store atomic i8 %tmp14, i8* %tmp2 unordered, align 1
  fence seq_cst
  %tmp15 = add nuw nsw i32 %tmp4, 1
  %tmp16 = icmp ult i32 %tmp4, 240
  br i1 %tmp16, label %bb3, label %bb6
}

ran as

> opt -passes=loop-vectorize -S -o res.ll ./repro.ll

Thanks @skatkov. The test compiles for me, and the part that this patch introduces looks correct, but there seems to be a problem with how %tmp8 is handled - as a live-out first-order-recurrence which fold-tail does not handle (the compare it introduces is not used by anyone). To reproduce the bug w/o this patch, transform the loop iv %tmp9 to start at 0 and exit the loop when equal to 4 (instead of starting at 1 and exiting at 5), and add 1 to %tmp8. Would be good to open a PR.
Continuing to investigate.
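
For readers who want to see what a correct compilation of the reproducer must compute, the following is a scalar C++ reference model of its two loops. It is a hand-written sketch, not compiler output: the function and field names (`runInnerLoop`, `runOuterLoop`, `LiveOut`) are made up for illustration, and the comments map each variable back to the IR names in the reproducer above. It shows that %tmp8 is a first-order recurrence (it carries the previous iteration's %tmp10) whose final value is used outside the inner loop.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference model of the inner loop bb7 in the reproducer above.
// Recurrence plays the role of %tmp8 and IV the role of %tmp9.
struct InnerResult {
  std::uint32_t LiveOut;    // value of %tmp8 seen by bb12
  std::uint32_t Iterations; // how many times bb7 executed
};

InnerResult runInnerLoop(std::uint32_t Incoming /* %tmp5 */) {
  std::uint32_t Recurrence = Incoming; // %tmp8 on entry from bb3
  std::uint32_t IV = 1;                // %tmp9 on entry from bb3
  std::uint32_t Iterations = 0;
  for (;;) {
    ++Iterations;
    const std::uint32_t Next = IV + 1; // %tmp10
    if (IV > 5)                        // %tmp11
      break;                           // exit to bb12 with the current %tmp8
    Recurrence = Next;                 // backedge: %tmp8 <- %tmp10
    IV = Next;                         // backedge: %tmp9 <- %tmp10
  }
  return {Recurrence, Iterations};
}

// Model of the outer loop: returns the last byte stored through %tmp2.
std::uint8_t runOuterLoop() {
  std::uint32_t Carried = 0; // %tmp5
  std::uint8_t Stored = 0;
  for (std::uint32_t Outer = 1;; ++Outer) { // %tmp4
    const InnerResult R = runInnerLoop(Carried);
    Stored = static_cast<std::uint8_t>(R.LiveOut * Outer); // %tmp13, %tmp14
    Carried = R.LiveOut;
    if (!(Outer < 240)) // %tmp16
      break;
  }
  return Stored;
}
```

In this model the inner loop body runs six times and the value reaching bb12 is 6 regardless of the incoming %tmp5, which gives a simple oracle to compare against the vectorized output.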

Hi Ayal, thank you for the information.
Indeed, I reduced an original reproducer and found this patch to be the first commit exposing the bug.
I would very much appreciate it if you could fix the real problem!

PR filed: https://bugs.llvm.org/show_bug.cgi?id=45526

FYI: I have extracted the ARM test changes from D76838 and have committed them in 9633fc14aef7. I mention this in case you do end up reverting this patch, because those tests would then need some changes. Probably the easiest is to change the check-prefixes to --check-prefixes=COMMON, because then it won't be checking much, which is fine in that case; I will pick this up later, also because the tests will need some more work. I have a suspicion sgt_no_loopguard is miscompiled, but I need to look closer. Anyway, these were the tests that I thought were good to have, and with them in tree it is easier to talk about them.

Some of these tests also show that this change does not support some of the cases that were supported by D76838, and I will be looking at that, hence I will be touching test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll again sooner or later.

SjoerdMeijer mentioned this in rG9633fc14aef7: [LV][ARM] Add tail-folding tests for MVE. NFC. Apr 14 2020, 8:33 AM

SjoerdMeijer mentioned this in D79175: [ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC. Jun 10 2020, 8:57 AM

I wrote a PR about a crash that started happening with this patch:
https://bugs.llvm.org/show_bug.cgi?id=51614

Herald added a subscriber: vkmr. Aug 24 2021, 11:51 PM

rkruppe removed a subscriber: rkruppe. Aug 25 2021, 8:05 AM

Ayal mentioned this in D116123: [VPlan] Handle IV vector splat using VPWidenCanonicalIV. Dec 26 2021, 9:51 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	9 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	23 lines
llvm/lib/Transforms/Vectorize/VPlan.h	34 lines
llvm/lib/Transforms/Vectorize/VPlan.cpp	25 lines
llvm/test/Transforms/LoopVectorize/X86/small-size.ll	20 lines
llvm/test/Transforms/LoopVectorize/tail-folding-counting-down.ll	11 lines

Diff 256315

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

@@ ... @@ bool LoopVectorizationLegality::canVectorize(bool UseVPlanNativePath) {
   // no restrictions.
   return Result;
 }

 bool LoopVectorizationLegality::prepareToFoldTailByMasking() {

   LLVM_DEBUG(dbgs() << "LV: checking if tail can be folded by masking.\n");

-  if (!PrimaryInduction) {
-    reportVectorizationFailure(
-        "No primary induction, cannot fold tail by masking",
-        "Missing a primary induction variable in the loop, which is "
-        "needed in order to fold tail by masking as required.",
-        "NoPrimaryInduction", ORE, TheLoop);
-    return false;
-  }
-
   SmallPtrSet<const Value *, 8> ReductionLiveOuts;

   for (auto &Reduction : getReductionVars())
     ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());

   // TODO: handle non-reduction outside users when tail is folded by masking.
   for (auto *AE : AllowedExit) {
     // Check that all users of allowed exit values are inside the loop or

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ ... @@ void LoopVectorizationPlanner::executePlan(InnerLoopVectorizer &ILV,
   // 1. Create a new empty loop. Unlink the old loop and connect the new one.
   VPCallbackILV CallbackILV(ILV);

   VPTransformState State{BestVF, BestUF, LI,
                          DT, ILV.Builder, ILV.VectorLoopValueMap,
                          &ILV, CallbackILV};
   State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();
   State.TripCount = ILV.getOrCreateTripCount(nullptr);
+  State.CanonicalIV = ILV.Induction;

   //===------------------------------------------------===//
   //
   // Notice: any optimization or new instruction that go
   // into the code below should also be implemented in
   // the cost-model.
   //
   //===------------------------------------------------===//
@@ ... @@ VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
   // All-one mask is modelled as no-mask following the convention for masked
   // load/store/gather/scatter. Initialize BlockMask to no-mask.
   VPValue *BlockMask = nullptr;

   if (OrigLoop->getHeader() == BB) {
     if (!CM.blockNeedsPredication(BB))
       return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

     // Introduce the early-exit compare IV <= BTC to form header block mask.
     // This is used instead of IV < TC because TC may wrap, unlike BTC.
-    VPValue *IV = Plan->getVPValue(Legal->getPrimaryInduction());
+    // Start by constructing the desired canonical IV.
+    VPValue *IV = nullptr;
+    if (Legal->getPrimaryInduction())
+      IV = Plan->getVPValue(Legal->getPrimaryInduction());
+    else {
+      auto IVRecipe = new VPWidenCanonicalIVRecipe();
+      Builder.getInsertBlock()->appendRecipe(IVRecipe);
+      IV = IVRecipe->getVPValue();
+    }
     VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
     BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
     return BlockMaskCache[BB] = BlockMask;
   }

   // This is the block mask. We OR all incoming edges.
   for (auto *Predecessor : predecessors(BB)) {
     VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);
@@ ... @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(unsigned MinVF,
   for (BasicBlock *BB : OrigLoop->blocks()) {
     if (BB == Latch)
       continue;
     BranchInst *Branch = dyn_cast<BranchInst>(BB->getTerminator());
     if (Branch && Branch->isConditional())
       NeedDef.insert(Branch->getCondition());
   }

-  // If the tail is to be folded by masking, the primary induction variable
-  // needs to be represented in VPlan for it to model early-exit masking.
+  // If the tail is to be folded by masking, the primary induction variable, if
+  // exists needs to be represented in VPlan for it to model early-exit masking.
   // Also, both the Phi and the live-out instruction of each reduction are
   // required in order to introduce a select between them in VPlan.
   if (CM.foldTailByMasking()) {
-    NeedDef.insert(Legal->getPrimaryInduction());
+    if (Legal->getPrimaryInduction())
+      NeedDef.insert(Legal->getPrimaryInduction());
     for (auto &Reduction : Legal->getReductionVars()) {
       NeedDef.insert(Reduction.first);
       NeedDef.insert(Reduction.second.getLoopExitInstr());
     }
   }

   // Collect instructions from the original loop that will become trivially dead
   // in the vectorized loop. We don't need to vectorize these instructions. For
@@ ... @@ static ScalarEpilogueLowering getScalarEpilogueLowering(
   // don't look at hints or options, and don't request a scalar epilogue.
   if (OptSize && Hints.getForce() != LoopVectorizeHints::FK_Enabled)
     return CM_ScalarEpilogueNotAllowedOptSize;

   bool PredicateOptDisabled = PreferPredicateOverEpilog.getNumOccurrences() &&
                               !PreferPredicateOverEpilog;

   // 2) Next, if disabling predication is requested on the command line, honour
-  // this and request a scalar epilogue. Also do this if we don't have a
-  // primary induction variable, which is required for predication.
-  if (PredicateOptDisabled || !LVL.getPrimaryInduction())
+  // this and request a scalar epilogue.
+  if (PredicateOptDisabled)
     return CM_ScalarEpilogueAllowed;

   // 3) and 4) look if enabling predication is requested on the command line,
   // with a loop hint, or if the TTI hook indicates this is profitable, request
   // predication .
   if (PreferPredicateOverEpilog ||
       Hints.getPredicate() == LoopVectorizeHints::FK_Enabled ||
       (TTI->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT,
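
The comment in createBlockInMask says the mask uses IV <= BTC rather than IV < TC because the trip count may wrap. The sketch below illustrates that claim with an 8-bit induction type standing in for the real IV type; `MaskCounts` and `countActiveLanes` are hypothetical names chosen for this example and do not exist in LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model only: an 8-bit type stands in for the IV type.
// A loop that runs 256 times has BTC = 255, but TC = 256 wraps to 0 in 8 bits.
struct MaskCounts {
  unsigned ActiveWithTC;  // lanes enabled by IV < TC
  unsigned ActiveWithBTC; // lanes enabled by IV <= BTC
};

MaskCounts countActiveLanes(std::uint8_t BTC) {
  const std::uint8_t TC = static_cast<std::uint8_t>(BTC + 1); // may wrap to 0
  MaskCounts Counts{0, 0};
  // Visit every value the 8-bit IV can take (256 candidate lanes in total).
  for (unsigned I = 0; I < 256; ++I) {
    const std::uint8_t IV = static_cast<std::uint8_t>(I);
    if (IV < TC)
      ++Counts.ActiveWithTC;
    if (IV <= BTC)
      ++Counts.ActiveWithBTC;
  }
  return Counts;
}
```

With `BTC = 255` the wrapped trip count is 0, so the `IV < TC` form enables no lanes at all while the `IV <= BTC` form enables all 256; for smaller counts such as `BTC = 9` the two forms agree.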

llvm/lib/Transforms/Vectorize/VPlan.h

@@ ... @@ struct VPTransformState {
   /// Hold a reference to the Value state information used when generating the
   /// Values of the output IR.
   VectorizerValueMap &ValueMap;

   /// Hold a reference to a mapping between VPValues in VPlan and original
   /// Values they correspond to.
   VPValue2ValueTy VPValue2Value;

+  /// Hold the canonical scalar IV of the vector loop (start=0, step=VF*UF).
+  Value *CanonicalIV = nullptr;
+
   /// Hold the trip count of the scalar loop.
   Value *TripCount = nullptr;

   /// Hold a pointer to InnerLoopVectorizer to reuse its IR generation methods.
   InnerLoopVectorizer *ILV;

   VPCallback &Callback;
 };
@@ ... @@ public:
   using VPRecipeTy = enum {
     VPBlendSC,
     VPBranchOnMaskSC,
     VPInstructionSC,
     VPInterleaveSC,
     VPPredInstPHISC,
     VPReplicateSC,
     VPWidenCallSC,
+    VPWidenCanonicalIVSC,
     VPWidenGEPSC,
     VPWidenIntOrFpInductionSC,
     VPWidenMemoryInstructionSC,
     VPWidenPHISC,
     VPWidenSC,
   };

   VPRecipeBase(const unsigned char SC) : SubclassID(SC) {}
@@ ... @@ public:
   /// Generate the wide load/store.
   void execute(VPTransformState &State) override;

   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
              VPSlotTracker &SlotTracker) const override;
 };

+/// A Recipe for widening the canonical induction variable of the vector loop.
+class VPWidenCanonicalIVRecipe : public VPRecipeBase {
+private:
+  /// A VPValue representing the canonical vector IV.
+  VPValue Val;
+
+public:
+  VPWidenCanonicalIVRecipe() : VPRecipeBase(VPWidenCanonicalIVSC) {}
+  ~VPWidenCanonicalIVRecipe() override = default;
+
+  /// Return the VPValue representing the canonical vector induction variable of
+  /// the vector loop.
+  const VPValue *getVPValue() const { return &Val; }
+  VPValue *getVPValue() { return &Val; }
+
+  /// Method to support type inquiry through isa, cast, and dyn_cast.
+  static inline bool classof(const VPRecipeBase *V) {
+    return V->getVPRecipeID() == VPRecipeBase::VPWidenCanonicalIVSC;
+  }
+
+  /// Generate a canonical vector induction variable of the vector loop, with
+  /// start = {<Part*VF, Part*VF+1, ..., Part*VF+VF-1> for 0 <= Part < UF}, and
+  /// step = <VF*UF, VF*UF, ..., VF*UF>.
+  void execute(VPTransformState &State) override;
+
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+};
+
 /// VPBasicBlock serves as the leaf of the Hierarchical Control-Flow Graph. It
 /// holds a sequence of zero or more VPRecipe's each representing a sequence of
 /// output IR instructions.
 class VPBasicBlock : public VPBlockBase {
 public:
   using RecipeListTy = iplist<VPRecipeBase>;

 private:
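
The review discussion around line 1154 settled on holding the VPValue as a plain member instead of behind a pointer, since its lifetime is tied to the recipe. The following miniature sketch shows that ownership shape in isolation. `MiniValue`, `MiniRecipe`, and `valueLivesInsideRecipe` are hypothetical stand-ins written for this illustration; they are not the real VPValue or VPRecipeBase classes.

```cpp
#include <cassert>

// Hypothetical miniature types; they only mimic the ownership shape discussed
// in the review, not the real VPlan classes.
struct MiniValue {
  int Slot = -1;
};

class MiniRecipe {
  // Held by value: created and destroyed together with the recipe, so no
  // separate allocation or explicit delete is needed.
  MiniValue Val;

public:
  const MiniValue *getValue() const { return &Val; }
  MiniValue *getValue() { return &Val; }
};

// Returns true if the value's storage lies inside the recipe object itself.
bool valueLivesInsideRecipe(const MiniRecipe &R) {
  const char *Begin = reinterpret_cast<const char *>(&R);
  const char *End = Begin + sizeof(MiniRecipe);
  const char *V = reinterpret_cast<const char *>(R.getValue());
  return V >= Begin && V < End;
}
```

The paired const and non-const accessors mirror the two `getVPValue()` overloads in the patch; callers still receive a pointer, but there is nothing for them to free.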

llvm/lib/Transforms/Vectorize/VPlan.cpp

Show First 20 Lines • Show All 796 Lines • ▼ Show 20 Lines	void VPWidenMemoryInstructionRecipe::print(raw_ostream &O, const Twine &Indent,
VPValue *Mask = getMask();		VPValue *Mask = getMask();
if (Mask) {		if (Mask) {
O << ", ";		O << ", ";
Mask->printAsOperand(O, SlotTracker);		Mask->printAsOperand(O, SlotTracker);
}		}
O << "\\l\"";		O << "\\l\"";
}		}

		void VPWidenCanonicalIVRecipe::execute(VPTransformState &State) {
		Value *CanonicalIV = State.CanonicalIV;
		Type *STy = CanonicalIV->getType();
		IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
		Value *VStart = Builder.CreateVectorSplat(State.VF, CanonicalIV, "broadcast");
		for (unsigned Part = 0, UF = State.UF; Part < UF; ++Part) {
		SmallVector<Constant *, 8> Indices;
		for (unsigned Lane = 0, VF = State.VF; Lane < VF; ++Lane)
		Indices.push_back(ConstantInt::get(STy, Part * VF + Lane));
		Constant *VStep = ConstantVector::get(Indices);
		// Add the consecutive indices to the vector value.
		Value *CanonicalVectorIV = Builder.CreateAdd(VStart, VStep, "vec.iv");
		State.set(getVPValue(), CanonicalVectorIV, Part);
		}
		}

		void VPWidenCanonicalIVRecipe::print(raw_ostream &O, const Twine &Indent,
		VPSlotTracker &SlotTracker) const {
		O << " +\n" << Indent << "\"EMIT ";
		getVPValue()->printAsOperand(O, SlotTracker);
		O << " = WIDEN-CANONICAL-INDUCTION \\l\"";
		}

template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);		template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);

void VPValue::replaceAllUsesWith(VPValue *New) {		void VPValue::replaceAllUsesWith(VPValue *New) {
for (VPUser *User : users())		for (VPUser *User : users())
for (unsigned I = 0, E = User->getNumOperands(); I < E; ++I)		for (unsigned I = 0, E = User->getNumOperands(); I < E; ++I)
if (User->getOperand(I) == this)		if (User->getOperand(I) == this)
User->setOperand(I, New);		User->setOperand(I, New);
}		}
▲ Show 20 Lines • Show All 83 Lines • ▼ Show 20 Lines	void VPSlotTracker::assignSlots(const VPRegionBlock *Region) {
for (const VPBlockBase *Block : RPOT)		for (const VPBlockBase *Block : RPOT)
assignSlots(Block);		assignSlots(Block);
}		}

void VPSlotTracker::assignSlots(const VPBasicBlock *VPBB) {		void VPSlotTracker::assignSlots(const VPBasicBlock *VPBB) {
for (const VPRecipeBase &Recipe : *VPBB) {		for (const VPRecipeBase &Recipe : *VPBB) {
if (const auto *VPI = dyn_cast<VPInstruction>(&Recipe))		if (const auto *VPI = dyn_cast<VPInstruction>(&Recipe))
assignSlot(VPI);		assignSlot(VPI);
		else if (const auto *VPIV = dyn_cast<VPWidenCanonicalIVRecipe>(&Recipe))
		assignSlot(VPIV->getVPValue());
}		}
}		}

void VPSlotTracker::assignSlots(const VPlan &Plan) {		void VPSlotTracker::assignSlots(const VPlan &Plan) {

for (const VPValue *V : Plan.VPExternalDefs)		for (const VPValue *V : Plan.VPExternalDefs)
assignSlot(V);		assignSlot(V);


llvm/test/Transforms/LoopVectorize/X86/small-size.ll

.lr.ph: ; preds = %.preheader, %.lr.ph
%indvars.iv.next = add i64 %indvars.iv, 1		%indvars.iv.next = add i64 %indvars.iv, 1
%11 = icmp eq i32 %4, 0		%11 = icmp eq i32 %4, 0
br i1 %11, label %._crit_edge, label %.lr.ph		br i1 %11, label %._crit_edge, label %.lr.ph

._crit_edge: ; preds = %.lr.ph, %.preheader		._crit_edge: ; preds = %.lr.ph, %.preheader
ret void		ret void
}		}

; N is unknown, we need a tail. Can't vectorize because loop has no primary		; Loop has no primary induction as its integer IV has step -1 starting at
; induction.		; unknown N, but can still be vectorized.
		SjoerdMeijer: typo: it's -> its
;CHECK-LABEL: @example3(		;CHECK-LABEL: @example3(
		; CHECK: vector.ph:
		; CHECK: [[BROADCAST_SPLAT2:%.*]] = shufflevector <4 x i64> {{.*}}, <4 x i64> undef, <4 x i32> zeroinitializer
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0,
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i64> undef, i64 [[INDEX]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i64> [[BROADCAST_SPLATINSERT]], <4 x i64> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[VPIV:%.*]] = or <4 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3>
		; CHECK: {{.*}} = icmp ule <4 x i64> [[VPIV]], [[BROADCAST_SPLAT2]]
;CHECK-NOT: <4 x i32>		;CHECK-NOT: <4 x i32>
;CHECK: ret void		;CHECK: ret void
define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {		define void @example3(i32 %n, i32* noalias nocapture %p, i32* noalias nocapture %q) optsize {
%1 = icmp eq i32 %n, 0		%1 = icmp eq i32 %n, 0
br i1 %1, label %._crit_edge, label %.lr.ph		br i1 %1, label %._crit_edge, label %.lr.ph

.lr.ph: ; preds = %0, %.lr.ph		.lr.ph: ; preds = %0, %.lr.ph
%.05 = phi i32 [ %2, %.lr.ph ], [ %n, %0 ]		%.05 = phi i32 [ %2, %.lr.ph ], [ %n, %0 ]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[NEXT_GEP]] to <4 x i16>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[NEXT_GEP]] to <4 x i16>*
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2
; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i16> [[WIDE_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i16> [[WIDE_LOAD]] to <4 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 7, i32 7, i32 7, i32 7>		; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 7, i32 7, i32 7, i32 7>
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[NEXT_GEP4]] to <4 x i32>*		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[NEXT_GEP4]] to <4 x i32>*
; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4		; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i64 [[INDEX_NEXT]], 256
; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !6		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !10
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[TMP7:%.*]], label [[SCALAR_PH]]		; CHECK-NEXT: br i1 true, label [[TMP7:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:		; CHECK: scalar.ph:
; CHECK-NEXT: br label [[TMP6:%.*]]		; CHECK-NEXT: br label [[TMP6:%.*]]
; CHECK: br i1 undef, label [[TMP7]], label [[TMP6]], !llvm.loop !7		; CHECK: br i1 undef, label [[TMP7]], label [[TMP6]], !llvm.loop !11
; CHECK: ret void		; CHECK: ret void
;		;
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
%i.02 = phi i32 [ 0, %0 ], [ %7, %1 ]		%i.02 = phi i32 [ 0, %0 ], [ %7, %1 ]
; CHECK-NEXT: [[TMP30:%.*]] = shl nuw nsw i32 [[TMP29]], 7		; CHECK-NEXT: [[TMP30:%.*]] = shl nuw nsw i32 [[TMP29]], 7
; CHECK-NEXT: [[TMP31:%.*]] = or i64 [[INDEX]], 3		; CHECK-NEXT: [[TMP31:%.*]] = or i64 [[INDEX]], 3
; CHECK-NEXT: [[NEXT_GEP10:%.*]] = getelementptr i32, i32* [[DST]], i64 [[TMP31]]		; CHECK-NEXT: [[NEXT_GEP10:%.*]] = getelementptr i32, i32* [[DST]], i64 [[TMP31]]
; CHECK-NEXT: store i32 [[TMP30]], i32* [[NEXT_GEP10]], align 4		; CHECK-NEXT: store i32 [[TMP30]], i32* [[NEXT_GEP10]], align 4
; CHECK-NEXT: br label [[PRED_STORE_CONTINUE22]]		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE22]]
; CHECK: pred.store.continue22:		; CHECK: pred.store.continue22:
; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 4
; CHECK-NEXT: [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], 260		; CHECK-NEXT: [[TMP32:%.*]] = icmp eq i64 [[INDEX_NEXT]], 260
; CHECK-NEXT: br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !8		; CHECK-NEXT: br i1 [[TMP32]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !12
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: br i1 true, label [[TMP34:%.*]], label [[SCALAR_PH]]		; CHECK-NEXT: br i1 true, label [[TMP34:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:		; CHECK: scalar.ph:
; CHECK-NEXT: br label [[TMP33:%.*]]		; CHECK-NEXT: br label [[TMP33:%.*]]
; CHECK: br i1 undef, label [[TMP34]], label [[TMP33]], !llvm.loop !9		; CHECK: br i1 undef, label [[TMP34]], label [[TMP33]], !llvm.loop !13
; CHECK: ret void		; CHECK: ret void
;		;
br label %1		br label %1

; <label>:1 ; preds = %1, %0		; <label>:1 ; preds = %1, %0
%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]		%.04 = phi i16* [ %src, %0 ], [ %2, %1 ]
%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]		%.013 = phi i32* [ %dst, %0 ], [ %6, %1 ]
%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]		%i.02 = phi i64 [ 0, %0 ], [ %7, %1 ]

llvm/test/Transforms/LoopVectorize/tail-folding-counting-down.ll

	; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck %s			; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -force-vector-width=4 -S | FileCheck %s

	; Check that when we can't predicate this loop that it is still vectorised (with			; Check that a counting-down loop which has no primary induction variable
	; an epilogue).			; is vectorized with preferred predication.
	; TODO: the reason this can't be predicated is because a primary induction
	; variable can't be found (not yet) for this counting down loop. But with that
	; fixed, this should be able to be predicated.

	; CHECK-LABEL: vector.body:			; CHECK-LABEL: vector.body:
				; CHECK-LABEL: middle.block:
				; CHECK-NEXT: br i1 true,
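
Note on the new checks: `br i1 true` in middle.block is consistent with a folded tail, where the vector loop is expected to cover every original iteration so that nothing is left for a scalar remainder. A rough sketch of that arithmetic, under the assumption that the vector trip count is the scalar trip count rounded up to a multiple of the vector width. The helper name foldedVectorTripCount is hypothetical, and overflow of the addition is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round the scalar trip count up to the next multiple
// of Step (VF * UF). With a masked tail the vector loop runs to this bound,
// so it covers all original iterations. Overflow is ignored in this sketch.
uint64_t foldedVectorTripCount(uint64_t TripCount, uint64_t Step) {
  assert(Step > 0 && "step must be non-zero");
  return (TripCount + Step - 1) / Step * Step;
}
```

With a trip count of 10 and a step of 4 the vector loop runs to index 12, and the last vector iteration executes with only two of its four lanes enabled.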

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"

	define dso_local void @foo(i8* noalias nocapture readonly %A, i8* noalias nocapture readonly %B, i8* noalias nocapture %C, i32 %N) {			define dso_local void @foo(i8* noalias nocapture readonly %A, i8* noalias nocapture readonly %B, i8* noalias nocapture %C, i32 %N) {
	entry:			entry:
	%cmp6 = icmp eq i32 %N, 0			%cmp6 = icmp eq i32 %N, 0
	br i1 %cmp6, label %while.end, label %while.body.preheader			br i1 %cmp6, label %while.end, label %while.body.preheader
