This is an archive of the discontinued LLVM Phabricator instance.


Differential D86074

[ARM][MVE] Tail-predication: check get.active.lane.mask's TC value
ClosedPublic

Authored by SjoerdMeijer on Aug 17 2020, 6:53 AM.


Details

Reviewers

efriedma
samparker
dmgreen
samtebbs

Commits

rG676febc044ec: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value

Summary

This adds checks for the original scalar loop tripcount value, i.e. the second argument of get.active.lane.mask, and performs several sanity checks to see if it is of the form that we expect, similar to what we already do for the IV (the first argument of get.active.lane.mask).

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Aug 17 2020, 6:53 AM

Herald added a project: Restricted Project. Aug 17 2020, 6:53 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls.

SjoerdMeijer requested review of this revision. Aug 17 2020, 6:53 AM

SjoerdMeijer added inline comments. Aug 17 2020, 9:03 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
409	I probably forgot to state what I didn't do here. At this place, we could add more checks and cross-reference the BTC obtained from IR pattern matching with the BTC that SCEV can calculate. For example, if we find `%BTC = add %N, -1` for the get.active.lane.mask intrinsic, then we could take the variable `%N` and see if it is used in the backedge-taken count expression that SCEV calculates for this loop. If we find `%N` as an operand in both expressions, we know both expressions are bound by the same variable, which is a good check to have.

This is easy for the simple cases, but as soon as we have a SCEVAddRecExpr, things get complicated pretty fast. For example, if the pattern-matched BTC instruction is described with:

{(-1 + (sext i16 %N to i32)),+,-1}<nw><%for.body>

and the BTC of the loop vectorised with a factor of 4 with:

((-4 + (4 * ({(3 + (sext i16 %N to i32))<nsw>,+,-1}<%for.body> /u 4))<nuw>) /u 4)

then extracting `%N` from both of these expressions and comparing them involves writing a mini SCEV visitor. I am a bit reluctant to do that, as it may not be very generic, and I was hoping that the checks already performed are good enough smoke tests.
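Editorial illustration: the cross-check described in this comment (does `%N` occur in both expressions?) can be sketched on a toy expression tree. This is only a rough model under stated assumptions; `Expr`, `leaf`, `node`, `usesVariable` and `boundBySameVariable` are hypothetical names invented for this sketch, not LLVM's SCEV classes or any code from the patch.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a SCEV expression node.
// This is NOT LLVM's SCEV class hierarchy.
struct Expr {
  std::string Name;                        // e.g. "add", "sext", "addrec", "%N"
  std::vector<std::shared_ptr<Expr>> Ops;  // empty for leaves
};
using ExprPtr = std::shared_ptr<Expr>;

// Build a leaf (a constant or a variable such as "%N").
ExprPtr leaf(const std::string &Name) {
  return std::make_shared<Expr>(Expr{Name, {}});
}

// Build an interior node with the given operands.
ExprPtr node(const std::string &Name, std::vector<ExprPtr> Ops) {
  return std::make_shared<Expr>(Expr{Name, std::move(Ops)});
}

// Recursively visit E and report whether a leaf named Var occurs in it.
bool usesVariable(const ExprPtr &E, const std::string &Var) {
  if (E->Ops.empty())
    return E->Name == Var;
  for (const ExprPtr &Op : E->Ops)
    if (usesVariable(Op, Var))
      return true;
  return false;
}

// Two expressions are "bound by the same variable" if Var occurs in both.
bool boundBySameVariable(const ExprPtr &A, const ExprPtr &B,
                         const std::string &Var) {
  return usesVariable(A, Var) && usesVariable(B, Var);
}
```

Building the two expressions quoted above as trees and calling `boundBySameVariable(PatternBTC, VectorBTC, "%N")` would succeed in this model. A real implementation would have to deal with casts, flags and nested recurrences, which is the complexity the comment is wary of.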

Added a related comment to D85737; probably makes sense to continue that discussion there.

llvm/lib/Target/ARM/MVETailPredication.cpp
609	Why are we searching the basic block here, instead of just using `dyn_cast<Instruction>(BTC)`?

SjoerdMeijer mentioned this in D86147: [LangRef] Revise semantics of get.active.lane.mask. Aug 18 2020, 9:26 AM

Thanks.

Following the discussion in D85737 and D86147, I am going to progress that first and replace the BTC with the tripcount in the intrinsic. After that, I will return to this. We won't need to check the BTC, but we will need very similar checks for the tripcount.

SjoerdMeijer mentioned this in D86303: [ARM][MVE] Tail-predication: remove the BTC + 1 overflow checks. Aug 24 2020, 7:04 AM

SjoerdMeijer mentioned this in rG2002bb487898: [LangRef] Revise semantics of intrinsic get.active.lane.mask. Aug 25 2020, 8:24 AM

This is a (partial) rewrite of the patch after we changed the semantics of get.active.lane.mask to accept the loop tripcount as its second argument, and not the backedge-taken count. This now implements several checks to see if the tripcount belongs to this loop.

SjoerdMeijer added inline comments. Sep 9 2020, 7:47 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
609	There was a use for this in the previous version of this patch, to reuse some IR, but it's not necessary anymore in this version, so has been removed.

samparker added inline comments. Sep 10 2020, 1:33 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
390	nit: unnecessary parenthesis nesting.
422	It looks like this can be sunk into the if-statement that defines it.
424	isa<>
428	isa<>
432	nit: parenthesis
439	I guess we should return false for any other SCEVExpr type?

Cheers, comments addressed.

I think this is okay, certainly for SCEVUnknown and SCEVAddRecExpr, and I wouldn't be up for having a big pattern search again to completely double check that everything is as we expect. Maybe something will crop up that requires that, but at least this is a good start.

This revision is now accepted and ready to land. Sep 14 2020, 2:45 AM

Thanks for that, and agreed with your remarks. I think this is already a bit more generic/flexible and thus better than what we had, but it certainly isn't fully generic. I am willing to review this once that becomes important. This logic would then have to be moved to ScalarEvolution and made generic.

Closed by commit rG676febc044ec: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value (authored by SjoerdMeijer). Sep 14 2020, 3:32 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG676febc044ec: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value.

efriedma added inline comments. Sep 14 2020, 1:58 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	Why do we need this check? Emitting vctp32 should be okay even if we can't actually tail-predicate the loop. The overflow check should be enough to ensure that it's safe to emit vctp32, I think? Or am I forgetting something?

SjoerdMeijer added inline comments. Sep 14 2020, 2:15 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	I could have a look where exactly, but if I am not mistaken you suggested on one of the previous patches that we need to check that this tripcount/elementcount actually belongs to this loop, similar to what we already did for the IV. The reasoning was that for now get.active.lane.mask is emitted by the vectoriser for nicely behaving loops, but it wouldn't be difficult to imagine that soon we will have a corresponding user-facing intrinsic. I think I am quoting that, if I remember well, and so these checks are needed. And if we emit the VCTP, then that represents tail-predication, i.e. the VCTP intrinsic can be picked up in the ARMLowOverheadLoops pass and turned into a tail-predicated loop (after additional checks).

SjoerdMeijer added inline comments. Sep 14 2020, 3:55 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	I did have a look because I was curious if I had started imagining things. This is the remark I had in mind: https://reviews.llvm.org/D79175#2063586 This remark is explicitly about "L" though. I thought there was a similar remark about the 2nd argument when it still was the BTC (previous version of this patch), but I don't think I can find that easily now.

efriedma added inline comments. Sep 16 2020, 6:34 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	Let me try to reformulate in a different way that might make it easier to understand. What you're doing here has two essential steps:

1. Convert "llvm.get.active.lane.mask(X, Y)" to "llvm.arm.mve.vctp(Y - X)".
2. Convert "Y - X" to a simpler induction variable.

In theory, you could split these steps; (1) could be legal even without (2).

Step 1 doesn't depend on the loop, or even on whether the statement is in a loop at all. The only requirement is that the subtraction itself doesn't overflow.

Step 2 requires that "Y - X" is equivalent to the new induction variable: it needs to be an AddRec for the loop you're inserting the PHI into, and the generated PHI has to have the same base and increment.

Neither of these should be directly connected to the trip count of the loop, I think.

The way the code is currently written, I think you're trying to prove more than you actually need to. If the induction variable has the "wrong" base or increment, ARMLowOverheadLoops will ultimately fail to tail-predicate, but I'm not sure that's actually a problem.
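Editorial illustration of step 1: the following standalone C++ sketch models both intrinsics on plain integers to show why the rewrite only needs "Y - X" not to overflow. It is a simplified model under assumptions, not the patch's code: lane `i` of the lane mask is taken to be active when `X + i < Y` (evaluated in 64 bits so the addition cannot wrap), and `vctp(N)` is taken to activate the first `min(N, 4)` lanes. The names `activeLaneMask`, `vctp` and `canRewriteToVctp` are hypothetical.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned VectorWidth = 4;
using Mask = std::array<bool, VectorWidth>;

// Simplified model of llvm.get.active.lane.mask(X, Y): lane I is active when
// X + I < Y. The sum is evaluated in 64 bits so it cannot wrap (assumption).
Mask activeLaneMask(uint32_t X, uint32_t Y) {
  Mask M{};
  for (unsigned I = 0; I < VectorWidth; ++I)
    M[I] = static_cast<uint64_t>(X) + I < Y;
  return M;
}

// Simplified model of a 32-bit VCTP: the first min(N, VectorWidth) lanes are
// active.
Mask vctp(uint32_t N) {
  Mask M{};
  for (unsigned I = 0; I < VectorWidth; ++I)
    M[I] = I < N;
  return M;
}

// Step 1 is only valid when the unsigned subtraction Y - X does not wrap.
bool canRewriteToVctp(uint32_t X, uint32_t Y) { return X <= Y; }
```

In this model `activeLaneMask(X, Y)` and `vctp(Y - X)` agree whenever `canRewriteToVctp(X, Y)` holds. When `X > Y` the subtraction wraps to a large value, so `vctp` would enable every lane while the lane mask enables none.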

SjoerdMeijer added inline comments. Sep 17 2020, 7:15 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	Thanks again Eli for explaining/elaborating. Let me know what you prefer or think is best: rip this particular bit out (revert it), or leave it for the moment. I am asking because I will need some time to have a look at this:

> Step 2 requires that "Y - X" is equivalent to the new induction variable: it needs to be an AddRec for the loop you're inserting the PHI into, and the generated PHI has to have the same base and increment.

This "Y - X" expression can be a difficult one to analyse, and I need to see how to do that.

efriedma added inline comments. Sep 17 2020, 2:26 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	> Let me know what you prefer or think is best: rip this particular bit out (revert it), or leave it for the moment. I am asking because I will need some time to have a look at this:

Post the patches in whatever order you think makes sense for review.

> Step 2 requires that "Y - X" is equivalent to the new induction variable: it needs to be an AddRec for the loop you're inserting the PHI into, and the generated PHI has to have the same base and increment. This "Y - X" expression can be a difficult one to analyse, and I need to see how to do that.

Step 2 should be easy. In the cases you're interested in, the "Y - X" SCEV expression should look something like `{ElementCount,+,-VectorWidth}`. VectorWidth is a constant, and you don't really need to analyze ElementCount.
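Editorial illustration of step 2: an add recurrence `{Start,+,Step}` has the value `Start + Step * Iter` on iteration `Iter`. The sketch below, which assumes wrapping 32-bit arithmetic and uses hypothetical helper names (`addRecAt`, `phiValueAt`, `phiMatchesAddRec`), compares a PHI that starts at the element count and subtracts the vector width each iteration against `{ElementCount,+,-VectorWidth}`. It models the idea only and does not use ScalarEvolution.

```cpp
#include <cstdint>

// Value of the add recurrence {Start,+,Step} on iteration Iter (0-based),
// computed in wrapping 32-bit unsigned arithmetic.
uint32_t addRecAt(uint32_t Start, uint32_t Step, uint32_t Iter) {
  return Start + Step * Iter;
}

// Value of a PHI that starts at Start and is decremented by VectorWidth on
// every backedge, as seen on iteration Iter.
uint32_t phiValueAt(uint32_t Start, uint32_t VectorWidth, uint32_t Iter) {
  uint32_t Phi = Start;
  for (uint32_t I = 0; I < Iter; ++I)
    Phi -= VectorWidth;
  return Phi;
}

// Checks, for the first NumIters iterations, that the PHI has the same base
// and increment as {ElementCount,+,-VectorWidth}, and that both equal Y - X
// with Y = ElementCount and X = {0,+,VectorWidth}.
bool phiMatchesAddRec(uint32_t ElementCount, uint32_t VectorWidth,
                      uint32_t NumIters) {
  for (uint32_t Iter = 0; Iter < NumIters; ++Iter) {
    uint32_t X = addRecAt(0, VectorWidth, Iter);
    uint32_t Expected = addRecAt(ElementCount, 0u - VectorWidth, Iter);
    if (ElementCount - X != Expected)
      return false;
    if (phiValueAt(ElementCount, VectorWidth, Iter) != Expected)
      return false;
  }
  return true;
}
```

With an element count of 10 and a vector width of 4, the recurrence takes the values 10, 6 and 2 on the first three iterations, which is the number of elements still to be processed in each.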

Sorry, I wrote a reply end of last week, but apparently forgot to push submit. So please see my reply inline, but I will open a new review soon, where it's probably best to continue this discussion and my reply.

llvm/lib/Target/ARM/MVETailPredication.cpp
376	I think I got a much better understanding of your suggestions now while I tried out a few things, but that's what I wanted to double check. Taking an example for `{ElementCount,+,-VectorWidth}`, it is indeed easy to create a SCEV for that here, e.g.:

(-4 + %N)

At this point in the code here, we are not yet transforming the IR, but what we will generate is:

vector.body:
  %7 = phi i32 [ %N, %vecItor.ph ], [ %9, %vector.body ]
  %9 = sub i32 %7, 4
  br

If I understand things correctly, you would like to sanity check that the SCEV expression (-4 + %N) matches this IR, and thus that Phi %7 is a nice AddRec, which I think it is by definition? I am not entirely sure what the added value of this check would be. It feels like that could, for example, be an assert somewhere, and perhaps it is easier to do this in ARMLowOverheadLoops and not here, as we don't have the transformed IR here.

> I think you're trying to prove more than you actually need to.

I am kind of back to where I was before, and thinking that the current check makes some sense, but again I am of course perfectly happy to rip it out if we don't need it and let it be ARMLowOverheadLoops' problem, which indeed will probably not even trigger.

SjoerdMeijer mentioned this in D88086: [ARM][MVE] tail-predication: checks for the elementcount, cont'd. Sep 22 2020, 3:56 AM

efriedma added inline comments. Sep 22 2020, 12:11 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
376	The PHI check I was describing is essentially the existing `if (VectorWidth == StepValue)` check.

Revision Contents

Path	Size
llvm/lib/Target/ARM/MVETailPredication.cpp	81 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll	189 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll	62 lines

Diff 290948

llvm/lib/Target/ARM/MVETailPredication.cpp

Show First 20 Lines • Show All 113 Lines • ▼ Show 20 Lines
private:		private:
/// Perform the relevant checks on the loop and convert if possible.		/// Perform the relevant checks on the loop and convert if possible.
bool TryConvert(Value *TripCount);		bool TryConvert(Value *TripCount);

/// Return whether this is a vectorized loop, that contains masked		/// Return whether this is a vectorized loop, that contains masked
/// load/stores.		/// load/stores.
bool IsPredicatedVectorLoop();		bool IsPredicatedVectorLoop();

/// Perform checks on the arguments of @llvm.get.active.lane.mask		/// Perform several checks on the arguments of @llvm.get.active.lane.mask
/// intrinsic: check if the first is a loop induction variable, and for the		/// intrinsic. E.g., check that the loop induction variable and the element
/// the second check that no overflow can occur in the expression that use		/// count are of the form we expect, and also perform overflow checks for
/// this backedge-taken count.		/// the new expressions that are created.
bool IsSafeActiveMask(IntrinsicInst ActiveLaneMask, Value TripCount,		bool IsSafeActiveMask(IntrinsicInst ActiveLaneMask, Value TripCount,
FixedVectorType *VecTy);		FixedVectorType *VecTy);

/// Insert the intrinsic to represent the effect of tail predication.		/// Insert the intrinsic to represent the effect of tail predication.
void InsertVCTPIntrinsic(IntrinsicInst ActiveLaneMask, Value TripCount,		void InsertVCTPIntrinsic(IntrinsicInst ActiveLaneMask, Value TripCount,
FixedVectorType *VecTy);		FixedVectorType *VecTy);

/// Rematerialize the iteration count in exit blocks, which enables		/// Rematerialize the iteration count in exit blocks, which enables
▲ Show 20 Lines • Show All 234 Lines • ▼ Show 20 Lines
// 3) The IV must be an induction phi with an increment equal to the		// 3) The IV must be an induction phi with an increment equal to the
// vector width.		// vector width.
bool MVETailPredication::IsSafeActiveMask(IntrinsicInst *ActiveLaneMask,		bool MVETailPredication::IsSafeActiveMask(IntrinsicInst *ActiveLaneMask,
Value TripCount, FixedVectorType VecTy) {		Value TripCount, FixedVectorType VecTy) {
bool ForceTailPredication =		bool ForceTailPredication =
EnableTailPredication == TailPredication::ForceEnabledNoReductions \|\|		EnableTailPredication == TailPredication::ForceEnabledNoReductions \|\|
EnableTailPredication == TailPredication::ForceEnabled;		EnableTailPredication == TailPredication::ForceEnabled;

// 1) TODO: Check that the TripCount (TC) belongs to this loop (originally).		// 1) Check that the original scalar loop TripCount (TC) belongs to this loop.
// The scalar tripcount corresponds the number of elements processed by the		// The scalar tripcount corresponds the number of elements processed by the
// loop, so we will refer to that from this point on.		// loop, so we will refer to that from this point on.
auto *ElemCountVal = ActiveLaneMask->getOperand(1);		Value *ElemCount = ActiveLaneMask->getOperand(1);
		auto *EC= SE->getSCEV(ElemCount);
		auto *TC = SE->getSCEV(TripCount);
		int VectorWidth = VecTy->getNumElements();
		ConstantInt *ConstElemCount = nullptr;

		if (!SE->isLoopInvariant(EC, L)) {
		LLVM_DEBUG(dbgs() << "ARM TP: element count must be loop invariant.\n");
		return false;
		}

		if ((ConstElemCount = dyn_cast<ConstantInt>(ElemCount))) {
		ConstantInt *TC = dyn_cast<ConstantInt>(TripCount);
		if (!TC) {
		LLVM_DEBUG(dbgs() << "ARM TP: Constant tripcount expected in "
		"set.loop.iterations\n");
		return false;
		}

		// Calculate 2 tripcount values and check that they are consistent with
		// each other:
		// i) The number of loop iterations extracted from the set.loop.iterations
		// intrinsic, multipled by the vector width:
		uint64_t TC1 = TC->getZExtValue() * VectorWidth;

		// ii) TC1 has to be equal to TC + 1, with the + 1 to compensate for start
		// counting from 0.
		uint64_t TC2 = ConstElemCount->getZExtValue() + 1;

		if (TC1 != TC2) {
		LLVM_DEBUG(dbgs() << "ARM TP: inconsistent constant tripcount values: "
		<< TC1 << " from set.loop.iterations, and "
		<< TC2 << " from get.active.lane.mask\n");
		return false;
		}
		} else {
		// Smoke tests if the element count is a runtime value. I.e., this isn't
		// fully generic because that would require a full SCEV visitor here. It
		// would require extracting the variable from the elementcount SCEV
		// expression, and match this up with the tripcount SCEV expression. If
		// this matches up, we know both expressions are bound by the same
		// variable, and thus we know this tripcount belongs to this loop. The
		// checks below will catch most cases though.
		if (isa<SCEVAddExpr>(EC) \|\| isa<SCEVUnknown>(EC)) {
		// If the element count is a simple AddExpr or SCEVUnknown, which is e.g.
		// the case when the element count is just a variable %N, we can just see
		// if it is an operand in the tripcount scev expression.
		if (isa<SCEVAddExpr>(TC) && !SE->hasOperand(TC, EC)) {
		LLVM_DEBUG(dbgs() << "ARM TP: 1Can't verify the element counter\n");
		return false;
		}
		} else if (const SCEVAddRecExpr *AddRecExpr = dyn_cast<SCEVAddRecExpr>(EC)) {
		// For more complicated AddRecExpr, check that the corresponding loop and
		// its loop hierarhy contains the trip count loop.
		if (!AddRecExpr->getLoop()->contains(L)) {
		LLVM_DEBUG(dbgs() << "ARM TP: 2Can't verify the element counter\n");
		return false;
		}
		} else {
		LLVM_DEBUG(dbgs() << "ARM TP: Unsupported SCEV type, can't verify the "
		"element counter\n");
		return false;
		}
		}

// 2) Prove that the sub expression is non-negative, i.e. it doesn't overflow:		// 2) Prove that the sub expression is non-negative, i.e. it doesn't overflow:
//		//
// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
//		//
// 2.1) First prove overflow can't happen in:		// 2.1) First prove overflow can't happen in:
//		//
// ElementCount + (VectorWidth - 1)		// ElementCount + (VectorWidth - 1)
//		//
// Because of a lack of context, it is difficult to get a useful bounds on		// Because of a lack of context, it is difficult to get a useful bounds on
// this expression. But since ElementCount uses the same variables as the		// this expression. But since ElementCount uses the same variables as the
// TripCount (TC), for which we can find meaningful value ranges, we use that		// TripCount (TC), for which we can find meaningful value ranges, we use that
// instead and assert that:		// instead and assert that:
//		//
// upperbound(TC) <= UINT_MAX - VectorWidth		// upperbound(TC) <= UINT_MAX - VectorWidth
//		//
auto *TC = SE->getSCEV(TripCount);
unsigned SizeInBits = TripCount->getType()->getScalarSizeInBits();		unsigned SizeInBits = TripCount->getType()->getScalarSizeInBits();
int VectorWidth = VecTy->getNumElements();
auto Diff = APInt(SizeInBits, ~0) - APInt(SizeInBits, VectorWidth);		auto Diff = APInt(SizeInBits, ~0) - APInt(SizeInBits, VectorWidth);
uint64_t MaxMinusVW = Diff.getZExtValue();		uint64_t MaxMinusVW = Diff.getZExtValue();
// FIXME: since ranges can be negative we work with signed ranges here, but		// FIXME: since ranges can be negative we work with signed ranges here, but
// we shouldn't extract the zext'ed values for them.		// we shouldn't extract the zext'ed values for them.
uint64_t UpperboundTC = SE->getSignedRange(TC).getUpper().getZExtValue();		uint64_t UpperboundTC = SE->getSignedRange(TC).getUpper().getZExtValue();

if (UpperboundTC > MaxMinusVW && !ForceTailPredication) {		if (UpperboundTC > MaxMinusVW && !ForceTailPredication) {
LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in tripcount rounding:\n";		LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in tripcount rounding:\n";
Show All 20 Lines	bool MVETailPredication::IsSafeActiveMask(IntrinsicInst *ActiveLaneMask,
//		//
// %5 = add nuw nsw i32 %4, 1		// %5 = add nuw nsw i32 %4, 1
// call void @llvm.set.loop.iterations.i32(i32 %5)		// call void @llvm.set.loop.iterations.i32(i32 %5)
//		//
// where %5 is some expression using %N, which needs to have a lower bound of		// where %5 is some expression using %N, which needs to have a lower bound of
// 1. Thus, if the ranges of Ceil and TC are not a single constant but a set,		// 1. Thus, if the ranges of Ceil and TC are not a single constant but a set,
// we first add 0 to TC such that we can do the <= comparison on both sets.		// we first add 0 to TC such that we can do the <= comparison on both sets.
//		//
auto *ElementCount = SE->getSCEV(ElemCountVal);
// Tmp = ElementCount + (VW-1)		// Tmp = ElementCount + (VW-1)
auto *ECPlusVWMinus1 = SE->getAddExpr(ElementCount,		auto *ECPlusVWMinus1 = SE->getAddExpr(EC,
SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth - 1)));		SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth - 1)));
// Ceil = ElementCount + (VW-1) / VW		// Ceil = ElementCount + (VW-1) / VW
auto *Ceil = SE->getUDivExpr(ECPlusVWMinus1,		auto *Ceil = SE->getUDivExpr(ECPlusVWMinus1,
SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth)));		SE->getSCEV(ConstantInt::get(TripCount->getType(), VectorWidth)));

ConstantRange RangeCeil = SE->getSignedRange(Ceil) ;		ConstantRange RangeCeil = SE->getSignedRange(Ceil) ;
ConstantRange RangeTC = SE->getSignedRange(TC) ;		ConstantRange RangeTC = SE->getSignedRange(TC) ;
if (!RangeTC.isSingleElement()) {		if (!RangeTC.isSingleElement()) {
▲ Show 20 Lines • Show All 94 Lines • ▼ Show 20 Lines	bool MVETailPredication::TryConvert(Value *TripCount) {

auto getPredicateOp = [](IntrinsicInst *I) {		auto getPredicateOp = [](IntrinsicInst *I) {
unsigned IntrinsicID = I->getIntrinsicID();		unsigned IntrinsicID = I->getIntrinsicID();
if (IntrinsicID == Intrinsic::arm_mve_vldr_gather_offset_predicated \|\|		if (IntrinsicID == Intrinsic::arm_mve_vldr_gather_offset_predicated \|\|
IntrinsicID == Intrinsic::arm_mve_vstr_scatter_offset_predicated)		IntrinsicID == Intrinsic::arm_mve_vstr_scatter_offset_predicated)
return 5;		return 5;
return (IntrinsicID == Intrinsic::masked_load \|\| isGather(I)) ? 2 : 3;		return (IntrinsicID == Intrinsic::masked_load \|\| isGather(I)) ? 2 : 3;
};		};

// Walk through the masked intrinsics and try to find whether the predicate		// Walk through the masked intrinsics and try to find whether the predicate
// operand is generated by intrinsic @llvm.get.active.lane.mask().		// operand is generated by intrinsic @llvm.get.active.lane.mask().
for (auto *I : MaskedInsts) {		for (auto *I : MaskedInsts) {
Value *PredOp = I->getArgOperand(getPredicateOp(I));		Value *PredOp = I->getArgOperand(getPredicateOp(I));
auto *Predicate = dyn_cast<Instruction>(PredOp);		auto *Predicate = dyn_cast<Instruction>(PredOp);
if (!Predicate \|\| Predicates.count(Predicate))		if (!Predicate \|\| Predicates.count(Predicate))
continue;		continue;

Show All 30 Lines

llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll

Show First 20 Lines • Show All 425 Lines • ▼ Show 20 Lines	vector.body:
%v15 = call i32 @llvm.loop.decrement.reg.i32(i32 %v6, i32 1)		%v15 = call i32 @llvm.loop.decrement.reg.i32(i32 %v6, i32 1)
%v16 = icmp ne i32 %v15, 0		%v16 = icmp ne i32 %v15, 0
br i1 %v16, label %vector.body, label %for.cond.cleanup		br i1 %v16, label %vector.body, label %for.cond.cleanup

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void
}		}

		; CHECK-LABEL: const_expected_in_set_loop
		; CHECK: call <4 x i1> @llvm.get.active.lane.mask
		; CHECK-NOT: vctp
		; CHECK: ret void
		;
		define dso_local void @const_expected_in_set_loop(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32 %N) local_unnamed_addr #0 {
		entry:
		%cmp8 = icmp sgt i32 %N, 0
		%0 = add i32 %N, 3
		%1 = lshr i32 %0, 2
		%2 = shl nuw i32 %1, 2
		%3 = add i32 %2, -4
		%4 = lshr i32 %3, 2
		%5 = add nuw nsw i32 %4, 1
		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

		vector.ph:
		call void @llvm.set.loop.iterations.i32(i32 %5)
		br label %vector.body

		vector.body: ; preds = %vector.body, %vector.ph
		%lsr.iv17 = phi i32* [ %scevgep18, %vector.body ], [ %A, %vector.ph ]
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %C, %vector.ph ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %vector.ph ]
		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		%6 = phi i32 [ %5, %vector.ph ], [ %8, %vector.body ]
		%lsr.iv13 = bitcast i32* %lsr.iv to <4 x i32>*
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1719 = bitcast i32* %lsr.iv17 to <4 x i32>*

		%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 42)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv13, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%wide.masked.load12 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%7 = add nsw <4 x i32> %wide.masked.load12, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %7, <4 x i32>* %lsr.iv1719, i32 4, <4 x i1> %active.lane.mask)
		%index.next = add i32 %index, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%scevgep18 = getelementptr i32, i32* %lsr.iv17, i32 4
		%8 = call i32 @llvm.loop.decrement.reg.i32(i32 %6, i32 1)
		%9 = icmp ne i32 %8, 0
		br i1 %9, label %vector.body, label %for.cond.cleanup

		for.cond.cleanup: ; preds = %vector.body, %entry
		ret void
		}

		; CHECK-LABEL: wrong_tripcount_arg
		; CHECK: vector.body:
		; CHECK: call <4 x i1> @llvm.arm.mve.vctp32
		; CHECK-NOT: call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32
		; CHECK: vector.body35:
		; CHECK: call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32
		; CHECK-NOT: call <4 x i1> @llvm.arm.mve.vctp32
		; CHECK: ret void
		;
		define dso_local void @wrong_tripcount_arg(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture %D, i32 %N1, i32 %N2) local_unnamed_addr #0 {
		entry:
		%cmp29 = icmp sgt i32 %N1, 0
		%0 = add i32 %N1, 3
		%1 = lshr i32 %0, 2
		%2 = shl nuw i32 %1, 2
		%3 = add i32 %2, -4
		%4 = lshr i32 %3, 2
		%5 = add nuw nsw i32 %4, 1
		br i1 %cmp29, label %vector.ph, label %for.cond4.preheader

		vector.ph: ; preds = %entry
		call void @llvm.set.loop.iterations.i32(i32 %5)
		br label %vector.body

		vector.body: ; preds = %vector.body, %vector.ph
		%lsr.iv62 = phi i32* [ %scevgep63, %vector.body ], [ %D, %vector.ph ]
		%lsr.iv59 = phi i32* [ %scevgep60, %vector.body ], [ %C, %vector.ph ]
		%lsr.iv56 = phi i32* [ %scevgep57, %vector.body ], [ %B, %vector.ph ]
		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		%6 = phi i32 [ %5, %vector.ph ], [ %8, %vector.body ]
		%lsr.iv5658 = bitcast i32* %lsr.iv56 to <4 x i32>*
		%lsr.iv5961 = bitcast i32* %lsr.iv59 to <4 x i32>*
		%lsr.iv6264 = bitcast i32* %lsr.iv62 to <4 x i32>*
		%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %N1)
		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv5658, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%wide.masked.load32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv5961, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%7 = add nsw <4 x i32> %wide.masked.load32, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %7, <4 x i32>* %lsr.iv6264, i32 4, <4 x i1> %active.lane.mask)
		%index.next = add i32 %index, 4
		%scevgep57 = getelementptr i32, i32* %lsr.iv56, i32 4
		%scevgep60 = getelementptr i32, i32* %lsr.iv59, i32 4
		%scevgep63 = getelementptr i32, i32* %lsr.iv62, i32 4
		%8 = call i32 @llvm.loop.decrement.reg.i32(i32 %6, i32 1)
		%9 = icmp ne i32 %8, 0
		br i1 %9, label %vector.body, label %for.cond4.preheader

		for.cond4.preheader: ; preds = %vector.body, %entry
		%cmp527 = icmp sgt i32 %N2, 0
		%10 = add i32 %N2, 3
		%11 = lshr i32 %10, 2
		%12 = shl nuw i32 %11, 2
		%13 = add i32 %12, -4
		%14 = lshr i32 %13, 2
		%15 = add nuw nsw i32 %14, 1
		br i1 %cmp527, label %vector.ph36, label %for.cond.cleanup6

		vector.ph36: ; preds = %for.cond4.preheader
		call void @llvm.set.loop.iterations.i32(i32 %15)
		br label %vector.body35

		vector.body35: ; preds = %vector.body35, %vector.ph36
		%lsr.iv53 = phi i32* [ %scevgep54, %vector.body35 ], [ %A, %vector.ph36 ]
		%lsr.iv50 = phi i32* [ %scevgep51, %vector.body35 ], [ %C, %vector.ph36 ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body35 ], [ %B, %vector.ph36 ]
		%index40 = phi i32 [ 0, %vector.ph36 ], [ %index.next41, %vector.body35 ]
		%16 = phi i32 [ %15, %vector.ph36 ], [ %18, %vector.body35 ]
		%lsr.iv49 = bitcast i32* %lsr.iv to <4 x i32>*
		%lsr.iv5052 = bitcast i32* %lsr.iv50 to <4 x i32>*
		%lsr.iv5355 = bitcast i32* %lsr.iv53 to <4 x i32>*

		; This has N1 as the tripcount / element count, which is the tripcount of the
		; first loop and not this one:
		%active.lane.mask46 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index40, i32 %N1)

		%wide.masked.load47 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv49, i32 4, <4 x i1> %active.lane.mask46, <4 x i32> undef)
		%wide.masked.load48 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv5052, i32 4, <4 x i1> %active.lane.mask46, <4 x i32> undef)
		%17 = add nsw <4 x i32> %wide.masked.load48, %wide.masked.load47
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %17, <4 x i32>* %lsr.iv5355, i32 4, <4 x i1> %active.lane.mask46)
		%index.next41 = add i32 %index40, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep51 = getelementptr i32, i32* %lsr.iv50, i32 4
		%scevgep54 = getelementptr i32, i32* %lsr.iv53, i32 4
		%18 = call i32 @llvm.loop.decrement.reg.i32(i32 %16, i32 1)
		%19 = icmp ne i32 %18, 0
		br i1 %19, label %vector.body35, label %for.cond.cleanup6

		for.cond.cleanup6: ; preds = %vector.body35, %for.cond4.preheader
		ret void
		}

		; CHECK-LABEL: tripcount_arg_not_invariant
		; CHECK: call <4 x i1> @llvm.get.active.lane.mask
		; CHECK-NOT: vctp
		; CHECK: ret void
		;
		define dso_local void @tripcount_arg_not_invariant(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32 %N) local_unnamed_addr #0 {
		entry:
		%cmp8 = icmp sgt i32 %N, 0
		%0 = add i32 %N, 3
		%1 = lshr i32 %0, 2
		%2 = shl nuw i32 %1, 2
		%3 = add i32 %2, -4
		%4 = lshr i32 %3, 2
		%5 = add nuw nsw i32 %4, 1
		br i1 %cmp8, label %vector.ph, label %for.cond.cleanup

		vector.ph: ; preds = %entry
		%trip.count.minus.1 = add i32 %N, -1
		call void @llvm.set.loop.iterations.i32(i32 %5)
		br label %vector.body

		vector.body: ; preds = %vector.body, %vector.ph
		%lsr.iv17 = phi i32* [ %scevgep18, %vector.body ], [ %A, %vector.ph ]
		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %C, %vector.ph ]
		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %vector.ph ]
		%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		%6 = phi i32 [ %5, %vector.ph ], [ %8, %vector.body ]

		%lsr.iv13 = bitcast i32* %lsr.iv to <4 x i32>*
		%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
		%lsr.iv1719 = bitcast i32* %lsr.iv17 to <4 x i32>*

		%active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %index)

		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv13, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%wide.masked.load12 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %active.lane.mask, <4 x i32> undef)
		%7 = add nsw <4 x i32> %wide.masked.load12, %wide.masked.load
		call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %7, <4 x i32>* %lsr.iv1719, i32 4, <4 x i1> %active.lane.mask)
		%index.next = add i32 %index, 4
		%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
		%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
		%scevgep18 = getelementptr i32, i32* %lsr.iv17, i32 4
		%8 = call i32 @llvm.loop.decrement.reg.i32(i32 %6, i32 1)
		%9 = icmp ne i32 %8, 0
		;br i1 %9, label %vector.body, label %for.cond.cleanup
		br i1 %9, label %vector.body, label %vector.ph

		for.cond.cleanup: ; preds = %vector.body, %entry
		ret void
		}

declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)		declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32 immarg, <16 x i1>, <16 x i8>)
declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)		declare void @llvm.masked.store.v16i8.p0v16i8(<16 x i8>, <16 x i8>*, i32 immarg, <16 x i1>)
declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)		declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32 immarg, <8 x i1>, <8 x i16>)
declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)		declare void @llvm.masked.store.v8i16.p0v8i16(<8 x i16>, <8 x i16>*, i32 immarg, <8 x i1>)
declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)		declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32 immarg, <4 x i1>, <4 x i32>)
declare void @llvm.masked.store.v2i64.p0v2i64(<2 x i64>, <2 x i64>*, i32 immarg, <2 x i1>)		declare void @llvm.masked.store.v2i64.p0v2i64(<2 x i64>, <2 x i64>*, i32 immarg, <2 x i1>)
declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32 immarg, <2 x i1>, <2 x i64>)		declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32 immarg, <2 x i1>, <2 x i64>)
declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)		declare void @llvm.masked.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, i32 immarg, <4 x i1>)
declare void @llvm.set.loop.iterations.i32(i32)		declare void @llvm.set.loop.iterations.i32(i32)
declare i32 @llvm.loop.decrement.reg.i32(i32, i32)		declare i32 @llvm.loop.decrement.reg.i32(i32, i32)
declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32, i32)
declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32, i32)
declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32, i32)		declare <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32, i32)

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll

Show First 20 Lines • Show All 259 Lines • ▼ Show 20 Lines	; @llvm.get.active.lane.mask, but let's keep this test as a sanity check:
%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)		%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
%4 = icmp ne i32 %3, 0		%4 = icmp ne i32 %3, 0
br i1 %4, label %vector.body, label %for.cond.cleanup		br i1 %4, label %vector.body, label %for.cond.cleanup

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void
}		}

; CHECK-LABEL: @overflow_BTC_plus_1(		; CHECK-LABEL: @inconsistent_tripcounts(
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.arm.mve.vctp32		; CHECK-NOT: @llvm.arm.mve.vctp32
; CHECK: @llvm.get.active.lane.mask		; CHECK: @llvm.get.active.lane.mask
; CHECK: ret void		; CHECK: ret void
;		;
define dso_local void @overflow_BTC_plus_1(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {		define dso_local void @inconsistent_tripcounts(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
entry:		entry:
call void @llvm.set.loop.iterations.i32(i32 8001)		call void @llvm.set.loop.iterations.i32(i32 8001)
br label %vector.body		br label %vector.body

vector.body:		vector.body:
%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
Show All 28 Lines
; CHECK-LABEL: @overflow_in_sub(		; CHECK-LABEL: @overflow_in_sub(
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.arm.mve.vctp32		; CHECK-NOT: @llvm.arm.mve.vctp32
; CHECK: @llvm.get.active.lane.mask		; CHECK: @llvm.get.active.lane.mask
; CHECK: ret void		; CHECK: ret void
;		;
define dso_local void @overflow_in_sub(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {		define dso_local void @overflow_in_sub(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
entry:		entry:
call void @llvm.set.loop.iterations.i32(i32 8001)		call void @llvm.set.loop.iterations.i32(i32 1073741824)
br label %vector.body

vector.body:
%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
%lsr.iv1416 = bitcast i32* %lsr.iv14 to <4 x i32>*
%lsr.iv1113 = bitcast i32* %lsr.iv11 to <4 x i32>*
%lsr.iv10 = bitcast i32* %lsr.iv to <4 x i32>*
%broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
%broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>

; Overflow in the subtraction. This should hold:
;
; ceil(ElementCount / VectorWidth) >= TripCount
;
; But we have:
;
; ceil(32000 / 4) >= 8001
; 8000 >= 8001
;
%1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 31999)

%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv10, i32 4, <4 x i1> %1, <4 x i32> undef)
%wide.masked.load9 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %lsr.iv1113, i32 4, <4 x i1> %1, <4 x i32> undef)
%2 = add nsw <4 x i32> %wide.masked.load9, %wide.masked.load
call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %2, <4 x i32>* %lsr.iv1416, i32 4, <4 x i1> %1)
%index.next = add i32 %index, 4
%scevgep = getelementptr i32, i32* %lsr.iv, i32 4
%scevgep12 = getelementptr i32, i32* %lsr.iv11, i32 4
%scevgep15 = getelementptr i32, i32* %lsr.iv14, i32 4
%3 = call i32 @llvm.loop.decrement.reg.i32(i32 %0, i32 1)
%4 = icmp ne i32 %3, 0
br i1 %4, label %vector.body, label %for.cond.cleanup

for.cond.cleanup:
ret void
}
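
The comment in @overflow_in_sub states the condition `ceil(ElementCount / VectorWidth) >= TripCount`. A minimal standalone sketch of that arithmetic follows. It is not the code in MVETailPredication.cpp; the helper names `ceilDiv` and `elementCountCoversTripCount` are hypothetical, and the element count of 32000 is an assumption derived from the `8000 >= 8001` line in the test comment (the intrinsic's second argument, 31999, plus one).

```cpp
#include <cstdint>

// Hypothetical helper: ceil(ElementCount / VectorWidth), written so that the
// rounding step itself cannot overflow (no "ElementCount + VectorWidth - 1").
constexpr uint64_t ceilDiv(uint64_t ElementCount, uint64_t VectorWidth) {
  return ElementCount / VectorWidth + (ElementCount % VectorWidth != 0 ? 1 : 0);
}

// Hypothetical helper: the condition quoted in the test comment,
//   ceil(ElementCount / VectorWidth) >= TripCount
constexpr bool elementCountCoversTripCount(uint64_t ElementCount,
                                           uint64_t VectorWidth,
                                           uint64_t TripCount) {
  return ceilDiv(ElementCount, VectorWidth) >= TripCount;
}

// Values from @overflow_in_sub: set.loop.iterations is 8001, the vector width
// is 4, and the element count is assumed to be 32000.
static_assert(ceilDiv(32000, 4) == 8000, "ceil(32000 / 4) is 8000");
static_assert(!elementCountCoversTripCount(32000, 4, 8001),
              "8000 >= 8001 does not hold, so the check fails");
```

With these values the condition is false, which matches the test's expectation that no `vctp32` is emitted and the `get.active.lane.mask` call is left in place.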

; CHECK-LABEL: @overflow_in_rounding_tripcount(
; CHECK: vector.body:
; CHECK-NOT: @llvm.arm.mve.vctp32
; CHECK: @llvm.get.active.lane.mask
; CHECK: ret void
;
define dso_local void @overflow_in_rounding_tripcount(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32* noalias nocapture readnone %D, i32 %N) local_unnamed_addr #0 {
entry:

; TC = 4294967292
; 4294967292 <= 4294967291 (MAX - vectorwidth)
; False
;
call void @llvm.set.loop.iterations.i32(i32 4294967291)
br label %vector.body		br label %vector.body

vector.body:		vector.body:
%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]		%lsr.iv14 = phi i32* [ %scevgep15, %vector.body ], [ %A, %entry ]
%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]		%lsr.iv11 = phi i32* [ %scevgep12, %vector.body ], [ %C, %entry ]
%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]		%lsr.iv = phi i32* [ %scevgep, %vector.body ], [ %B, %entry ]
%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]		%index = phi i32 [ 0, %entry ], [ %index.next, %vector.body ]
%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]		%0 = phi i32 [ 8001, %entry ], [ %3, %vector.body ]
▲ Show 20 Lines • Show All 223 Lines • Show Last 20 Lines