Details

Reviewers

sdesmalen
fhahn
bmahjour
spatel
SjoerdMeijer
RKSimon

Commits

rG4979c9045862: [LV] Account for tripcount when calculation vectorization profitability

Summary

The loop vectorizer currently assumes a large trip count when deciding which of several vectorization factors is more profitable. That is often a reasonable assumption, as loops with small trip counts will usually have been fully unrolled. There are cases, however, where we still try to vectorize them. Especially when folding the tail by masking, we can incorrectly choose to vectorize loops where it is not beneficial, because the folded tail rounds the iteration count up for the vectorized loop.

The motivating example here has a trip count of 5, so it performs either 5 scalar iterations or 2 vector iterations (with VF=4). At a high enough trip count the vectorization becomes profitable, but here the rounding up to 2 vector iterations versus only 5 scalar iterations makes it unprofitable.

This adds an alternative cost calculation when we know the max trip count and are folding tail by masking, rounding the iteration count up to the correct number for the vector width. We still do not account for anything like setup cost or the mixture of vector and scalar loops, but this is at least an improvement in a few cases that we have had reported.
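
The arithmetic behind the motivating example can be sketched as follows. This is a minimal illustration and not the patch itself: the helper name `foldedLoopCost` and the per-iteration costs (10 for a scalar iteration, 30 for a VF=4 iteration) are made-up assumptions, chosen only to show the effect of the rounding.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): total cost of a tail-folded loop.
// Every started vector iteration is paid for in full, so the number of
// iterations is ceil(TripCount / VF).
int64_t foldedLoopCost(int64_t PerIterCost, uint64_t TripCount, uint64_t VF) {
  assert(VF != 0 && "vectorization factor must be non-zero");
  uint64_t Iterations = (TripCount + VF - 1) / VF; // integer ceiling division
  return PerIterCost * static_cast<int64_t>(Iterations);
}

// Assumed, purely illustrative per-iteration costs.
const int64_t ScalarIterCost = 10; // cost of one scalar iteration
const int64_t VectorIterCost = 30; // cost of one VF=4 iteration
```

With these assumed costs, a trip count of 5 gives 50 for the scalar loop and 60 for the VF=4 loop, while a trip count of 8 gives 80 versus 60. The per-lane view (30/4 = 7.5 versus 10) would prefer the vector loop in both cases.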

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. May 2 2021, 1:10 PM

Herald added a subscriber: hiraditya. May 2 2021, 1:10 PM

dmgreen requested review of this revision. May 2 2021, 1:10 PM

Herald added a project: Restricted Project. May 2 2021, 1:10 PM

Herald added a subscriber: llvm-commits.

nikic added a subscriber: nikic. May 2 2021, 1:20 PM

nikic added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	There is divideCeil() in MathExtras.

Harbormaster completed remote builds in B102193: Diff 342259. May 2 2021, 1:58 PM

bmahjour added inline comments. May 3 2021, 2:52 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	but the count is the scalar trip count while the width is the vectorization factor....it seems odd that this calculation combines the two concepts together. Also if the trip count is a large constant, wouldn't the ceiling cause the VF to be effectively ignored? Perhaps you can achieve the desired outcome by adjusting the cost elsewhere (say in `expectedCost`) and preferably with a target hook.
5982	For this inequality to compare the per-lane costs, CostA above should be multiplied by ceil(...,B.Width) and CostB should be multiplied by ceil(...,A.Width).

dmgreen added inline comments. May 4 2021, 12:51 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	Oh excellent, thanks. I will update the other places I stole this code from too :-)
5978	I'm not sure I understand what you mean. Can you explain in more detail? This doesn't sound like something that should be target dependent. Let's say the trip count is 5. CostA will be the cost of a single iteration with a vector factor of A.Width. If VF is 4 then the total cost of the loop vectorized 4x becomes CostA*ceil(5/4). The rounding up is due to us folding the tail and so executing 2 iterations in this case [*]. If B is scalar then the total scalar cost will be CostB*ceil(5/1) = CostB*5. You then compare the two final costs to find the smallest, it being expected to be the most profitable. So long as we are working in int64_t and the tripcount/cost are int32, this generalizes for any known trip count without rounding/overflow, I believe. ([*] I thought about adding the non-FoldTailByMasking case too, but there the total cost is more like CostVec*floor(TripCount/VF) + CostScalar*(TripCount%VF) + SomeOverhead, which is more difficult to work out correctly. The existing CostVec/VF is at least an approximation of that.)
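
For comparison, the non-tail-folded estimate mentioned in the footnote above might look like the sketch below. The function name `unfoldedLoopCost` is hypothetical, and `SomeOverhead` is left as a plain parameter because the comment gives no way to compute it. This is an illustration of the formula, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): estimate for a loop vectorized
// without tail folding, following the formula from the review comment:
//   CostVec*floor(TripCount/VF) + CostScalar*(TripCount%VF) + SomeOverhead
int64_t unfoldedLoopCost(int64_t CostVec, int64_t CostScalar,
                         uint64_t TripCount, uint64_t VF,
                         int64_t SomeOverhead) {
  assert(VF != 0 && "vectorization factor must be non-zero");
  int64_t VectorIters = static_cast<int64_t>(TripCount / VF); // full vector iterations
  int64_t ScalarIters = static_cast<int64_t>(TripCount % VF); // scalar remainder
  return CostVec * VectorIters + CostScalar * ScalarIters + SomeOverhead;
}
```

For TripCount=5 and VF=4, with assumed costs of 30 per vector iteration, 10 per scalar iteration and zero overhead, this gives 30*1 + 10*1 = 40. The difficult part, as the comment notes, is picking a meaningful value for the overhead term.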

Use divideCeil

dmgreen added inline comments. May 4 2021, 12:54 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	Reading this again, perhaps the confusion comes from the old 'ceil' not being some form of fp ceil, but ceil(A/B)? It wasn't the most descriptive name for a function.
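
To make the distinction concrete, here is a small sketch of an integer ceiling division next to a floating-point one. `ceilDivide` is a hypothetical stand-in written for this example; it is not the implementation of `divideCeil` in MathExtras, although it is meant to compute the same quantity.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for an integer ceil(Numerator / Denominator).
// Written as 1 + (N - 1) / D so that it cannot wrap for large N.
uint64_t ceilDivide(uint64_t Numerator, uint64_t Denominator) {
  assert(Denominator != 0 && "division by zero");
  return Numerator == 0 ? 0 : 1 + (Numerator - 1) / Denominator;
}

// The floating-point reading of "ceil": divide as doubles, then round up.
// Shown only for contrast; it loses precision for very large operands.
uint64_t ceilViaFloatingPoint(uint64_t Numerator, uint64_t Denominator) {
  double Quotient =
      static_cast<double>(Numerator) / static_cast<double>(Denominator);
  return static_cast<uint64_t>(std::ceil(Quotient));
}
```

Plain integer division truncates (5 / 4 is 1), whereas both helpers return 2 for the trip count of 5 and VF of 4 discussed here.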

dmgreen mentioned this in rG18883a3fec5a: [TTI] Replace ceil lambdas with divideCeil. NFCI. May 4 2021, 1:05 AM

Harbormaster completed remote builds in B102477: Diff 342665. May 4 2021, 1:34 AM

bmahjour added inline comments. May 4 2021, 10:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	Yes, sorry I saw the lambda `ceil` and didn't notice that it was doing a divide. My main concern was that the costs were not being normalized by the VF for larger trip counts, but now I see that they are, it's just that the trip count is rounded up and multiplied on both sides of the inequality. Regarding overflow, I believe it's possible because `InstructionCost` uses `int`...so you might want to change it to `int64_t`. Regarding target dependency, I'm wondering whether the motivating example is concerned with throughput or latency. The CostA and CostB above are mostly modeling throughput, so if the concern is latency, don't we need to take it into account in a target dependent way?

sdesmalen added inline comments. May 4 2021, 1:29 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	I think this only makes sense if both factors are fixed-width VFs? If so, please add this as a condition and use `getFixedValue()` in the cost-calculation.
5979	should this be `B` (and the one below be `A`)?

RKSimon resigned from this revision. May 4 2021, 2:33 PM

RKSimon added a subscriber: RKSimon.

dmgreen added inline comments. May 5 2021, 2:11 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	Hmm. Sure. Sounds good! I think we may need something similar for scalable vectors too, eventually. They will run into the same issue with low trip-count loops. It will just not be as obvious what the actual vector width is.
5978	I would expect CostA to be positive and (relatively) small. MaxTripCount can easily be something close to UINT_MAX. Width will obviously be a low power of 2. Unsigned felt like the natural type, but changing it to signed sounds fine too. My motivating case is... concerned with reciprocal throughput.. or that is at least close enough to it. Costing instructions isn't always simple, even on simple architectures. I don't believe it's any different to the existing code below, even if they are both a bit of an approximation.
5979	I'm not sure I follow why. Can you give some more details on which part would be B/A?

Change to int64_t and guard against scalable vectors.

Harbormaster completed remote builds in B102687: Diff 342967. May 5 2021, 3:02 AM

sdesmalen added inline comments. May 5 2021, 6:34 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	isScalable?
5974	> I think we may need something similar for scalable vectors too, eventually. They will run into the same issue with low trip-count loops. It will just not be as obvious what the actual vector width is.
	We may be able to use knowledge about the scalable vectors' runtime width from the vscale_range attribute. When we know nothing about the runtime VF, then I'm not sure if we can make any sensible decisions.
5978	nit: PerVectorIterCost
5979	Sorry, please ignore that comment. You're not calculating the "cost per lane" (like we do below, which switches B/A), but rather calculating the total cost for handling TC scalar iterations by doing ceil(TC/VF) vector iterations.
5980	nit: PerVectorIterCost
5984	nit: `getFixedValue` (here and below)
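
The points raised in these comments (a fixed-width guard, and total cost versus cost per lane) can be put together in a standalone sketch. Everything here is a simplified assumption for illustration: `Factor` is a hypothetical stand-in for the real `VectorizationFactor`, its field names are invented, and `isMoreProfitableSketch` only mirrors the shape of the logic under review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a vectorization factor; field names are invented.
struct Factor {
  unsigned MinLanes; // known minimum number of lanes
  bool Scalable;     // true if the real lane count is MinLanes * vscale
  int64_t Cost;      // cost of one loop iteration at this factor
};

// Returns true if A is estimated to be cheaper than B.
bool isMoreProfitableSketch(const Factor &A, const Factor &B,
                            bool FoldTailByMasking, unsigned MaxTripCount) {
  assert(A.MinLanes != 0 && B.MinLanes != 0);
  // Total-cost path: only meaningful when both lane counts are exactly known,
  // the tail is folded, and a non-zero maximum trip count is known.
  if (!A.Scalable && !B.Scalable && FoldTailByMasking && MaxTripCount) {
    auto Iterations = [MaxTripCount](unsigned Lanes) {
      return (static_cast<int64_t>(MaxTripCount) + Lanes - 1) / Lanes;
    };
    return A.Cost * Iterations(A.MinLanes) < B.Cost * Iterations(B.MinLanes);
  }
  // Per-lane path: (CostA / WidthA) < (CostB / WidthB), cross-multiplied
  // to avoid floating-point division. Note that A and B swap sides here.
  return A.Cost * B.MinLanes < B.Cost * A.MinLanes;
}
```

With assumed costs of 30 for a VF=4 iteration and 10 for a scalar iteration, the total-cost path rejects VF=4 at a trip count of 5 and accepts it at 8, while the per-lane path accepts it regardless of trip count.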

dmgreen marked 3 inline comments as done. May 5 2021, 7:29 AM

dmgreen added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5974	Doh!
5974	Yep, but comparing a non-scalable VF and a scalable VF will be wrong whichever method we choose, unless the scalable factor happens to be 1. I always imagined that the backend TTI would be telling the vectorizer the correct vscale to use if it was known from -mcpu or guess at a likely one if not (which would probably be 1 or 2 at the moment).
5978	These ones can be scalar too.

Fix typos and whatnot.

Thanks, LGTM!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	nit: PerVFIterCost in that case?

This revision is now accepted and ready to land. May 5 2021, 7:39 AM

bmahjour added inline comments. May 5 2021, 8:14 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	Sorry if I wasn't clear, but I wasn't asking to change the type of `RTCostA` and `RTCostB` to signed int64_t. I was asking to change the underlying cost type inside `InstructionCost` from `int` to `int64_t`. Without that change, it's possible to cause an overflow in expressions like `= CostA * ceil(...)`. The other possible fix would be to enforce an upper bound on the value of `MaxTripCount` where this heuristic is being applied, but I think that's less desirable.

Harbormaster completed remote builds in B102741: Diff 343041. May 5 2021, 8:27 AM

dmgreen added inline comments. May 5 2021, 9:47 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	Hmm.. divideCeil takes and returns uint64_t's. MaxTripCount will be unsigned so we know (with reasonable assumptions for the size of unsigned (?)) that divideCeil(MaxTripCount, Width) will be less than or equal to UINT_MAX. CostA will be an int. So I'm not sure how int32_t * (uint64_t)uint32_t would then overflow, especially if CostA will be much less than 2^32 (usually much less than 2^16). My understanding is that it would just convert the multiply to an i64 and would be happy enough.

bmahjour added inline comments. May 5 2021, 1:30 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5978	> Hmm.. divideCeil takes and returns uint64_t's. MaxTripCount will be unsigned so we know (with reasonable assumptions for the size of unsigned (?)) that divideCeil(MaxTripCount, Width) will be less than or equal to UINT_MAX. CostA will be an int. So I'm not sure how int32_t * (uint64_t)uint32_t would then overflow, especially if CostA will be much less than 2^32 (usually much less than 2^16). My understanding is that it would just convert the multiply to an i64 and would be happy enough.
	My bad. I thought `CostA` was of type `InstructionCost` and didn't notice it was actually `InstructionCost::CostType`. If it were we would have had an overflow issue due to implicit conversion to `InstructionCost`, but that's not the case.
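
The integer-conversion argument in this exchange can be checked with a small standalone sketch. It assumes a 32-bit `unsigned` and a `uint64_t` whose conversion rank is above `int`, which holds on common platforms but is an assumption here. `totalCost` is a hypothetical name and not a function from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Under the usual arithmetic conversions, int * uint64_t is evaluated as
// uint64_t, so the multiplication is performed in 64 bits.
static_assert(std::is_same<decltype(int{} * uint64_t{}), uint64_t>::value,
              "int * uint64_t should be computed as uint64_t");

// Hypothetical helper mirroring the shape of `CostA * divideCeil(...)`.
int64_t totalCost(int PerIterCost, unsigned MaxTripCount, unsigned Width) {
  // A negative cost would be converted to a huge unsigned value, so this
  // sketch only accepts non-negative costs.
  assert(PerIterCost >= 0 && Width != 0);
  // Widen first so that MaxTripCount + Width - 1 cannot wrap in 32 bits.
  uint64_t Iterations =
      (static_cast<uint64_t>(MaxTripCount) + Width - 1) / Width;
  return static_cast<int64_t>(PerIterCost * Iterations);
}
```

Even with a cost just below 2^16 and the largest 32-bit trip count at Width=1, the product stays below 2^48, comfortably inside `int64_t`.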

Closed by commit rG4979c9045862: [LV] Account for tripcount when calculation vectorization profitability (authored by dmgreen). May 6 2021, 4:37 AM

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG4979c9045862: [LV] Account for tripcount when calculation vectorization profitability.

dmgreen mentioned this in D115713: [LV] Don't apply "TinyTripCountVectorThreshold" for loops with compile time known TC. Oct 10 2022, 12:04 AM

dmgreen mentioned this in D147720: [LV] Use the known trip count when costing non-tail folded VFs. Apr 6 2023, 9:29 AM

dmgreen mentioned this in rG1869a9c225c7: [LV] Use the known trip count when costing non-tail folded VFs. Apr 24 2023, 2:02 PM

Diff 343359

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 5,963 Lines • ▼ Show 20 Lines	ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
		return MaxVF;
		}

		bool LoopVectorizationCostModel::isMoreProfitable(
		const VectorizationFactor &A, const VectorizationFactor &B) const {
		InstructionCost::CostType CostA = *A.Cost.getValue();
		InstructionCost::CostType CostB = *B.Cost.getValue();

		unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);

		if (!A.Width.isScalable() && !B.Width.isScalable() && FoldTailByMasking &&
		MaxTripCount) {
		// If we are folding the tail and the trip count is a known (possibly small)
		// constant, the trip count will be rounded up to an integer number of
		// iterations. The total cost will be PerIterationCost*ceil(TripCount/VF),
		// which we compare directly. When not folding the tail, the total cost will
		// be PerIterationCost*floor(TC/VF) + Scalar remainder cost, and so is
		// approximated with the per-lane cost below instead of using the tripcount
		// as here.
		int64_t RTCostA = CostA * divideCeil(MaxTripCount, A.Width.getFixedValue());
		int64_t RTCostB = CostB * divideCeil(MaxTripCount, B.Width.getFixedValue());
		return RTCostA < RTCostB;
		}

		// To avoid the need for FP division:
		// (CostA / A.Width) < (CostB / B.Width)
		// <=> (CostA * B.Width) < (CostB * A.Width)
		return (CostA * B.Width.getKnownMinValue()) <
		(CostB * A.Width.getKnownMinValue());
		}

		VectorizationFactor

llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll

This file was added.

				; RUN: opt -loop-vectorize -debug-only=loop-vectorize -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts

				target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv8.1m.main-arm-none-eabi"

				; Trip count of 5 - shouldn't be vectorized.
				; CHECK-LABEL: tripcount5
				; CHECK: LV: Selecting VF: 1
				define void @tripcount5(i16* nocapture readonly %in, i32* nocapture %out, i16* nocapture readonly %consts, i32 %n) #0 {
				entry:
				%arrayidx20 = getelementptr inbounds i32, i32* %out, i32 1
				%arrayidx38 = getelementptr inbounds i32, i32* %out, i32 2
				%arrayidx56 = getelementptr inbounds i32, i32* %out, i32 3
				%arrayidx74 = getelementptr inbounds i32, i32* %out, i32 4
				%arrayidx92 = getelementptr inbounds i32, i32* %out, i32 5
				%arrayidx110 = getelementptr inbounds i32, i32* %out, i32 6
				%arrayidx128 = getelementptr inbounds i32, i32* %out, i32 7
				%out.promoted = load i32, i32* %out, align 4
				%arrayidx20.promoted = load i32, i32* %arrayidx20, align 4
				%arrayidx38.promoted = load i32, i32* %arrayidx38, align 4
				%arrayidx56.promoted = load i32, i32* %arrayidx56, align 4
				%arrayidx74.promoted = load i32, i32* %arrayidx74, align 4
				%arrayidx92.promoted = load i32, i32* %arrayidx92, align 4
				%arrayidx110.promoted = load i32, i32* %arrayidx110, align 4
				%arrayidx128.promoted = load i32, i32* %arrayidx128, align 4
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				store i32 %add12, i32* %out, align 4
				store i32 %add30, i32* %arrayidx20, align 4
				store i32 %add48, i32* %arrayidx38, align 4
				store i32 %add66, i32* %arrayidx56, align 4
				store i32 %add84, i32* %arrayidx74, align 4
				store i32 %add102, i32* %arrayidx92, align 4
				store i32 %add120, i32* %arrayidx110, align 4
				store i32 %add138, i32* %arrayidx128, align 4
				ret void

				for.body: ; preds = %entry, %for.body
				%hop.0236 = phi i32 [ 0, %entry ], [ %add139, %for.body ]
				%add12220235 = phi i32 [ %out.promoted, %entry ], [ %add12, %for.body ]
				%add30221234 = phi i32 [ %arrayidx20.promoted, %entry ], [ %add30, %for.body ]
				%add48222233 = phi i32 [ %arrayidx38.promoted, %entry ], [ %add48, %for.body ]
				%add66223232 = phi i32 [ %arrayidx56.promoted, %entry ], [ %add66, %for.body ]
				%add84224231 = phi i32 [ %arrayidx74.promoted, %entry ], [ %add84, %for.body ]
				%add102225230 = phi i32 [ %arrayidx92.promoted, %entry ], [ %add102, %for.body ]
				%add120226229 = phi i32 [ %arrayidx110.promoted, %entry ], [ %add120, %for.body ]
				%add138227228 = phi i32 [ %arrayidx128.promoted, %entry ], [ %add138, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %in, i32 %hop.0236
				%0 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %0 to i32
				%arrayidx1 = getelementptr inbounds i16, i16* %consts, i32 %hop.0236
				%1 = load i16, i16* %arrayidx1, align 2
				%conv2 = sext i16 %1 to i32
				%mul = mul nsw i32 %conv2, %conv
				%add = add nsw i32 %mul, %add12220235
				%add4 = or i32 %hop.0236, 1
				%arrayidx5 = getelementptr inbounds i16, i16* %in, i32 %add4
				%2 = load i16, i16* %arrayidx5, align 2
				%conv6 = sext i16 %2 to i32
				%arrayidx8 = getelementptr inbounds i16, i16* %consts, i32 %add4
				%3 = load i16, i16* %arrayidx8, align 2
				%conv9 = sext i16 %3 to i32
				%mul10 = mul nsw i32 %conv9, %conv6
				%add12 = add nsw i32 %mul10, %add
				%add13 = or i32 %hop.0236, 2
				%arrayidx14 = getelementptr inbounds i16, i16* %in, i32 %add13
				%4 = load i16, i16* %arrayidx14, align 2
				%conv15 = sext i16 %4 to i32
				%arrayidx17 = getelementptr inbounds i16, i16* %consts, i32 %add13
				%5 = load i16, i16* %arrayidx17, align 2
				%conv18 = sext i16 %5 to i32
				%mul19 = mul nsw i32 %conv18, %conv15
				%add21 = add nsw i32 %mul19, %add30221234
				%add22 = or i32 %hop.0236, 3
				%arrayidx23 = getelementptr inbounds i16, i16* %in, i32 %add22
				%6 = load i16, i16* %arrayidx23, align 2
				%conv24 = sext i16 %6 to i32
				%arrayidx26 = getelementptr inbounds i16, i16* %consts, i32 %add22
				%7 = load i16, i16* %arrayidx26, align 2
				%conv27 = sext i16 %7 to i32
				%mul28 = mul nsw i32 %conv27, %conv24
				%add30 = add nsw i32 %mul28, %add21
				%add31 = or i32 %hop.0236, 4
				%arrayidx32 = getelementptr inbounds i16, i16* %in, i32 %add31
				%8 = load i16, i16* %arrayidx32, align 2
				%conv33 = sext i16 %8 to i32
				%arrayidx35 = getelementptr inbounds i16, i16* %consts, i32 %add31
				%9 = load i16, i16* %arrayidx35, align 2
				%conv36 = sext i16 %9 to i32
				%mul37 = mul nsw i32 %conv36, %conv33
				%add39 = add nsw i32 %mul37, %add48222233
				%add40 = or i32 %hop.0236, 5
				%arrayidx41 = getelementptr inbounds i16, i16* %in, i32 %add40
				%10 = load i16, i16* %arrayidx41, align 2
				%conv42 = sext i16 %10 to i32
				%arrayidx44 = getelementptr inbounds i16, i16* %consts, i32 %add40
				%11 = load i16, i16* %arrayidx44, align 2
				%conv45 = sext i16 %11 to i32
				%mul46 = mul nsw i32 %conv45, %conv42
				%add48 = add nsw i32 %mul46, %add39
				%add49 = or i32 %hop.0236, 6
				%arrayidx50 = getelementptr inbounds i16, i16* %in, i32 %add49
				%12 = load i16, i16* %arrayidx50, align 2
				%conv51 = sext i16 %12 to i32
				%arrayidx53 = getelementptr inbounds i16, i16* %consts, i32 %add49
				%13 = load i16, i16* %arrayidx53, align 2
				%conv54 = sext i16 %13 to i32
				%mul55 = mul nsw i32 %conv54, %conv51
				%add57 = add nsw i32 %mul55, %add66223232
				%add58 = or i32 %hop.0236, 7
				%arrayidx59 = getelementptr inbounds i16, i16* %in, i32 %add58
				%14 = load i16, i16* %arrayidx59, align 2
				%conv60 = sext i16 %14 to i32
				%arrayidx62 = getelementptr inbounds i16, i16* %consts, i32 %add58
				%15 = load i16, i16* %arrayidx62, align 2
				%conv63 = sext i16 %15 to i32
				%mul64 = mul nsw i32 %conv63, %conv60
				%add66 = add nsw i32 %mul64, %add57
				%add67 = or i32 %hop.0236, 8
				%arrayidx68 = getelementptr inbounds i16, i16* %in, i32 %add67
				%16 = load i16, i16* %arrayidx68, align 2
				%conv69 = sext i16 %16 to i32
				%arrayidx71 = getelementptr inbounds i16, i16* %consts, i32 %add67
				%17 = load i16, i16* %arrayidx71, align 2
				%conv72 = sext i16 %17 to i32
				%mul73 = mul nsw i32 %conv72, %conv69
				%add75 = add nsw i32 %mul73, %add84224231
				%add76 = or i32 %hop.0236, 9
				%arrayidx77 = getelementptr inbounds i16, i16* %in, i32 %add76
				%18 = load i16, i16* %arrayidx77, align 2
				%conv78 = sext i16 %18 to i32
				%arrayidx80 = getelementptr inbounds i16, i16* %consts, i32 %add76
				%19 = load i16, i16* %arrayidx80, align 2
				%conv81 = sext i16 %19 to i32
				%mul82 = mul nsw i32 %conv81, %conv78
				%add84 = add nsw i32 %mul82, %add75
				%add85 = or i32 %hop.0236, 10
				%arrayidx86 = getelementptr inbounds i16, i16* %in, i32 %add85
				%20 = load i16, i16* %arrayidx86, align 2
				%conv87 = sext i16 %20 to i32
				%arrayidx89 = getelementptr inbounds i16, i16* %consts, i32 %add85
				%21 = load i16, i16* %arrayidx89, align 2
				%conv90 = sext i16 %21 to i32
				%mul91 = mul nsw i32 %conv90, %conv87
				%add93 = add nsw i32 %mul91, %add102225230
				%add94 = or i32 %hop.0236, 11
				%arrayidx95 = getelementptr inbounds i16, i16* %in, i32 %add94
				%22 = load i16, i16* %arrayidx95, align 2
				%conv96 = sext i16 %22 to i32
				%arrayidx98 = getelementptr inbounds i16, i16* %consts, i32 %add94
				%23 = load i16, i16* %arrayidx98, align 2
				%conv99 = sext i16 %23 to i32
				%mul100 = mul nsw i32 %conv99, %conv96
				%add102 = add nsw i32 %mul100, %add93
				%add103 = or i32 %hop.0236, 12
				%arrayidx104 = getelementptr inbounds i16, i16* %in, i32 %add103
				%24 = load i16, i16* %arrayidx104, align 2
				%conv105 = sext i16 %24 to i32
				%arrayidx107 = getelementptr inbounds i16, i16* %consts, i32 %add103
				%25 = load i16, i16* %arrayidx107, align 2
				%conv108 = sext i16 %25 to i32
				%mul109 = mul nsw i32 %conv108, %conv105
				%add111 = add nsw i32 %mul109, %add120226229
				%add112 = or i32 %hop.0236, 13
				%arrayidx113 = getelementptr inbounds i16, i16* %in, i32 %add112
				%26 = load i16, i16* %arrayidx113, align 2
				%conv114 = sext i16 %26 to i32
				%arrayidx116 = getelementptr inbounds i16, i16* %consts, i32 %add112
				%27 = load i16, i16* %arrayidx116, align 2
				%conv117 = sext i16 %27 to i32
				%mul118 = mul nsw i32 %conv117, %conv114
				%add120 = add nsw i32 %mul118, %add111
				%add121 = or i32 %hop.0236, 14
				%arrayidx122 = getelementptr inbounds i16, i16* %in, i32 %add121
				%28 = load i16, i16* %arrayidx122, align 2
				%conv123 = sext i16 %28 to i32
				%arrayidx125 = getelementptr inbounds i16, i16* %consts, i32 %add121
				%29 = load i16, i16* %arrayidx125, align 2
				%conv126 = sext i16 %29 to i32
				%mul127 = mul nsw i32 %conv126, %conv123
				%add129 = add nsw i32 %mul127, %add138227228
				%add130 = or i32 %hop.0236, 15
				%arrayidx131 = getelementptr inbounds i16, i16* %in, i32 %add130
				%30 = load i16, i16* %arrayidx131, align 2
				%conv132 = sext i16 %30 to i32
				%arrayidx134 = getelementptr inbounds i16, i16* %consts, i32 %add130
				%31 = load i16, i16* %arrayidx134, align 2
				%conv135 = sext i16 %31 to i32
				%mul136 = mul nsw i32 %conv135, %conv132
				%add138 = add nsw i32 %mul136, %add129
				%add139 = add nuw nsw i32 %hop.0236, 16
				%cmp = icmp ult i32 %hop.0236, 64
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Trip count of 8 (%hop.0236 steps by 16 from 0 through 112) - does get vectorized
				; CHECK-LABEL: tripcount8
				; CHECK: LV: Selecting VF: 4
				define void @tripcount8(i16* nocapture readonly %in, i32* nocapture %out, i16* nocapture readonly %consts, i32 %n) #0 {
				entry:
				%arrayidx20 = getelementptr inbounds i32, i32* %out, i32 1
				%arrayidx38 = getelementptr inbounds i32, i32* %out, i32 2
				%arrayidx56 = getelementptr inbounds i32, i32* %out, i32 3
				%arrayidx74 = getelementptr inbounds i32, i32* %out, i32 4
				%arrayidx92 = getelementptr inbounds i32, i32* %out, i32 5
				%arrayidx110 = getelementptr inbounds i32, i32* %out, i32 6
				%arrayidx128 = getelementptr inbounds i32, i32* %out, i32 7
				%out.promoted = load i32, i32* %out, align 4
				%arrayidx20.promoted = load i32, i32* %arrayidx20, align 4
				%arrayidx38.promoted = load i32, i32* %arrayidx38, align 4
				%arrayidx56.promoted = load i32, i32* %arrayidx56, align 4
				%arrayidx74.promoted = load i32, i32* %arrayidx74, align 4
				%arrayidx92.promoted = load i32, i32* %arrayidx92, align 4
				%arrayidx110.promoted = load i32, i32* %arrayidx110, align 4
				%arrayidx128.promoted = load i32, i32* %arrayidx128, align 4
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				store i32 %add12, i32* %out, align 4
				store i32 %add30, i32* %arrayidx20, align 4
				store i32 %add48, i32* %arrayidx38, align 4
				store i32 %add66, i32* %arrayidx56, align 4
				store i32 %add84, i32* %arrayidx74, align 4
				store i32 %add102, i32* %arrayidx92, align 4
				store i32 %add120, i32* %arrayidx110, align 4
				store i32 %add138, i32* %arrayidx128, align 4
				ret void

				for.body: ; preds = %entry, %for.body
				%hop.0236 = phi i32 [ 0, %entry ], [ %add139, %for.body ]
				%add12220235 = phi i32 [ %out.promoted, %entry ], [ %add12, %for.body ]
				%add30221234 = phi i32 [ %arrayidx20.promoted, %entry ], [ %add30, %for.body ]
				%add48222233 = phi i32 [ %arrayidx38.promoted, %entry ], [ %add48, %for.body ]
				%add66223232 = phi i32 [ %arrayidx56.promoted, %entry ], [ %add66, %for.body ]
				%add84224231 = phi i32 [ %arrayidx74.promoted, %entry ], [ %add84, %for.body ]
				%add102225230 = phi i32 [ %arrayidx92.promoted, %entry ], [ %add102, %for.body ]
				%add120226229 = phi i32 [ %arrayidx110.promoted, %entry ], [ %add120, %for.body ]
				%add138227228 = phi i32 [ %arrayidx128.promoted, %entry ], [ %add138, %for.body ]
				%arrayidx = getelementptr inbounds i16, i16* %in, i32 %hop.0236
				%0 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %0 to i32
				%arrayidx1 = getelementptr inbounds i16, i16* %consts, i32 %hop.0236
				%1 = load i16, i16* %arrayidx1, align 2
				%conv2 = sext i16 %1 to i32
				%mul = mul nsw i32 %conv2, %conv
				%add = add nsw i32 %mul, %add12220235
				%add4 = or i32 %hop.0236, 1
				%arrayidx5 = getelementptr inbounds i16, i16* %in, i32 %add4
				%2 = load i16, i16* %arrayidx5, align 2
				%conv6 = sext i16 %2 to i32
				%arrayidx8 = getelementptr inbounds i16, i16* %consts, i32 %add4
				%3 = load i16, i16* %arrayidx8, align 2
				%conv9 = sext i16 %3 to i32
				%mul10 = mul nsw i32 %conv9, %conv6
				%add12 = add nsw i32 %mul10, %add
				%add13 = or i32 %hop.0236, 2
				%arrayidx14 = getelementptr inbounds i16, i16* %in, i32 %add13
				%4 = load i16, i16* %arrayidx14, align 2
				%conv15 = sext i16 %4 to i32
				%arrayidx17 = getelementptr inbounds i16, i16* %consts, i32 %add13
				%5 = load i16, i16* %arrayidx17, align 2
				%conv18 = sext i16 %5 to i32
				%mul19 = mul nsw i32 %conv18, %conv15
				%add21 = add nsw i32 %mul19, %add30221234
				%add22 = or i32 %hop.0236, 3
				%arrayidx23 = getelementptr inbounds i16, i16* %in, i32 %add22
				%6 = load i16, i16* %arrayidx23, align 2
				%conv24 = sext i16 %6 to i32
				%arrayidx26 = getelementptr inbounds i16, i16* %consts, i32 %add22
				%7 = load i16, i16* %arrayidx26, align 2
				%conv27 = sext i16 %7 to i32
				%mul28 = mul nsw i32 %conv27, %conv24
				%add30 = add nsw i32 %mul28, %add21
				%add31 = or i32 %hop.0236, 4
				%arrayidx32 = getelementptr inbounds i16, i16* %in, i32 %add31
				%8 = load i16, i16* %arrayidx32, align 2
				%conv33 = sext i16 %8 to i32
				%arrayidx35 = getelementptr inbounds i16, i16* %consts, i32 %add31
				%9 = load i16, i16* %arrayidx35, align 2
				%conv36 = sext i16 %9 to i32
				%mul37 = mul nsw i32 %conv36, %conv33
				%add39 = add nsw i32 %mul37, %add48222233
				%add40 = or i32 %hop.0236, 5
				%arrayidx41 = getelementptr inbounds i16, i16* %in, i32 %add40
				%10 = load i16, i16* %arrayidx41, align 2
				%conv42 = sext i16 %10 to i32
				%arrayidx44 = getelementptr inbounds i16, i16* %consts, i32 %add40
				%11 = load i16, i16* %arrayidx44, align 2
				%conv45 = sext i16 %11 to i32
				%mul46 = mul nsw i32 %conv45, %conv42
				%add48 = add nsw i32 %mul46, %add39
				%add49 = or i32 %hop.0236, 6
				%arrayidx50 = getelementptr inbounds i16, i16* %in, i32 %add49
				%12 = load i16, i16* %arrayidx50, align 2
				%conv51 = sext i16 %12 to i32
				%arrayidx53 = getelementptr inbounds i16, i16* %consts, i32 %add49
				%13 = load i16, i16* %arrayidx53, align 2
				%conv54 = sext i16 %13 to i32
				%mul55 = mul nsw i32 %conv54, %conv51
				%add57 = add nsw i32 %mul55, %add66223232
				%add58 = or i32 %hop.0236, 7
				%arrayidx59 = getelementptr inbounds i16, i16* %in, i32 %add58
				%14 = load i16, i16* %arrayidx59, align 2
				%conv60 = sext i16 %14 to i32
				%arrayidx62 = getelementptr inbounds i16, i16* %consts, i32 %add58
				%15 = load i16, i16* %arrayidx62, align 2
				%conv63 = sext i16 %15 to i32
				%mul64 = mul nsw i32 %conv63, %conv60
				%add66 = add nsw i32 %mul64, %add57
				%add67 = or i32 %hop.0236, 8
				%arrayidx68 = getelementptr inbounds i16, i16* %in, i32 %add67
				%16 = load i16, i16* %arrayidx68, align 2
				%conv69 = sext i16 %16 to i32
				%arrayidx71 = getelementptr inbounds i16, i16* %consts, i32 %add67
				%17 = load i16, i16* %arrayidx71, align 2
				%conv72 = sext i16 %17 to i32
				%mul73 = mul nsw i32 %conv72, %conv69
				%add75 = add nsw i32 %mul73, %add84224231
				%add76 = or i32 %hop.0236, 9
				%arrayidx77 = getelementptr inbounds i16, i16* %in, i32 %add76
				%18 = load i16, i16* %arrayidx77, align 2
				%conv78 = sext i16 %18 to i32
				%arrayidx80 = getelementptr inbounds i16, i16* %consts, i32 %add76
				%19 = load i16, i16* %arrayidx80, align 2
				%conv81 = sext i16 %19 to i32
				%mul82 = mul nsw i32 %conv81, %conv78
				%add84 = add nsw i32 %mul82, %add75
				%add85 = or i32 %hop.0236, 10
				%arrayidx86 = getelementptr inbounds i16, i16* %in, i32 %add85
				%20 = load i16, i16* %arrayidx86, align 2
				%conv87 = sext i16 %20 to i32
				%arrayidx89 = getelementptr inbounds i16, i16* %consts, i32 %add85
				%21 = load i16, i16* %arrayidx89, align 2
				%conv90 = sext i16 %21 to i32
				%mul91 = mul nsw i32 %conv90, %conv87
				%add93 = add nsw i32 %mul91, %add102225230
				%add94 = or i32 %hop.0236, 11
				%arrayidx95 = getelementptr inbounds i16, i16* %in, i32 %add94
				%22 = load i16, i16* %arrayidx95, align 2
				%conv96 = sext i16 %22 to i32
				%arrayidx98 = getelementptr inbounds i16, i16* %consts, i32 %add94
				%23 = load i16, i16* %arrayidx98, align 2
				%conv99 = sext i16 %23 to i32
				%mul100 = mul nsw i32 %conv99, %conv96
				%add102 = add nsw i32 %mul100, %add93
				%add103 = or i32 %hop.0236, 12
				%arrayidx104 = getelementptr inbounds i16, i16* %in, i32 %add103
				%24 = load i16, i16* %arrayidx104, align 2
				%conv105 = sext i16 %24 to i32
				%arrayidx107 = getelementptr inbounds i16, i16* %consts, i32 %add103
				%25 = load i16, i16* %arrayidx107, align 2
				%conv108 = sext i16 %25 to i32
				%mul109 = mul nsw i32 %conv108, %conv105
				%add111 = add nsw i32 %mul109, %add120226229
				%add112 = or i32 %hop.0236, 13
				%arrayidx113 = getelementptr inbounds i16, i16* %in, i32 %add112
				%26 = load i16, i16* %arrayidx113, align 2
				%conv114 = sext i16 %26 to i32
				%arrayidx116 = getelementptr inbounds i16, i16* %consts, i32 %add112
				%27 = load i16, i16* %arrayidx116, align 2
				%conv117 = sext i16 %27 to i32
				%mul118 = mul nsw i32 %conv117, %conv114
				%add120 = add nsw i32 %mul118, %add111
				%add121 = or i32 %hop.0236, 14
				%arrayidx122 = getelementptr inbounds i16, i16* %in, i32 %add121
				%28 = load i16, i16* %arrayidx122, align 2
				%conv123 = sext i16 %28 to i32
				%arrayidx125 = getelementptr inbounds i16, i16* %consts, i32 %add121
				%29 = load i16, i16* %arrayidx125, align 2
				%conv126 = sext i16 %29 to i32
				%mul127 = mul nsw i32 %conv126, %conv123
				%add129 = add nsw i32 %mul127, %add138227228
				%add130 = or i32 %hop.0236, 15
				%arrayidx131 = getelementptr inbounds i16, i16* %in, i32 %add130
				%30 = load i16, i16* %arrayidx131, align 2
				%conv132 = sext i16 %30 to i32
				%arrayidx134 = getelementptr inbounds i16, i16* %consts, i32 %add130
				%31 = load i16, i16* %arrayidx134, align 2
				%conv135 = sext i16 %31 to i32
				%mul136 = mul nsw i32 %conv135, %conv132
				%add138 = add nsw i32 %mul136, %add129
				%add139 = add nuw nsw i32 %hop.0236, 16
				%cmp = icmp ult i32 %hop.0236, 112
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				attributes #0 = { "target-features"="+mve" }
				No newline at end of file
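
The two tests above differ only in their trip count (5 versus 8), which is what the summary's rounding argument turns on. The following is a minimal sketch of that arithmetic, not the code in LoopVectorize.cpp: the helper names and the per-iteration costs (10 for a scalar iteration, 30 for a VF=4 vector iteration) are made up for illustration, and `ceilDiv` merely plays the role that `divideCeil()` from MathExtras plays in the review comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round the quotient up, so a partial final vector
// iteration (the folded tail) is counted as a whole iteration.
inline uint64_t ceilDiv(uint64_t Numerator, uint64_t Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

// Hypothetical comparison: total scalar cost versus total cost of a
// tail-folded vector loop with a known trip count. Setup cost and any
// scalar/vector mixture are ignored, as the summary notes.
inline bool vectorIsProfitable(uint64_t TripCount, uint64_t VF,
                               uint64_t ScalarIterCost,
                               uint64_t VectorIterCost) {
  uint64_t ScalarTotal = TripCount * ScalarIterCost;
  uint64_t VectorTotal = ceilDiv(TripCount, VF) * VectorIterCost;
  return VectorTotal < ScalarTotal;
}

// Illustrative costs only; real values come from the target cost model.
inline void checkMotivatingCases() {
  // Trip count 5, VF 4: 2 vector iterations (60) versus 5 scalar (50).
  assert(!vectorIsProfitable(5, 4, 10, 30));
  // Trip count 8, VF 4: 2 vector iterations (60) versus 8 scalar (80).
  assert(vectorIsProfitable(8, 4, 10, 30));
}
```

With these assumed costs both loops need two vector iterations, but only the trip count of 8 has enough scalar work to make those two iterations worthwhile.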

This is an archive of the discontinued LLVM Phabricator instance.

[LV] Account for tripcount when calculation vectorization profitability
ClosedPublic

Diff 343359

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll
