This is an archive of the discontinued LLVM Phabricator instance.

[SLP] avoid reduction transform on patterns that the backend can load-combine
ClosedPublic

Authored by spatel on Sep 20 2019, 9:45 AM.

Details

Summary

I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend. Therefore, preventing SLP from destroying load-combine opportunities requires that SLP recognize patterns that could be combined later, but not perform the optimization itself (it's not a vector combine anyway, so it's probably out of scope for SLP).
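
For illustration, the scalar pattern in question looks something like this (a minimal C++ sketch of the shape involved, not the exact code from the bug reports):

#include <cstdint>

// Eight adjacent byte loads, zero-extended, shifted, and or'ed together.
// The SelectionDAG load combiner folds this into one 64-bit load (a
// little-endian read here; reversing the shift amounts yields movbe).
uint64_t load_le64(const uint8_t *p) {
  return  (uint64_t)p[0]        |
         ((uint64_t)p[1] <<  8) |
         ((uint64_t)p[2] << 16) |
         ((uint64_t)p[3] << 24) |
         ((uint64_t)p[4] << 32) |
         ((uint64_t)p[5] << 40) |
         ((uint64_t)p[6] << 48) |
         ((uint64_t)p[7] << 56);
}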

In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining will produce a single instruction on these tests like:

movbe   rax, qword ptr [rdi]

or:

mov     rax, qword ptr [rdi]

Not the (half) vector monstrosity that we currently produce using SLP:

vpmovzxbq       ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero
vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
movzx   eax, byte ptr [rdi]
movzx   ecx, byte ptr [rdi + 5]
shl     rcx, 40
movzx   edx, byte ptr [rdi + 6]
shl     rdx, 48
or      rdx, rcx
movzx   ecx, byte ptr [rdi + 7]
shl     rcx, 56
or      rcx, rdx
or      rcx, rax
vextracti128    xmm1, ymm0, 1
vpor    xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
vpor    xmm0, xmm0, xmm1
vmovq   rax, xmm0
or      rax, rcx
vzeroupper
ret

Event Timeline

spatel created this revision. Sep 20 2019, 9:45 AM
Herald added a project: Restricted Project. Sep 20 2019, 9:45 AM

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.

For reference, can you link those patches/discussions?

Is this similar to D42981?

We decided that load combining was unsuitable for IR because it could obscure other optimizations in IR. So we removed the LoadCombiner pass and deferred to the backend.

For reference, can you link those patches/discussions?

http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html

More recently:
http://llvm.1065342.n5.nabble.com/llvm-dev-Load-combine-pass-td101187i20.html#a131076

This is the commit to remove the pass from trunk:
https://reviews.llvm.org/rL306067

Is this similar to D42981?

I think it's trying to accomplish a similar goal, but that patch affects far more than this (and I'm not sure it would solve this problem). Ie, this is intentionally trying hard to affect only the patterns that we know will be combined into a single larger scalar load.

Is this similar to D42981?

I think it's trying to accomplish a similar goal, but that patch affects far more than this (and I'm not sure it would solve this problem). Ie, this is intentionally trying hard to affect only the patterns that we know will be combined into a single larger scalar load.

I looked at the other patch again, and it is not solving the examples in the motivating bug reports. The problem in these cases is not the vector cost model; it's the scalar cost model. Unless we resurrect LoadCombiner in IR, we need to recognize that small consecutive scalar loads can be combined to a larger scalar op.

The pattern-matching here isn't specific enough to fully recognize a bswap for example, but that's not necessary IMO. We have perf regressions, and this is a conservative change to the cost model calc that fixes them. If we underestimated the scalar cost for some sequence that can't be reduced to a single load, that can be improved later.

ABataev added inline comments. Sep 25 2019, 11:10 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6514

Maybe it would be better to make this a member of TargetTransformInfo rather than of the SLP vectorizer?

spatel updated this revision to Diff 221961. Sep 26 2019, 8:27 AM
spatel marked an inline comment as done.

Patch updated:
Move the analysis/calculation of load-combining from SLP to the TTI cost model.

ABataev added inline comments. Sep 26 2019, 8:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6553

Shall we move all this stuff to TargetTransformInfo::getArithmeticInstrCost?

spatel marked 2 inline comments as done. Sep 26 2019, 9:04 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6553

Do you have a specific suggestion about how to alter that API to make this fit?

I'm not seeing how to do it without making this more confusing than having a dedicated function.

ABataev added inline comments. Sep 26 2019, 10:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6553

The getArithmeticInstrCost function has an optional parameter, Args. We can use it when we call this function and try to pattern-match the arguments of the instruction rather than the instruction itself, as in your original patch. Also, you will need to change the way you estimate the cost of the instruction: you need to estimate it for the single instruction. That is going to be less precise than your current implementation, but this is a common problem of the cost model, which should be addressed separately.
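
A rough sketch of that call shape (illustrative only, and assuming the existing trailing ArrayRef<const Value *> Args parameter; these helper names are hypothetical, not the committed code):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Illustrative only: cost the single scalar instruction itself, but hand its
// operand values to TTI through the optional Args parameter so the cost model
// can pattern-match a load-combine shape behind this one instruction.
static int getScalarCostWithOperands(const Instruction *I,
                                     const TargetTransformInfo &TTI) {
  SmallVector<const Value *, 2> Operands(I->value_op_begin(),
                                         I->value_op_end());
  return TTI.getArithmeticInstrCost(
      I->getOpcode(), I->getType(), TargetTransformInfo::OK_AnyValue,
      TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None,
      TargetTransformInfo::OP_None, Operands);
}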

spatel updated this revision to Diff 222018. Sep 26 2019, 1:16 PM
spatel marked 2 inline comments as done.

Patch updated:
Make load combining cost calculation a specialization of general arithmetic instruction cost by using optional parameter.

ABataev added inline comments. Sep 26 2019, 1:38 PM
llvm/lib/Analysis/TargetTransformInfo.cpp
633 ↗(On Diff #222018)

Oops, sorry, my fault, I pointed to the wrong function. Better to modify the function in include/llvm/CodeGen/BasicTTIImpl.h, BasicTTIImplBase::getArithmeticInstrCost, if this is a common limitation.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6557

I don't think it is a good idea to break the contract of the function. Better to pass the arguments of the instruction, just as the function requests.

spatel updated this revision to Diff 222178. Sep 27 2019, 8:21 AM
spatel marked an inline comment as done.

Patch updated:
Pass in reduction root operands to cost model rather than reduced value.

spatel marked an inline comment as done. Sep 27 2019, 8:28 AM
spatel added inline comments.
llvm/lib/Analysis/TargetTransformInfo.cpp
633 ↗(On Diff #222018)

AFAICT, if we change the concept/model base classes, the new code would only be reached after any target-specific override, so we would need to change all of the override implementations to access it. Adding code to the concrete class lets us apply the target-independent customization for a load combine before the target does its usual logic.
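
To illustrate the layering (a generic C++ sketch only, not LLVM's actual TTI classes):

// A hook added to the type-erased base is only reached if a target's
// override calls into it, while a hook added in the public wrapper runs
// before any target-specific dispatch.
struct Concept {
  virtual ~Concept() = default;
  virtual int getCost() = 0; // targets provide their own implementation
};

template <typename TargetImpl> struct Model : Concept {
  TargetImpl Impl;
  int getCost() override { return Impl.getCost(); } // straight to the target
};

class PublicWrapper {
  Concept *Impl;
public:
  explicit PublicWrapper(Concept *C) : Impl(C) {}
  int getCost() {
    // Target-independent customization can be added here, before the call
    // dispatches to whatever the target implements.
    return Impl->getCost();
  }
};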

Can you point to an example of what you'd like to see and/or show it exactly?

ABataev added inline comments. Sep 27 2019, 8:42 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
633 ↗(On Diff #222018)

Yes, this is what I was afraid of. In general, it would be good to put it in the base class, but if you think that this is not possible without code duplication in derived classes, it is better to leave it as is. Other opinions?

spatel added a comment. Oct 4 2019, 5:40 AM

Ping.

(I did draft a code change that altered the concept/model classes, but as mentioned earlier, I couldn't make that work as intended.)

ABataev accepted this revision. Oct 4 2019, 5:52 AM

I'm good with this unless there are other opinions.

This revision is now accepted and ready to land. Oct 4 2019, 5:52 AM
RKSimon accepted this revision. Oct 4 2019, 8:09 AM

LGTM as a stopgap

This revision was automatically updated to reflect the committed changes.

This caused lots of failed asserts in building many different projects, see https://bugs.llvm.org/show_bug.cgi?id=43582, so I went ahead and reverted it for now.

RKSimon reopened this revision. Oct 7 2019, 2:50 AM
This revision is now accepted and ready to land. Oct 7 2019, 2:50 AM
RKSimon requested changes to this revision. Oct 7 2019, 2:50 AM
This revision now requires changes to proceed. Oct 7 2019, 2:50 AM
spatel added a comment. Oct 7 2019, 7:13 AM

This caused lots of failed asserts in building many different projects, see https://bugs.llvm.org/show_bug.cgi?id=43582, so I went ahead and reverted it for now.

Thanks. I looked at the test cases attached to the bug report, and this patch causes scary behavior:
LV: Found an estimated cost of 4294967293 for VF 1 For instruction: %or75 = or i32 %shl74, %shl71

The loop vectorizer assumes that costs are always positive (it converts the value returned by the cost model to an *unsigned* value).
This matches the assert in the getArithmeticInstrCost() implementation that we tried to bypass:

assert(Cost >= 0 && "TTI should not produce negative costs!");
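
For reference, that huge cost is just a small negative value wrapped around to unsigned (4294967293 is 2^32 - 3); a minimal illustration:

#include <cstdio>

int main() {
  int Cost = -3;        // e.g. a negative "discounted" scalar cost
  unsigned U = Cost;    // the loop vectorizer converts the cost to unsigned
  printf("%u\n", U);    // prints 4294967293 with a 32-bit unsigned
  return 0;
}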

But we want SLP to weigh the *relative* cost of scalar code (that will be reduced) vs. vector code.

I think we should use the earlier revision of this patch that created a dedicated function for estimating a load combining pattern. Ie, we tried to squeeze this into the more general getArithmeticInstrCost() API, but it does not belong there. Existing callers have made assumptions about using that cost model API, and we violated the contract:

/// This is an approximation of reciprocal throughput of a math/logic op.
/// A higher cost indicates less expected throughput.
/// From Agner Fog's guides, reciprocal throughput is "the average number of
/// clock cycles per instruction when the instructions are not part of a
/// limiting dependency chain."
/// Therefore, costs should be scaled to account for multiple execution units
/// on the target that can process this type of instruction. For example, if
/// there are 5 scalar integer units and 2 vector integer units that can
/// calculate an 'add' in a single cycle, this model should indicate that the
/// cost of the vector add instruction is 2.5 times the cost of the scalar
/// add instruction.
/// \p Args is an optional argument which holds the instruction operands
/// values so the TTI can analyze those values searching for special
/// cases or optimizations based on those values.
int getArithmeticInstrCost(
spatel updated this revision to Diff 223613. Oct 7 2019, 8:33 AM

Patch updated:
Jump back to the earlier revision which created a new method for cost of a load-combine pattern. This is independent of the existing arithmetic instruction cost API, so we can be sure that there is no conflict with other cost model users. It's also more accurate because we can add the cost of a wider load just once for the entire pattern. Test case for the loop vectorizer crash was added at rL373913.

spatel updated this revision to Diff 223625. Oct 7 2019, 10:25 AM

Patch updated:
Moved "using PatternMatch" line within function that uses that API.

Side note: filed https://bugs.llvm.org/show_bug.cgi?id=43591 for larger questions about the cost model. I really don't want to hold up the underlying motivating bugs for this patch while we try to untangle that mess.

Ping.

There seems to be general agreement that this is a temporary (stopgap) solution until we can do limited load combining in IR. So it's a question of whether we're ok with a cost model hack to overcome the motivating bugs while we figure out how to add a new pass.

ABataev added inline comments. Oct 11 2019, 9:39 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1698 ↗(On Diff #223625)

Shall we just bail out of the reduction for this special construct unconditionally? Do we need this cost calculation at all if the goal is to prevent the reduction every time we see this pattern? If so, then we probably don't need to calculate the cost.
There is a function isTreeTinyAndNotFullyVectorizable(). Can you put the pattern-matching analysis for this particular construct into that function without any additional cost analysis?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6552–6553

Maybe better to put the check for the operation into the function itself?

spatel marked 2 inline comments as done. Oct 16 2019, 4:46 AM
spatel added inline comments.
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1698 ↗(On Diff #223625)

I'm not opposed to ignoring costs for this patch. It makes the patch simpler (although less flexible, since targets can't then override via the cost model).

I don't think we should put this directly into isTreeTinyAndNotFullyVectorizable() though. In this case, the tree may not be tiny, and it may be fully vectorizable, so that would be wrong on both counts. :)

I think the first revision of the patch was already close in spirit to this suggestion - it modified SLP alone rather than the cost model. Let me rework that, and we'll see if that looks better.

spatel updated this revision to Diff 225197. Oct 16 2019, 5:01 AM

Patch updated:
Change implementation to a pure SLP bailout based on pattern-matching of a load-combine-reduction opportunity. Same test results.
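
The shape of the bailout check is roughly as follows (an illustrative sketch with hypothetical names, not the committed code):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Illustrative only: return true if every value feeding an 'or' reduction is
// a zero-extended load, optionally shifted left by a constant. That is the
// shape the backend's load combiner can fold into a single wide (possibly
// byte-swapped) load, so SLP should leave the scalar code alone.
static bool looksLikeLoadCombineReduction(ArrayRef<Value *> ReducedVals) {
  for (Value *V : ReducedVals) {
    Value *Ptr;
    if (!match(V, m_ZExt(m_Load(m_Value(Ptr)))) &&
        !match(V, m_Shl(m_ZExt(m_Load(m_Value(Ptr))), m_ConstantInt())))
      return false;
  }
  return !ReducedVals.empty();
}

If a check like this succeeds for an 'or' reduction root, SLP simply declines to vectorize the horizontal reduction and leaves the scalars for the backend's load combiner.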

This revision was not accepted when it landed; it landed in state Needs Review. Oct 16 2019, 11:05 AM
This revision was automatically updated to reflect the committed changes.