I don't see an ideal solution to these 2 related, potentially large, perf regressions:
https://bugs.llvm.org/show_bug.cgi?id=42708
https://bugs.llvm.org/show_bug.cgi?id=43146
We decided that load combining was unsuitable for IR because it could obscure other optimizations there. So we removed the LoadCombiner pass and deferred to the backend. Therefore, preventing SLP from destroying load-combine opportunities requires that it recognize patterns that could be combined later, without doing the combining itself (it's not a vector combine anyway, so it's probably out of scope for SLP).
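To make that concrete, here is a minimal sketch of how such a bail-out check could look; this is not the actual SLP code, and the helper name, recursion limit, and structure are all hypothetical. It uses LLVM's PatternMatch utilities to spot an or-chain of shifted, zero-extended loads:

#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Returns true if V looks like part of a chain such as
//   or(or(..., shl(zext(load p1), C1)), shl(zext(load p2), C2))
// that the backend could later fold into a single wide load
// (or a load plus bswap/movbe).
static bool looksLikeLoadCombineChain(Value *V, unsigned Depth = 0) {
  if (Depth > 8) // hypothetical recursion limit
    return false;
  Value *L, *R;
  if (match(V, m_Or(m_Value(L), m_Value(R))))
    return looksLikeLoadCombineChain(L, Depth + 1) &&
           looksLikeLoadCombineChain(R, Depth + 1);
  // A leaf is a (possibly shifted) zero-extended narrow load.
  uint64_t ShAmt;
  if (match(V, m_Shl(m_Value(L), m_ConstantInt(ShAmt))))
    V = L;
  return match(V, m_ZExt(m_Load(m_Value(L))));
}

If a reduction tree matches a predicate like this, SLP would simply decline to vectorize it and leave the scalar chain for SDAG.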
In the x86 tests shown (and discussed in more detail in the bug reports), SDAG combining will produce a single instruction like:
movbe rax, qword ptr [rdi]
or:
mov rax, qword ptr [rdi]
Not the (half-)vector monstrosity that we currently produce using SLP:
vpmovzxbq ymm0, dword ptr [rdi + 1] # ymm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero,mem[2],zero,zero,zero,zero,zero,zero,zero,mem[3],zero,zero,zero,zero,zero,zero,zero
vpsllvq ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
movzx eax, byte ptr [rdi]
movzx ecx, byte ptr [rdi + 5]
shl rcx, 40
movzx edx, byte ptr [rdi + 6]
shl rdx, 48
or rdx, rcx
movzx ecx, byte ptr [rdi + 7]
shl rcx, 56
or rcx, rdx
or rcx, rax
vextracti128 xmm1, ymm0, 1
vpor xmm0, xmm0, xmm1
vpshufd xmm1, xmm0, 78 # xmm1 = xmm0[2,3,0,1]
vpor xmm0, xmm0, xmm1
vmovq rax, xmm0
or rax, rcx
vzeroupper
ret
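For reference, the source pattern at issue looks roughly like this (a hedged reconstruction; the exact test cases are in the bug reports above): eight adjacent byte loads assembled into one 64-bit value, which SDAG folds into a single mov, or into movbe / mov+bswap for the byte-reversed variant.

#include <cstdint>

// Hypothetical reduction of the pattern: a byte-by-byte little-endian
// 64-bit load. SDAG load combining replaces all of this with one
// 8-byte mov from [p].
uint64_t load_le64(const uint8_t *p) {
  return (uint64_t)p[0]         | ((uint64_t)p[1] << 8)  |
         ((uint64_t)p[2] << 16) | ((uint64_t)p[3] << 24) |
         ((uint64_t)p[4] << 32) | ((uint64_t)p[5] << 40) |
         ((uint64_t)p[6] << 48) | ((uint64_t)p[7] << 56);
}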
Oops, sorry, my fault; I pointed to the wrong function. It would be better to modify a function in include/llvm/CodeGen/BasicTTIImpl.h, BasicTTIImplBase::getArithmeticInstrCost, if this is a common limitation.
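That is, the adjustment would live in the generic cost model rather than one target's TTI. A rough sketch only; the signature below follows the LLVM 9-era tree and has changed since, and the condition and cost are placeholders:

// In include/llvm/CodeGen/BasicTTIImpl.h, inside BasicTTIImplBase:
unsigned getArithmeticInstrCost(
    unsigned Opcode, Type *Ty,
    TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
    TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
    TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,
    TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
    ArrayRef<const Value *> Args = ArrayRef<const Value *>()) {
  // ... existing generic cost computation ...
  // Placeholder: if the limitation is common to all targets, adjust the
  // returned cost here so every target benefits, instead of overriding
  // getArithmeticInstrCost in a single target's TTI implementation.
}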