Many scalar X86 instructions can reference memory directly, without first loading the value into a register. The X86 cost model does not take this into account. This patch considers the cost of a load instruction to be 0 if it is the only load operand of a binary instruction that supports loading from memory, or if it is only the second load instruction.
Diff Detail
 Repository
 rL LLVM
 Build Status
Buildable 15209 Build 15209: arc lint + arc unit
Event Timeline
The patch is doing what I hoped for on PR36280, but I don't understand the reasoning.
A folded load is not actually free on any x86 CPU AFAIK. It always has an extra uop, and that uop is handled on a different port / execution unit than the math op. The cost model that we're using here is based on throughput rates (let me know if I'm wrong), so I don't see how any load could be free.
This cost is already part of the cost of the arithmetic instructions, and we count it a second time when we try to estimate the cost of the standalone load instructions.
I need some help to understand this. Is SLP double-counting the cost of load instructions? If so, why? If you could explain exactly what is happening in the PR36280 test, that would help me.
Sure. Here in PR36280 we have a vectorization tree of 3 nodes: 1) the float mul instructions %mul1 and %mul2; 2) the load instructions %p1 and %p2; 3) a gather of %x and %y. We also have 2 external uses, %add1 and %add2. The vectorization cost is calculated on a per-tree-node basis: 1) Node cost is 2 (cost of the vectorized fmul) - (2 + 2) (costs of the scalar muls) = -2. 2) Node cost is 1 (cost of the vectorized load) - (1 + 1)(!!!!) (cost of the scalar loads) = -1. Note that in the resulting code these loads are folded into the vmulss instructions, and their cost was already counted when we calculated the cost of vectorizing the floating point multiplications. The real cost must be 1 (vector load) - (0 + 0) (the cost of the scalar loads) = 1. 3) The cost of the gather node is 1 for the gather + 1 for an extract op. The total is thus -1. If we correct the cost of the loads, the final cost is going to be 1.
This does not mean that we should not fix the cost of the floating point operations, but the same problem exists for integer operations. Because of the extra cost of the scalar loads we are too optimistic about vectorized code in many cases, and this leads to performance degradation.
lib/Target/X86/X86TargetTransformInfo.cpp  

1826  A little tricky, but we only fold certain operands (Operand(1) typically), so if it doesn't commute then we can't fold - ISD::SUB/FSUB/FDIV/SHL/ASHR/LSHR will all definitely be affected by this. Also we need to handle the case where both operands are loads - we will still have at least one load cost.  
test/Transforms/SLPVectorizer/X86/arithmul.ll  
265  This looks like a cost model issue - scalar integer muls on Silvermont are pretty awful, just like the vector imuls..... 
test/Transforms/SLPVectorizer/X86/arithmul.ll  

265  Yes, we just don't have a correct cost model for scalar ops. This is another issue that must be fixed. 
Thank you for the explanation. I thought we were summing uops as the cost calculation, but we're not.
I'm not sure if the current model is best suited for x86 (I'd use uop count as the first approximation for x86 perf), but I guess that's a bigger and independent question. I still don't know SLP that well, so if others are happy with this solution, I have no objections.
lib/Target/X86/X86TargetTransformInfo.cpp  

1822  Shouldn't this be I->getOperand(1) == OpI?  
1826  Shift instructions can only fold the second operand as well (and even then very poorly....). I'd be tempted to just not include them in isFreeOp at all tbh.  
1838  Please can you add a comment explaining the reason for the ScalarSizeInBits condition?  
1844  Why have you only put the constant condition on FADD/FMUL?  
1913  Comment this please. 
lib/Target/X86/X86TargetTransformInfo.cpp  

1822  Yes, I missed this when I updated the patch last time, thanks!  
1826  No, actually only the first operand can be the memory address and, thus, can be folded  
1838  Yes, sure.  
1844  Integer operations allow using a register/immediate value as one of the operands.  
1913  Ok. 
lib/Target/X86/X86TargetTransformInfo.cpp  

1826  I mean, you're right in terms of asm instructions, but in LLVM IR we shift the first operand, so it must be Operand(0), not Operand(1). Also, I limited the size of the data. 