This is an archive of the discontinued LLVM Phabricator instance.

Differential D59710

[SLP] remove lower limit for forming reduction patterns
AbandonedPublic

Authored by spatel on Mar 22 2019, 12:35 PM.

Details

Reviewers

ABataev
RKSimon
dtemirbulatov
echristo
vporpo
arsenm

Commits

rG7ff57705ba19: [SLP] allow forming 2-way reduction patterns

Summary

We have a vector compare reduction problem seen in PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2

Or slightly reduced here:

define i1 @cmp2(<2 x double> %a0) {
  %a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
  %b = extractelement <2 x i1> %a, i32 0
  %c = extractelement <2 x i1> %a, i32 1
  %d = and i1 %b, %c
  ret i1 %d
}

SLP does not attempt to turn this into a vector reduction because there is an (artificial?) lower limit on that transform. I don't think we should have that limit: if the target's cost model says a reduction is cheaper (and it probably would be on x86), then we should do the transform.

Trying to make up for disallowing the transform in the backend (D59669) is not going to work. We would need to duplicate large chunks of IR optimizations. And it is clear that we can't do this as a target-independent canonicalization in instcombine because it involves creating shuffles and vector ops.
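
As a concrete model of the transform under discussion, the sketch below emulates @cmp2 in plain C++ (it is not LLVM code). It compares the scalar form, which extracts both lanes and ANDs them, with the shuffle-based reduction form that SLP would emit. The Vec2b type and all three function names are invented for this illustration.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for <2 x i1>; not an LLVM type.
using Vec2b = std::array<bool, 2>;

// Lane-wise compare against the splat constant 1.0, as in @cmp2.
// C++ '>' is false for NaN inputs, like an ordered compare.
Vec2b cmpOgtOne(const std::array<double, 2> &v) {
  return {v[0] > 1.0, v[1] > 1.0};
}

// Scalar form: extract both lanes, then 'and' the two scalars.
bool reduceScalar(const Vec2b &m) { return m[0] && m[1]; }

// Reduction form: shuffle lane 1 down to lane 0, 'and' the vectors,
// then extract lane 0. The upper shuffle lane is undef in IR; 'false'
// is used here only to keep the model deterministic.
bool reduceShuffle(const Vec2b &m) {
  Vec2b shuf = {m[1], false};
  Vec2b rdx = {m[0] && shuf[0], m[1] && shuf[1]};
  return rdx[0];
}
```

Both forms give the same answer for every input; whether the vector form is cheaper is a target cost-model question, which is what the rest of this review argues about.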

Event Timeline

spatel created this revision. Mar 22 2019, 12:35 PM

Herald added a project: Restricted Project. Mar 22 2019, 12:35 PM

Herald added subscribers: jdoerfert, hiraditya, javed.absar and 3 others.

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful try to vectorize the operands of the instruction itself.

In D59710#1440002, @ABataev wrote:

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful try to vectorize the operands of the instruction itself.

@ABataev Please can you be more specific on what icmp issues you mean? All I'm mainly seeing is a lot of scalar leftovers that InstCombine/InstSimplify will clean up if we added them as a pass in the tests.

spatel mentioned this in D59669: [x86] use movmsk when extracting multiple lanes of a vector compare (PR39665). Mar 23 2019, 6:34 AM

dmgreen added a subscriber: dmgreen. Mar 25 2019, 3:31 AM

ABataev added inline comments. Mar 25 2019, 8:08 AM

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	Check this test and the next few. Previously, we had a reduction of 4 elements; now we have a reduction of only 2 elements. This patch makes it worse than it was before.

In D59710#1440002, @ABataev wrote:

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions, and only if that is unsuccessful try to vectorize the operands of the instruction itself.

It's not clear to me how to reorganize SLP to make this happen, so if anyone has suggestions, please let me know. I want to be clear though that this patch is not just about cmp instructions.

This should be turned into a vector reduction too if the cost model says it is profitable:

define double @fadd2(<2 x double> %a0) {
  %a = fadd fast <2 x double> %a0, <double 1.000000e+00, double 1.000000e+00>
  %b = extractelement <2 x double> %a, i32 0
  %c = extractelement <2 x double> %a, i32 1
  %d = fadd fast double %b, %c
  ret double %d
}

RKSimon added a reviewer: arsenm. Apr 1 2019, 6:15 AM

RKSimon added a subscriber: arsenm.

RKSimon added inline comments.

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	@arsenm may be able to confirm, but AFAICT the AMDGPU changes don't appear to be relevant: for anything but i16 types it will scalarize in the backend anyhow, and we're just seeing the side effects of mostly zero costs for min/max, shuffle and extract/insert operations. The i16 reduction tests in AMDGPU/reduction.ll are more relevant and are not affected by this patch.

Herald added a subscriber: wdng. Apr 1 2019, 6:15 AM

ABataev added inline comments. Apr 1 2019, 6:25 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
56	What about this one? This also looks like a regression

RKSimon added inline comments. Apr 1 2019, 8:35 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
56	Sanjay and I have checked with godbolt/llvm-mca and this looks like a definite win (checked on bdver2, haswell and btver2). Top is scalar, middle is trunk and bottom is patched IR: bdver2: https://godbolt.org/z/jwCPgI haswell: https://godbolt.org/z/R-h8o_

ABataev added inline comments. Apr 1 2019, 9:33 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
56	But that does not mean the patch is correct; it means that we are again not quite good with the cost calculation, and that the previous implementation is not quite optimal. But the number of vectorized operations is reduced, which means that the patch introduces some regressions in the vectorization result. In some cases, it will result in significantly worse code.

arsenm added inline comments. Jun 14 2019, 7:40 AM

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	GFX9 has min/max for <2 x i16>. Just about every 32-bit op is scalarized, except a few that can be treated as i64. This also shrinks the load, which is worse (but SLP for some reason usually does this, which is largely why there is a LoadStoreVectorizer).

Please can you rebase?

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗	(On Diff #191921)	Isn't [[TMP11:%.*]] already defined at line 22?

spatel marked an inline comment as done. Nov 1 2019, 5:14 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗	(On Diff #191921)	Yes - update_test_checks.py has a bug with input IR that contains explicit "%tmp`" names for values. Those conflict with the script's naming that uses "TMP`" for regex matching of unnamed values. It's only by chance/luck that this test is passing even without this patch.

Patch updated:
Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

ABataev added inline comments. Nov 1 2019, 7:46 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
82–86 ↗	(On Diff #227446)	This one is worse than it was before for SSE

spatel mentioned this in D68819: [Utils] Allow update_test_checks to check function arguments. Nov 1 2019, 7:48 AM

spatel marked an inline comment as done. Nov 1 2019, 8:22 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll

82–86 ↗

(On Diff #227446)

Here are the SSE alternatives:

Without SLP (original IR):

movq	%xmm0, %rax
pshufd	$78, %xmm0, %xmm0       ## xmm0 = xmm0[2,3,0,1]
movq	%xmm0, %rcx
addq	%rax, %rcx
movq	%xmm1, %rax
pshufd	$78, %xmm1, %xmm0       ## xmm0 = xmm1[2,3,0,1]
movq	%xmm0, %rdx
addq	%rax, %rdx
movq	%rdx, %xmm1
movq	%rcx, %xmm0
punpcklqdq	%xmm1, %xmm0    ## xmm0 = xmm0[0],xmm1[0]

With SLP currently:

movdqa	%xmm0, %xmm2
punpcklqdq	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0]
punpckhqdq	%xmm1, %xmm0    ## xmm0 = xmm0[1],xmm1[1]
paddq	%xmm2, %xmm0

With this SLP patch:

pshufd	$78, %xmm0, %xmm2       ## xmm2 = xmm0[2,3,0,1]
paddq	%xmm2, %xmm0
pshufd	$78, %xmm1, %xmm2       ## xmm2 = xmm1[2,3,0,1]
paddq	%xmm1, %xmm2
punpcklqdq	%xmm2, %xmm0    ## xmm0 = xmm0[0],xmm2[0]

Ideally, we can get SLP to choose the shorter sequence (bypass treating this as a reduction).

I don't think we can ask instcombine to create that sequence because it requires creating semi-arbitrary shuffle instructions.

Or we can view this as a backend opportunity to reduce a shuffle-binop-shuffle sequence:

    t6: v2i64 = vector_shuffle<1,u> t2, undef:v2i64
  t7: v2i64 = add t6, t2
    t8: v2i64 = vector_shuffle<1,u> t4, undef:v2i64
  t9: v2i64 = add t8, t4
t10: v2i64 = vector_shuffle<0,2> t7, t9
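
To check that the two vectorized SSE sequences above compute the same value, here is a portable C++ emulation of the instructions involved. It is only a model: V2 stands in for a 128-bit register holding two 64-bit lanes, and the helper names simply mirror the mnemonics in the listings (pshufd78 models pshufd with immediate 78, which swaps the 64-bit halves).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a v2i64 register; not a real SSE type.
using V2 = std::array<std::uint64_t, 2>;

V2 paddq(V2 a, V2 b) { return {a[0] + b[0], a[1] + b[1]}; }
V2 punpcklqdq(V2 a, V2 b) { return {a[0], b[0]}; } // low halves of a, b
V2 punpckhqdq(V2 a, V2 b) { return {a[1], b[1]}; } // high halves of a, b
V2 pshufd78(V2 a) { return {a[1], a[0]}; }         // swap 64-bit halves

// "With SLP currently": two unpacks feeding one add.
V2 haddCurrent(V2 a, V2 b) {
  return paddq(punpcklqdq(a, b), punpckhqdq(a, b));
}

// "With this SLP patch": one 2-way reduction per input, then combine
// lane 0 of each result.
V2 haddPatched(V2 a, V2 b) {
  V2 ra = paddq(pshufd78(a), a);
  V2 rb = paddq(pshufd78(b), b);
  return punpcklqdq(ra, rb);
}
```

Both functions return {a[0] + a[1], b[0] + b[1]}. The patched form spends two shuffles and two adds where the current form needs two unpacks and one add, which is the difference the comment above is pointing at.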

In D59710#1730146, @spatel wrote:

Patch updated:
Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

Please rerun, D69719 landed and the churn should be gone. Sorry for the noise.

ABataev added inline comments. Nov 1 2019, 11:17 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
82–86 ↗	(On Diff #227446)	Maybe, try to reduce 2 elements only after regular reduction did not work somehow?

Patch updated:
Rebased after D69719 (no real diffs from previous, but does not include unrelated test changes from scripted FileCheck lines); so this should be very similar to 2 revs back.

Patch updated:
Carve out an exception for forming 2-way reductions by threading the minimum elements as a parameter from vectorizeChainsInBlock(). This is more restrictive than necessary (it doesn't get all of the motivating examples), but it does not introduce any obvious regressions either.

RKSimon added inline comments. Nov 6 2019, 2:55 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	SLM has really poor v2i64 add costs - so I'm surprised this happened - we may need SLM special handling in getArithmeticReductionCost?

ABataev added inline comments. Nov 6 2019, 7:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7370	The only problem with this solution is that it may increase compile time. It would be good to limit it strictly to trying to vectorize 2-value reductions. Thoughts?

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	I think it is a problem of the cost model; maybe the SLM cost model is not aware of the very expensive v2i64 add cost?

spatel marked an inline comment as done. Nov 6 2019, 8:04 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll

93 ↗

(On Diff #227972)

Taking a look...debug output shows:

SLP: Calculating cost for tree of size 1.
SLP: Adding cost -2 for bundle that starts with   %a0 = extractelement <2 x i64> %a, i32 0.
SLP: Spill Cost = 0.
SLP: Extract Cost = 0.
SLP: Total Cost = -2.
SLP: Adding cost 1 for reduction that starts with   %a0 = extractelement <2 x i64> %a, i32 0 (It is a splitting reduction)
SLP: Vectorizing horizontal reduction at cost:-1. (HorRdx)
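
The arithmetic in that log can be written out as a small sketch. It assumes that SLP vectorizes a candidate only when its total cost falls below the negated threshold (the -slp-threshold option, which defaults to 0). The struct and function names are invented for illustration and do not exist in SLPVectorizer.cpp.

```cpp
#include <cassert>

// Hypothetical summary of the two costs printed in the debug log above.
struct RdxCost {
  int TreeCost;      // "Total Cost" of the vectorizable tree
  int ReductionCost; // cost added for the reduction itself
};

int totalCost(const RdxCost &C) { return C.TreeCost + C.ReductionCost; }

// Assumed rule: vectorize only if the total is below -Threshold.
bool isProfitable(const RdxCost &C, int Threshold = 0) {
  return totalCost(C) < -Threshold;
}
```

With the logged values (tree cost -2, reduction cost 1) the total is -1, so the reduction is formed. A reduction cost of 2 or more would have blocked it, which is why the SLM cost of a v2i64 add matters here.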

RKSimon mentioned this in rGa091f7061068: [CostModel][X86] Improve add vXi64 + fadd vXf64 reduction tests for SLM. Nov 6 2019, 9:59 AM

spatel marked an inline comment as done. Nov 6 2019, 10:15 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	@RKSimon improved the SLM costs with rGa091f7061068, so that will remove this test diff from this patch. Based on the x86 asm, we actually do want to vectorize this example, but that's yet another cost model problem.

Patch updated:
Rebased to eliminate SLM distraction.

RKSimon mentioned this in rG1786047b9105: [X86] Fix SLM v2i64 ADD/Sub/CMPEQ instruction schedules. Nov 6 2019, 11:13 AM

RKSimon mentioned this in rGad70d5f39ae9: [X86] Fix SLM v2f64 ADD/MUL + FP BLEND/HADD instruction schedules.

Patch updated:
Limit the extra analysis to 2-way reductions only for efficiency (to save compile-time). This converts the more general minimum-width parameter from the earlier rev to a boolean flag.

spatel marked an inline comment as done. Nov 6 2019, 12:09 PM

xbolva00 added a subscriber: xbolva00. Nov 6 2019, 12:34 PM

xbolva00 added inline comments.

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120 ↗	(On Diff #228118)	bool Try2WayRdx = false ?

spatel marked 2 inline comments as done. Nov 6 2019, 1:13 PM

spatel added inline comments.

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120 ↗	(On Diff #228118)	I don't think it's a good thing in general to have bool args with defaults...it makes reading the code harder. But this is all a hack at this point anyway, so sure, let's reduce the diffs. :)

Patch updated:
Add default for new bool parameter to eliminate diffs for existing calls.

This revision is now accepted and ready to land. Nov 6 2019, 1:19 PM

Closed by commit rG7ff57705ba19: [SLP] allow forming 2-way reduction patterns (authored by spatel). Nov 7 2019, 4:11 AM

This revision was automatically updated to reflect the committed changes.

Reopening - this uncovered existing ways to miscompile, may have created new ways to miscompile, and caused major perf regressions. In the running for best patch ever. :)

Reverted here:
rG714aabacfb0f9b372cf230f1b7113e3ebd0e661d

This revision is now accepted and ready to land. Nov 21 2019, 5:29 AM

spatel added a reverting change: D70607: [x86] make SLM extract vector element more expensive than default. Nov 22 2019, 10:51 AM

spatel added a reverting change: rG5c166f1d1969: [x86] make SLM extract vector element more expensive than default. Nov 27 2019, 11:13 AM

Phab was tricked into saying that D70607 was a r3v3rt of this patch. It was not.

@echristo - any hope of getting tests that show the miscompile and/or perf problems raised by this patch?

spatel planned changes to this revision. Dec 1 2019, 8:02 AM

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.
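
The restriction described here amounts to a small gating predicate. The sketch below restates the early-exit checks from this revision's tryToReduce() as a standalone function. The function name is hypothetical, and ScalarSizeInBits stands in for ReductionRoot->getType()->getScalarSizeInBits().

```cpp
#include <cassert>

// Hypothetical helper mirroring the early exits in this revision of
// tryToReduce(); it is not part of SLPVectorizer.cpp.
bool shouldTryReduction(unsigned NumReducedVals, unsigned ScalarSizeInBits) {
  // A reduction needs at least two values.
  if (NumReducedVals < 2)
    return false;
  // 2- and 3-value reductions are allowed only for i1, that is, for
  // compare results; wider types keep the old minimum of 4 values.
  if (NumReducedVals < 4 && ScalarSizeInBits != 1)
    return false;
  return true;
}
```

Under this rule the i1 `and` reduction in @cmp2 is a candidate, while the two-element `fadd fast double` reduction in @fadd2 is rejected.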

This revision is now accepted and ready to land. Dec 1 2019, 2:00 PM

spatel requested review of this revision. Dec 1 2019, 2:01 PM

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

In D59710#1765236, @ABataev wrote:

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.

But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

In D59710#1765299, @spatel wrote:

In D59710#1765236, @ABataev wrote:

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.

But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

The reduction is profitable because of the cost model. It is the same problem, the scalar load+add combination in many cases has a too high cost, I think. And because of this problem, the reduction for 2 elements looks profitable in some cases.

arsenm resigned from this revision. Feb 13 2020, 2:51 PM

Abandoning. I don't see a way forward for SLP on this problem. Neither the theoretically correct patch nor practically-limited variations are acceptable.

spatel mentioned this in D82474: [VectorCombine] try to form vector compare and binop to eliminate scalar ops. Jun 24 2020, 9:07 AM

spatel mentioned this in D82602: [SelectionDAG] don't split branch on logic-of-vector-compares. Jun 26 2020, 5:14 AM

spatel mentioned this in rGb6315aee5b42: [VectorCombine] try to form vector compare and binop to eliminate scalar ops. Jun 29 2020, 8:04 AM

spatel mentioned this in D87772: [SLP] sort candidates to increase chance of optimal compare reduction. Sep 16 2020, 9:35 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	13 lines
llvm/test/Transforms/SLPVectorizer/X86/fabs-cost-softfp.ll	8 lines
llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll	88 lines
llvm/test/Transforms/SLPVectorizer/X86/reduction2.ll	35 lines

Diff 231625

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[... 6,663 lines not shown ...]
   bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
     if (ReducedVals.empty())
       return false;

     // If there is a sufficient number of reduction values, reduce
     // to a nearby power-of-2. Can safely generate oversized
     // vectors and rely on the backend to split them to legal sizes.
     unsigned NumReducedVals = ReducedVals.size();
-    if (NumReducedVals < 4)
+    if (NumReducedVals < 2)
       return false;

+    // Allow 2-way reductions only for comparisons (bool type). Ideally, we
+    // would allow this for any type, but it may interfere with other
+    // vectorization attempts.
+    if (NumReducedVals < 4 &&
+        ReductionRoot->getType()->getScalarSizeInBits() != 1)
+      return false;
+
     unsigned ReduxWidth = PowerOf2Floor(NumReducedVals);
+    unsigned MinRdxWidth = Log2_32(ReduxWidth);
     Value *VectorizedTree = nullptr;

     // FIXME: Fast-math-flags should be set based on the instructions in the
     // reduction (not all of 'fast' are required).
     IRBuilder<> Builder(cast<Instruction>(ReductionRoot));
     FastMathFlags Unsafe;
     Unsafe.setFast();
     Builder.setFastMathFlags(Unsafe);
[... 19 lines not shown ...]
     };

     // The reduction root is used as the insertion point for new instructions,
     // so set it as externally used to prevent it from being deleted.
     ExternallyUsedValues[ReductionRoot];
     SmallVector<Value *, 16> IgnoreList;
     for (auto &V : ReductionOps)
       IgnoreList.append(V.begin(), V.end());
-    while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
+    while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > MinRdxWidth) {
       auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
       V.buildTree(VL, ExternallyUsedValues, IgnoreList);
       Optional<ArrayRef<unsigned>> Order = V.bestOrder();
       // TODO: Handle orders of size less than number of elements in the vector.
       if (Order && Order->size() == VL.size()) {
         // TODO: reorder tree nodes without tree rebuilding.
         SmallVector<Value *, 4> ReorderedOps(VL.size());
         llvm::transform(*Order, ReorderedOps.begin(),
[... 634 lines not shown ...]
 bool SLPVectorizerPass::vectorizeGEPIndices(BasicBlock *BB, BoUpSLP &R) {
   auto Changed = false;
   for (auto &Entry : GEPs) {
     // If the getelementptr list has fewer than two elements, there's nothing
     // to do.
     if (Entry.second.size() < 2)
       continue;

     LLVM_DEBUG(dbgs() << "SLP: Analyzing a getelementptr list of length "
                       << Entry.second.size() << ".\n");

     // Process the GEP list in chunks suitable for the target's supported
     // vector size. If a vector register can't hold 1 element, we are done.
     unsigned MaxVecRegSize = R.getMaxVecRegSize();
     unsigned EltSize = R.getVectorElementSize(Entry.second[0]);
     if (MaxVecRegSize < EltSize)
       continue;
[... 103 lines not shown ...]

llvm/test/Transforms/SLPVectorizer/X86/fabs-cost-softfp.ll

[... 9 lines not shown ...]

 define void @vectorize_fp128(fp128 %c, fp128 %d) #0 {
 ; CHECK-LABEL: @vectorize_fp128(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x fp128> undef, fp128 [[C:%.*]], i32 0
 ; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x fp128> [[TMP0]], fp128 [[D:%.*]], i32 1
 ; CHECK-NEXT: [[TMP2:%.*]] = call <2 x fp128> @llvm.fabs.v2f128(<2 x fp128> [[TMP1]])
 ; CHECK-NEXT: [[TMP3:%.*]] = fcmp oeq <2 x fp128> [[TMP2]], <fp128 0xL00000000000000007FFF000000000000, fp128 0xL00000000000000007FFF000000000000>
-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP3]], i32 0
-; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i1> [[TMP3]], i32 1
-; CHECK-NEXT: [[OR_COND39:%.*]] = or i1 [[TMP4]], [[TMP5]]
-; CHECK-NEXT: br i1 [[OR_COND39]], label [[IF_THEN13:%.*]], label [[IF_END24:%.*]]
+; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[TMP3]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
+; CHECK-NEXT: [[BIN_RDX:%.*]] = or <2 x i1> [[TMP3]], [[RDX_SHUF]]
+; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
+; CHECK-NEXT: br i1 [[TMP4]], label [[IF_THEN13:%.*]], label [[IF_END24:%.*]]
 ; CHECK: if.then13:
 ; CHECK-NEXT: unreachable
 ; CHECK: if.end24:
 ; CHECK-NEXT: ret void
 ;
 entry:
 %0 = tail call fp128 @llvm.fabs.f128(fp128 %c)
 %cmpinf10 = fcmp oeq fp128 %0, 0xL00000000000000007FFF000000000000
[... 15 lines not shown ...]

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll

[... 47 lines not shown ...]
 ; THRESHOLD-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
 ; THRESHOLD-NEXT: [[ADD_1:%.*]] = fadd fast float [[TMP5]], [[ADD]]
 ; THRESHOLD-NEXT: [[TMP6:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr, i64 0, i64 2) to <2 x float>*), align 8
 ; THRESHOLD-NEXT: [[TMP7:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([20 x float], [20 x float]* @arr1, i64 0, i64 2) to <2 x float>*), align 8
 ; THRESHOLD-NEXT: [[TMP8:%.*]] = fmul fast <2 x float> [[TMP7]], [[TMP6]]
 ; THRESHOLD-NEXT: [[TMP9:%.*]] = extractelement <2 x float> [[TMP8]], i32 0
 ; THRESHOLD-NEXT: [[ADD_2:%.*]] = fadd fast float [[TMP9]], [[ADD_1]]
 ; THRESHOLD-NEXT: [[TMP10:%.*]] = extractelement <2 x float> [[TMP8]], i32 1
 ; THRESHOLD-NEXT: [[ADD_3:%.*]] = fadd fast float [[TMP10]], [[ADD_2]]
 ; THRESHOLD-NEXT: [[ADD7:%.*]] = fadd fast float [[ADD_3]], [[CONV]]
 ; THRESHOLD-NEXT: [[ADD19:%.*]] = fadd fast float [[TMP4]], [[ADD7]]
 ; THRESHOLD-NEXT: [[ADD19_1:%.*]] = fadd fast float [[TMP5]], [[ADD19]]
 ; THRESHOLD-NEXT: [[ADD19_2:%.*]] = fadd fast float [[TMP9]], [[ADD19_1]]
 ; THRESHOLD-NEXT: [[ADD19_3:%.*]] = fadd fast float [[TMP10]], [[ADD19_2]]
 ; THRESHOLD-NEXT: store float [[ADD19_3]], float* @res, align 4
 ; THRESHOLD-NEXT: ret float [[ADD19_3]]
 ;
[... 789 lines not shown ...]
 define float @loadadd31(float* nocapture readonly %x) {
 ; CHECK-LABEL: @loadadd31(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[X:%.*]], i64 1
 ; CHECK-NEXT: [[TMP0:%.*]] = load float, float* [[ARRAYIDX]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds float, float* [[X]], i64 2
 ; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX_1]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds float, float* [[X]], i64 3
+; CHECK-NEXT: [[TMP2:%.*]] = load float, float* [[ARRAYIDX_2]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds float, float* [[X]], i64 4
+; CHECK-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX_3]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds float, float* [[X]], i64 5
+; CHECK-NEXT: [[TMP4:%.*]] = load float, float* [[ARRAYIDX_4]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds float, float* [[X]], i64 6
-; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[ARRAYIDX_2]] to <4 x float>*
-; CHECK-NEXT: [[TMP3:%.*]] = load <4 x float>, <4 x float>* [[TMP2]], align 4
+; CHECK-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX_5]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds float, float* [[X]], i64 7
 ; CHECK-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds float, float* [[X]], i64 8
 ; CHECK-NEXT: [[ARRAYIDX_8:%.*]] = getelementptr inbounds float, float* [[X]], i64 9
 ; CHECK-NEXT: [[ARRAYIDX_9:%.*]] = getelementptr inbounds float, float* [[X]], i64 10
 ; CHECK-NEXT: [[ARRAYIDX_10:%.*]] = getelementptr inbounds float, float* [[X]], i64 11
 ; CHECK-NEXT: [[ARRAYIDX_11:%.*]] = getelementptr inbounds float, float* [[X]], i64 12
 ; CHECK-NEXT: [[ARRAYIDX_12:%.*]] = getelementptr inbounds float, float* [[X]], i64 13
 ; CHECK-NEXT: [[ARRAYIDX_13:%.*]] = getelementptr inbounds float, float* [[X]], i64 14
-; CHECK-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX_6]] to <8 x float>*
-; CHECK-NEXT: [[TMP5:%.*]] = load <8 x float>, <8 x float>* [[TMP4]], align 4
+; CHECK-NEXT: [[TMP6:%.*]] = bitcast float* [[ARRAYIDX_6]] to <8 x float>*
+; CHECK-NEXT: [[TMP7:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4
 ; CHECK-NEXT: [[ARRAYIDX_14:%.*]] = getelementptr inbounds float, float* [[X]], i64 15
 ; CHECK-NEXT: [[ARRAYIDX_15:%.*]] = getelementptr inbounds float, float* [[X]], i64 16
 ; CHECK-NEXT: [[ARRAYIDX_16:%.*]] = getelementptr inbounds float, float* [[X]], i64 17
 ; CHECK-NEXT: [[ARRAYIDX_17:%.*]] = getelementptr inbounds float, float* [[X]], i64 18
	; CHECK-NEXT: [[ARRAYIDX_18:%.]] = getelementptr inbounds float, float [[X]], i64 19			; CHECK-NEXT: [[ARRAYIDX_18:%.]] = getelementptr inbounds float, float [[X]], i64 19
	; CHECK-NEXT: [[ARRAYIDX_19:%.]] = getelementptr inbounds float, float [[X]], i64 20			; CHECK-NEXT: [[ARRAYIDX_19:%.]] = getelementptr inbounds float, float [[X]], i64 20
	; CHECK-NEXT: [[ARRAYIDX_20:%.]] = getelementptr inbounds float, float [[X]], i64 21			; CHECK-NEXT: [[ARRAYIDX_20:%.]] = getelementptr inbounds float, float [[X]], i64 21
	; CHECK-NEXT: [[ARRAYIDX_21:%.]] = getelementptr inbounds float, float [[X]], i64 22			; CHECK-NEXT: [[ARRAYIDX_21:%.]] = getelementptr inbounds float, float [[X]], i64 22
	; CHECK-NEXT: [[ARRAYIDX_22:%.]] = getelementptr inbounds float, float [[X]], i64 23			; CHECK-NEXT: [[ARRAYIDX_22:%.]] = getelementptr inbounds float, float [[X]], i64 23
	; CHECK-NEXT: [[ARRAYIDX_23:%.]] = getelementptr inbounds float, float [[X]], i64 24			; CHECK-NEXT: [[ARRAYIDX_23:%.]] = getelementptr inbounds float, float [[X]], i64 24
	; CHECK-NEXT: [[ARRAYIDX_24:%.]] = getelementptr inbounds float, float [[X]], i64 25			; CHECK-NEXT: [[ARRAYIDX_24:%.]] = getelementptr inbounds float, float [[X]], i64 25
	; CHECK-NEXT: [[ARRAYIDX_25:%.]] = getelementptr inbounds float, float [[X]], i64 26			; CHECK-NEXT: [[ARRAYIDX_25:%.]] = getelementptr inbounds float, float [[X]], i64 26
	; CHECK-NEXT: [[ARRAYIDX_26:%.]] = getelementptr inbounds float, float [[X]], i64 27			; CHECK-NEXT: [[ARRAYIDX_26:%.]] = getelementptr inbounds float, float [[X]], i64 27
	; CHECK-NEXT: [[ARRAYIDX_27:%.]] = getelementptr inbounds float, float [[X]], i64 28			; CHECK-NEXT: [[ARRAYIDX_27:%.]] = getelementptr inbounds float, float [[X]], i64 28
	; CHECK-NEXT: [[ARRAYIDX_28:%.]] = getelementptr inbounds float, float [[X]], i64 29			; CHECK-NEXT: [[ARRAYIDX_28:%.]] = getelementptr inbounds float, float [[X]], i64 29
	; CHECK-NEXT: [[ARRAYIDX_29:%.]] = getelementptr inbounds float, float [[X]], i64 30			; CHECK-NEXT: [[ARRAYIDX_29:%.]] = getelementptr inbounds float, float [[X]], i64 30
	; CHECK-NEXT: [[TMP6:%.]] = bitcast float [[ARRAYIDX_14]] to <16 x float>*			; CHECK-NEXT: [[TMP8:%.]] = bitcast float [[ARRAYIDX_14]] to <16 x float>*
	; CHECK-NEXT: [[TMP7:%.]] = load <16 x float>, <16 x float> [[TMP6]], align 4			; CHECK-NEXT: [[TMP9:%.]] = load <16 x float>, <16 x float> [[TMP8]], align 4
	; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <16 x float> [[TMP7]], <16 x float> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <16 x float> [[TMP9]], <16 x float> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <16 x float> [[TMP7]], [[RDX_SHUF]]			; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast <16 x float> [[TMP9]], [[RDX_SHUF]]
	; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <16 x float> [[BIN_RDX]], <16 x float> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <16 x float> [[BIN_RDX]], <16 x float> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <16 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; CHECK-NEXT: [[BIN_RDX2:%.*]] = fadd fast <16 x float> [[BIN_RDX]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <16 x float> [[BIN_RDX2]], <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <16 x float> [[BIN_RDX2]], <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX4:%.*]] = fadd fast <16 x float> [[BIN_RDX2]], [[RDX_SHUF3]]			; CHECK-NEXT: [[BIN_RDX4:%.*]] = fadd fast <16 x float> [[BIN_RDX2]], [[RDX_SHUF3]]
	; CHECK-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <16 x float> [[BIN_RDX4]], <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <16 x float> [[BIN_RDX4]], <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX6:%.*]] = fadd fast <16 x float> [[BIN_RDX4]], [[RDX_SHUF5]]			; CHECK-NEXT: [[BIN_RDX6:%.*]] = fadd fast <16 x float> [[BIN_RDX4]], [[RDX_SHUF5]]
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <16 x float> [[BIN_RDX6]], i32 0			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <16 x float> [[BIN_RDX6]], i32 0
	; CHECK-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <8 x float> [[TMP5]], <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX8:%.*]] = fadd fast <8 x float> [[TMP5]], [[RDX_SHUF7]]			; CHECK-NEXT: [[BIN_RDX8:%.*]] = fadd fast <8 x float> [[TMP7]], [[RDX_SHUF7]]
	; CHECK-NEXT: [[RDX_SHUF9:%.*]] = shufflevector <8 x float> [[BIN_RDX8]], <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF9:%.*]] = shufflevector <8 x float> [[BIN_RDX8]], <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX10:%.*]] = fadd fast <8 x float> [[BIN_RDX8]], [[RDX_SHUF9]]			; CHECK-NEXT: [[BIN_RDX10:%.*]] = fadd fast <8 x float> [[BIN_RDX8]], [[RDX_SHUF9]]
	; CHECK-NEXT: [[RDX_SHUF11:%.*]] = shufflevector <8 x float> [[BIN_RDX10]], <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[RDX_SHUF11:%.*]] = shufflevector <8 x float> [[BIN_RDX10]], <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; CHECK-NEXT: [[BIN_RDX12:%.*]] = fadd fast <8 x float> [[BIN_RDX10]], [[RDX_SHUF11]]			; CHECK-NEXT: [[BIN_RDX12:%.*]] = fadd fast <8 x float> [[BIN_RDX10]], [[RDX_SHUF11]]
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <8 x float> [[BIN_RDX12]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <8 x float> [[BIN_RDX12]], i32 0
	; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast float [[TMP8]], [[TMP9]]			; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast float [[TMP10]], [[TMP11]]
	; CHECK-NEXT: [[RDX_SHUF13:%.*]] = shufflevector <4 x float> [[TMP3]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP12:%.*]] = fadd fast float [[OP_RDX]], [[TMP5]]
	; CHECK-NEXT: [[BIN_RDX14:%.*]] = fadd fast <4 x float> [[TMP3]], [[RDX_SHUF13]]			; CHECK-NEXT: [[TMP13:%.*]] = fadd fast float [[TMP12]], [[TMP4]]
	; CHECK-NEXT: [[RDX_SHUF15:%.*]] = shufflevector <4 x float> [[BIN_RDX14]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; CHECK-NEXT: [[TMP14:%.*]] = fadd fast float [[TMP13]], [[TMP3]]
	; CHECK-NEXT: [[BIN_RDX16:%.*]] = fadd fast <4 x float> [[BIN_RDX14]], [[RDX_SHUF15]]			; CHECK-NEXT: [[TMP15:%.*]] = fadd fast float [[TMP14]], [[TMP2]]
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[BIN_RDX16]], i32 0			; CHECK-NEXT: [[TMP16:%.*]] = fadd fast float [[TMP15]], [[TMP1]]
	; CHECK-NEXT: [[OP_RDX17:%.*]] = fadd fast float [[OP_RDX]], [[TMP10]]			; CHECK-NEXT: [[TMP17:%.*]] = fadd fast float [[TMP16]], [[TMP0]]
	; CHECK-NEXT: [[TMP11:%.*]] = fadd fast float [[OP_RDX17]], [[TMP1]]			; CHECK-NEXT: ret float [[TMP17]]
	; CHECK-NEXT: [[TMP12:%.*]] = fadd fast float [[TMP11]], [[TMP0]]
	; CHECK-NEXT: ret float [[TMP12]]
	;			;
	; THRESHOLD-LABEL: @loadadd31(			; THRESHOLD-LABEL: @loadadd31(
	; THRESHOLD-NEXT: entry:			; THRESHOLD-NEXT: entry:
	; THRESHOLD-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[X:%.*]], i64 1			; THRESHOLD-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[X:%.*]], i64 1
	; THRESHOLD-NEXT: [[TMP0:%.*]] = load float, float* [[ARRAYIDX]], align 4			; THRESHOLD-NEXT: [[TMP0:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds float, float* [[X]], i64 2			; THRESHOLD-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds float, float* [[X]], i64 2
	; THRESHOLD-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX_1]], align 4			; THRESHOLD-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX_1]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds float, float* [[X]], i64 3			; THRESHOLD-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds float, float* [[X]], i64 3
				; THRESHOLD-NEXT: [[TMP2:%.*]] = load float, float* [[ARRAYIDX_2]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds float, float* [[X]], i64 4			; THRESHOLD-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds float, float* [[X]], i64 4
				; THRESHOLD-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX_3]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds float, float* [[X]], i64 5			; THRESHOLD-NEXT: [[ARRAYIDX_4:%.*]] = getelementptr inbounds float, float* [[X]], i64 5
				; THRESHOLD-NEXT: [[TMP4:%.*]] = load float, float* [[ARRAYIDX_4]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds float, float* [[X]], i64 6			; THRESHOLD-NEXT: [[ARRAYIDX_5:%.*]] = getelementptr inbounds float, float* [[X]], i64 6
	; THRESHOLD-NEXT: [[TMP2:%.*]] = bitcast float* [[ARRAYIDX_2]] to <4 x float>*			; THRESHOLD-NEXT: [[TMP5:%.*]] = load float, float* [[ARRAYIDX_5]], align 4
	; THRESHOLD-NEXT: [[TMP3:%.*]] = load <4 x float>, <4 x float>* [[TMP2]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds float, float* [[X]], i64 7			; THRESHOLD-NEXT: [[ARRAYIDX_6:%.*]] = getelementptr inbounds float, float* [[X]], i64 7
	; THRESHOLD-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds float, float* [[X]], i64 8			; THRESHOLD-NEXT: [[ARRAYIDX_7:%.*]] = getelementptr inbounds float, float* [[X]], i64 8
	; THRESHOLD-NEXT: [[ARRAYIDX_8:%.*]] = getelementptr inbounds float, float* [[X]], i64 9			; THRESHOLD-NEXT: [[ARRAYIDX_8:%.*]] = getelementptr inbounds float, float* [[X]], i64 9
	; THRESHOLD-NEXT: [[ARRAYIDX_9:%.*]] = getelementptr inbounds float, float* [[X]], i64 10			; THRESHOLD-NEXT: [[ARRAYIDX_9:%.*]] = getelementptr inbounds float, float* [[X]], i64 10
	; THRESHOLD-NEXT: [[ARRAYIDX_10:%.*]] = getelementptr inbounds float, float* [[X]], i64 11			; THRESHOLD-NEXT: [[ARRAYIDX_10:%.*]] = getelementptr inbounds float, float* [[X]], i64 11
	; THRESHOLD-NEXT: [[ARRAYIDX_11:%.*]] = getelementptr inbounds float, float* [[X]], i64 12			; THRESHOLD-NEXT: [[ARRAYIDX_11:%.*]] = getelementptr inbounds float, float* [[X]], i64 12
	; THRESHOLD-NEXT: [[ARRAYIDX_12:%.*]] = getelementptr inbounds float, float* [[X]], i64 13			; THRESHOLD-NEXT: [[ARRAYIDX_12:%.*]] = getelementptr inbounds float, float* [[X]], i64 13
	; THRESHOLD-NEXT: [[ARRAYIDX_13:%.*]] = getelementptr inbounds float, float* [[X]], i64 14			; THRESHOLD-NEXT: [[ARRAYIDX_13:%.*]] = getelementptr inbounds float, float* [[X]], i64 14
	; THRESHOLD-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX_6]] to <8 x float>*			; THRESHOLD-NEXT: [[TMP6:%.*]] = bitcast float* [[ARRAYIDX_6]] to <8 x float>*
	; THRESHOLD-NEXT: [[TMP5:%.*]] = load <8 x float>, <8 x float>* [[TMP4]], align 4			; THRESHOLD-NEXT: [[TMP7:%.*]] = load <8 x float>, <8 x float>* [[TMP6]], align 4
	; THRESHOLD-NEXT: [[ARRAYIDX_14:%.*]] = getelementptr inbounds float, float* [[X]], i64 15			; THRESHOLD-NEXT: [[ARRAYIDX_14:%.*]] = getelementptr inbounds float, float* [[X]], i64 15
	; THRESHOLD-NEXT: [[ARRAYIDX_15:%.*]] = getelementptr inbounds float, float* [[X]], i64 16			; THRESHOLD-NEXT: [[ARRAYIDX_15:%.*]] = getelementptr inbounds float, float* [[X]], i64 16
	; THRESHOLD-NEXT: [[ARRAYIDX_16:%.*]] = getelementptr inbounds float, float* [[X]], i64 17			; THRESHOLD-NEXT: [[ARRAYIDX_16:%.*]] = getelementptr inbounds float, float* [[X]], i64 17
	; THRESHOLD-NEXT: [[ARRAYIDX_17:%.*]] = getelementptr inbounds float, float* [[X]], i64 18			; THRESHOLD-NEXT: [[ARRAYIDX_17:%.*]] = getelementptr inbounds float, float* [[X]], i64 18
	; THRESHOLD-NEXT: [[ARRAYIDX_18:%.*]] = getelementptr inbounds float, float* [[X]], i64 19			; THRESHOLD-NEXT: [[ARRAYIDX_18:%.*]] = getelementptr inbounds float, float* [[X]], i64 19
	; THRESHOLD-NEXT: [[ARRAYIDX_19:%.*]] = getelementptr inbounds float, float* [[X]], i64 20			; THRESHOLD-NEXT: [[ARRAYIDX_19:%.*]] = getelementptr inbounds float, float* [[X]], i64 20
	; THRESHOLD-NEXT: [[ARRAYIDX_20:%.*]] = getelementptr inbounds float, float* [[X]], i64 21			; THRESHOLD-NEXT: [[ARRAYIDX_20:%.*]] = getelementptr inbounds float, float* [[X]], i64 21
	; THRESHOLD-NEXT: [[ARRAYIDX_21:%.*]] = getelementptr inbounds float, float* [[X]], i64 22			; THRESHOLD-NEXT: [[ARRAYIDX_21:%.*]] = getelementptr inbounds float, float* [[X]], i64 22
	; THRESHOLD-NEXT: [[ARRAYIDX_22:%.*]] = getelementptr inbounds float, float* [[X]], i64 23			; THRESHOLD-NEXT: [[ARRAYIDX_22:%.*]] = getelementptr inbounds float, float* [[X]], i64 23
	; THRESHOLD-NEXT: [[ARRAYIDX_23:%.*]] = getelementptr inbounds float, float* [[X]], i64 24			; THRESHOLD-NEXT: [[ARRAYIDX_23:%.*]] = getelementptr inbounds float, float* [[X]], i64 24
	; THRESHOLD-NEXT: [[ARRAYIDX_24:%.*]] = getelementptr inbounds float, float* [[X]], i64 25			; THRESHOLD-NEXT: [[ARRAYIDX_24:%.*]] = getelementptr inbounds float, float* [[X]], i64 25
	; THRESHOLD-NEXT: [[ARRAYIDX_25:%.*]] = getelementptr inbounds float, float* [[X]], i64 26			; THRESHOLD-NEXT: [[ARRAYIDX_25:%.*]] = getelementptr inbounds float, float* [[X]], i64 26
	; THRESHOLD-NEXT: [[ARRAYIDX_26:%.*]] = getelementptr inbounds float, float* [[X]], i64 27			; THRESHOLD-NEXT: [[ARRAYIDX_26:%.*]] = getelementptr inbounds float, float* [[X]], i64 27
	; THRESHOLD-NEXT: [[ARRAYIDX_27:%.*]] = getelementptr inbounds float, float* [[X]], i64 28			; THRESHOLD-NEXT: [[ARRAYIDX_27:%.*]] = getelementptr inbounds float, float* [[X]], i64 28
	; THRESHOLD-NEXT: [[ARRAYIDX_28:%.*]] = getelementptr inbounds float, float* [[X]], i64 29			; THRESHOLD-NEXT: [[ARRAYIDX_28:%.*]] = getelementptr inbounds float, float* [[X]], i64 29
	; THRESHOLD-NEXT: [[ARRAYIDX_29:%.*]] = getelementptr inbounds float, float* [[X]], i64 30			; THRESHOLD-NEXT: [[ARRAYIDX_29:%.*]] = getelementptr inbounds float, float* [[X]], i64 30
	; THRESHOLD-NEXT: [[TMP6:%.*]] = bitcast float* [[ARRAYIDX_14]] to <16 x float>*			; THRESHOLD-NEXT: [[TMP8:%.*]] = bitcast float* [[ARRAYIDX_14]] to <16 x float>*
	; THRESHOLD-NEXT: [[TMP7:%.*]] = load <16 x float>, <16 x float>* [[TMP6]], align 4			; THRESHOLD-NEXT: [[TMP9:%.*]] = load <16 x float>, <16 x float>* [[TMP8]], align 4
	; THRESHOLD-NEXT: [[RDX_SHUF:%.*]] = shufflevector <16 x float> [[TMP7]], <16 x float> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF:%.*]] = shufflevector <16 x float> [[TMP9]], <16 x float> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX:%.*]] = fadd fast <16 x float> [[TMP7]], [[RDX_SHUF]]			; THRESHOLD-NEXT: [[BIN_RDX:%.*]] = fadd fast <16 x float> [[TMP9]], [[RDX_SHUF]]
	; THRESHOLD-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <16 x float> [[BIN_RDX]], <16 x float> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <16 x float> [[BIN_RDX]], <16 x float> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX2:%.*]] = fadd fast <16 x float> [[BIN_RDX]], [[RDX_SHUF1]]			; THRESHOLD-NEXT: [[BIN_RDX2:%.*]] = fadd fast <16 x float> [[BIN_RDX]], [[RDX_SHUF1]]
	; THRESHOLD-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <16 x float> [[BIN_RDX2]], <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <16 x float> [[BIN_RDX2]], <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX4:%.*]] = fadd fast <16 x float> [[BIN_RDX2]], [[RDX_SHUF3]]			; THRESHOLD-NEXT: [[BIN_RDX4:%.*]] = fadd fast <16 x float> [[BIN_RDX2]], [[RDX_SHUF3]]
	; THRESHOLD-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <16 x float> [[BIN_RDX4]], <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF5:%.*]] = shufflevector <16 x float> [[BIN_RDX4]], <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX6:%.*]] = fadd fast <16 x float> [[BIN_RDX4]], [[RDX_SHUF5]]			; THRESHOLD-NEXT: [[BIN_RDX6:%.*]] = fadd fast <16 x float> [[BIN_RDX4]], [[RDX_SHUF5]]
	; THRESHOLD-NEXT: [[TMP8:%.*]] = extractelement <16 x float> [[BIN_RDX6]], i32 0			; THRESHOLD-NEXT: [[TMP10:%.*]] = extractelement <16 x float> [[BIN_RDX6]], i32 0
	; THRESHOLD-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <8 x float> [[TMP5]], <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF7:%.*]] = shufflevector <8 x float> [[TMP7]], <8 x float> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX8:%.*]] = fadd fast <8 x float> [[TMP5]], [[RDX_SHUF7]]			; THRESHOLD-NEXT: [[BIN_RDX8:%.*]] = fadd fast <8 x float> [[TMP7]], [[RDX_SHUF7]]
	; THRESHOLD-NEXT: [[RDX_SHUF9:%.*]] = shufflevector <8 x float> [[BIN_RDX8]], <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF9:%.*]] = shufflevector <8 x float> [[BIN_RDX8]], <8 x float> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX10:%.*]] = fadd fast <8 x float> [[BIN_RDX8]], [[RDX_SHUF9]]			; THRESHOLD-NEXT: [[BIN_RDX10:%.*]] = fadd fast <8 x float> [[BIN_RDX8]], [[RDX_SHUF9]]
	; THRESHOLD-NEXT: [[RDX_SHUF11:%.*]] = shufflevector <8 x float> [[BIN_RDX10]], <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[RDX_SHUF11:%.*]] = shufflevector <8 x float> [[BIN_RDX10]], <8 x float> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	; THRESHOLD-NEXT: [[BIN_RDX12:%.*]] = fadd fast <8 x float> [[BIN_RDX10]], [[RDX_SHUF11]]			; THRESHOLD-NEXT: [[BIN_RDX12:%.*]] = fadd fast <8 x float> [[BIN_RDX10]], [[RDX_SHUF11]]
	; THRESHOLD-NEXT: [[TMP9:%.*]] = extractelement <8 x float> [[BIN_RDX12]], i32 0			; THRESHOLD-NEXT: [[TMP11:%.*]] = extractelement <8 x float> [[BIN_RDX12]], i32 0
	; THRESHOLD-NEXT: [[OP_RDX:%.*]] = fadd fast float [[TMP8]], [[TMP9]]			; THRESHOLD-NEXT: [[OP_RDX:%.*]] = fadd fast float [[TMP10]], [[TMP11]]
	; THRESHOLD-NEXT: [[RDX_SHUF13:%.*]] = shufflevector <4 x float> [[TMP3]], <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[TMP12:%.*]] = fadd fast float [[OP_RDX]], [[TMP5]]
	; THRESHOLD-NEXT: [[BIN_RDX14:%.*]] = fadd fast <4 x float> [[TMP3]], [[RDX_SHUF13]]			; THRESHOLD-NEXT: [[TMP13:%.*]] = fadd fast float [[TMP12]], [[TMP4]]
	; THRESHOLD-NEXT: [[RDX_SHUF15:%.*]] = shufflevector <4 x float> [[BIN_RDX14]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>			; THRESHOLD-NEXT: [[TMP14:%.*]] = fadd fast float [[TMP13]], [[TMP3]]
	; THRESHOLD-NEXT: [[BIN_RDX16:%.*]] = fadd fast <4 x float> [[BIN_RDX14]], [[RDX_SHUF15]]			; THRESHOLD-NEXT: [[TMP15:%.*]] = fadd fast float [[TMP14]], [[TMP2]]
	; THRESHOLD-NEXT: [[TMP10:%.*]] = extractelement <4 x float> [[BIN_RDX16]], i32 0			; THRESHOLD-NEXT: [[TMP16:%.*]] = fadd fast float [[TMP15]], [[TMP1]]
	; THRESHOLD-NEXT: [[OP_RDX17:%.*]] = fadd fast float [[OP_RDX]], [[TMP10]]			; THRESHOLD-NEXT: [[TMP17:%.*]] = fadd fast float [[TMP16]], [[TMP0]]
	; THRESHOLD-NEXT: [[TMP11:%.*]] = fadd fast float [[OP_RDX17]], [[TMP1]]			; THRESHOLD-NEXT: ret float [[TMP17]]
	; THRESHOLD-NEXT: [[TMP12:%.*]] = fadd fast float [[TMP11]], [[TMP0]]
	; THRESHOLD-NEXT: ret float [[TMP12]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds float, float* %x, i64 1			%arrayidx = getelementptr inbounds float, float* %x, i64 1
	%0 = load float, float* %arrayidx, align 4			%0 = load float, float* %arrayidx, align 4
	%arrayidx.1 = getelementptr inbounds float, float* %x, i64 2			%arrayidx.1 = getelementptr inbounds float, float* %x, i64 2
	%1 = load float, float* %arrayidx.1, align 4			%1 = load float, float* %arrayidx.1, align 4
	%add.1 = fadd fast float %1, %0			%add.1 = fadd fast float %1, %0
	%arrayidx.2 = getelementptr inbounds float, float* %x, i64 3			%arrayidx.2 = getelementptr inbounds float, float* %x, i64 3

llvm/test/Transforms/SLPVectorizer/X86/reduction2.ll


	; <label>:11 ; preds = %1			; <label>:11 ; preds = %1
	ret double %9			ret double %9
	}			}

	define i1 @two_wide_fcmp_reduction(<2 x double> %a0) {			define i1 @two_wide_fcmp_reduction(<2 x double> %a0) {
	; CHECK-LABEL: @two_wide_fcmp_reduction(			; CHECK-LABEL: @two_wide_fcmp_reduction(
	; CHECK-NEXT: [[A:%.*]] = fcmp ogt <2 x double> [[A0:%.*]], <double 1.000000e+00, double 1.000000e+00>			; CHECK-NEXT: [[A:%.*]] = fcmp ogt <2 x double> [[A0:%.*]], <double 1.000000e+00, double 1.000000e+00>
	; CHECK-NEXT: [[B:%.*]] = extractelement <2 x i1> [[A]], i32 0			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[A]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[C:%.*]] = extractelement <2 x i1> [[A]], i32 1			; CHECK-NEXT: [[BIN_RDX:%.*]] = and <2 x i1> [[A]], [[RDX_SHUF]]
	; CHECK-NEXT: [[D:%.*]] = and i1 [[B]], [[C]]			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
	; CHECK-NEXT: ret i1 [[D]]			; CHECK-NEXT: ret i1 [[TMP1]]
	;			;
	%a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>			%a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
	%b = extractelement <2 x i1> %a, i32 0			%b = extractelement <2 x i1> %a, i32 0
	%c = extractelement <2 x i1> %a, i32 1			%c = extractelement <2 x i1> %a, i32 1
	%d = and i1 %b, %c			%d = and i1 %b, %c
	ret i1 %d			ret i1 %d
	}			}
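
The new CHECK lines for `@two_wide_fcmp_reduction` replace the two `extractelement` instructions plus a scalar `and` with a shuffle, a vector `and`, and one `extractelement` of lane 0. The wider reductions in `horizontal-list.ll` follow the same shape with more halving steps. A minimal C++ sketch of that halving scheme, assuming plain arrays stand in for IR vectors; `shuffleReduce`, `scalarCmp2`, and `reducedCmp2` are hypothetical names for illustration and are not part of LLVM or of this patch:

```cpp
#include <array>
#include <cstddef>
#include <functional>

// Hypothetical helper (not LLVM code): models the RDX_SHUF / BIN_RDX steps.
// Each step combines the upper half of the live lanes with the lower half.
// Lanes past the live range are undef in the IR and are ignored here.
template <typename T, std::size_t N, typename Op>
T shuffleReduce(std::array<T, N> Vec, Op Combine) {
  static_assert(N >= 2 && (N & (N - 1)) == 0, "width must be a power of two");
  for (std::size_t Half = N / 2; Half >= 1; Half /= 2)
    for (std::size_t I = 0; I < Half; ++I)
      Vec[I] = Combine(Vec[I], Vec[I + Half]);
  return Vec[0]; // plays the role of the final extractelement ..., i32 0
}

// Scalar form of the test input: two extracts followed by a scalar 'and'.
inline bool scalarCmp2(double A0, double A1) {
  bool B = A0 > 1.0; // like fcmp ogt: false when the operand is NaN
  bool C = A1 > 1.0;
  return B && C;
}

// Reduction form: compare both lanes, then run one halving step.
inline bool reducedCmp2(double A0, double A1) {
  std::array<bool, 2> Cmp = {A0 > 1.0, A1 > 1.0};
  return shuffleReduce(Cmp, std::logical_and<bool>());
}
```

For a 2-wide vector the loop runs once, which corresponds to the single `RDX_SHUF`/`BIN_RDX` pair in the expected output; a 16-wide vector takes four steps, as in the `<16 x float>` sequence of `@loadadd31`.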

	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[FNEG]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[FNEG]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[C:%.*]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[C:%.*]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]			; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[MUL]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[MUL]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP7]], i32 1			; CHECK-NEXT: [[TMP8:%.*]] = fcmp olt <2 x double> [[TMP7]], <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
	; CHECK-NEXT: [[CMP:%.*]] = fcmp olt double [[TMP8]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <2 x i1> [[TMP8]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x double> [[TMP7]], i32 0			; CHECK-NEXT: [[BIN_RDX2:%.*]] = and <2 x i1> [[TMP8]], [[RDX_SHUF1]]
	; CHECK-NEXT: [[CMP4:%.*]] = fcmp olt double [[TMP9]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x i1> [[BIN_RDX2]], i32 0
	; CHECK-NEXT: [[OR_COND:%.*]] = and i1 [[CMP]], [[CMP4]]			; CHECK-NEXT: br i1 [[TMP9]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]
	; CHECK-NEXT: br i1 [[OR_COND]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]
	; CHECK: lor.lhs.false:			; CHECK: lor.lhs.false:
	; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>			; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[TMP10]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1			; CHECK-NEXT: [[BIN_RDX:%.*]] = or <2 x i1> [[TMP10]], [[RDX_SHUF]]
	; CHECK-NEXT: [[NOT_OR_COND9:%.*]] = or i1 [[TMP11]], [[TMP12]]			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
	; CHECK-NEXT: ret i1 [[NOT_OR_COND9]]			; CHECK-NEXT: ret i1 [[TMP11]]
	; CHECK: cleanup:			; CHECK: cleanup:
	; CHECK-NEXT: ret i1 false			; CHECK-NEXT: ret i1 false
	;			;
	entry:			entry:
	%fneg = fneg double %b			%fneg = fneg double %b
	%add = fsub double %c, %b			%add = fsub double %c, %b
	%mul = fmul double %a, 2.000000e+00			%mul = fmul double %a, 2.000000e+00
	%div = fdiv double %add, %mul			%div = fdiv double %add, %mul
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[C:%.*]], i32 1			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> [[TMP1]], double [[C:%.*]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[B]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> [[TMP3]], double [[B]], i32 1
	; CHECK-NEXT: [[TMP5:%.*]] = fsub <2 x double> [[TMP2]], [[TMP4]]			; CHECK-NEXT: [[TMP5:%.*]] = fsub <2 x double> [[TMP2]], [[TMP4]]
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[MUL]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[TMP6]], double [[MUL]], i32 1
	; CHECK-NEXT: [[TMP8:%.*]] = fdiv <2 x double> [[TMP5]], [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = fdiv <2 x double> [[TMP5]], [[TMP7]]
	; CHECK-NEXT: [[TMP9:%.*]] = fcmp uge <2 x double> [[TMP8]], <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>			; CHECK-NEXT: [[TMP9:%.*]] = fcmp uge <2 x double> [[TMP8]], <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
	; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i1> [[TMP9]], i32 0			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[TMP9]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP9]], i32 1			; CHECK-NEXT: [[BIN_RDX:%.*]] = or <2 x i1> [[TMP9]], [[RDX_SHUF]]
	; CHECK-NEXT: [[NOT_OR_COND:%.*]] = or i1 [[TMP10]], [[TMP11]]			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
	; CHECK-NEXT: ret i1 [[NOT_OR_COND]]			; CHECK-NEXT: ret i1 [[TMP10]]
	;			;
	%fneg = fneg double %b			%fneg = fneg double %b
	%add = fsub double %c, %b			%add = fsub double %c, %b
	%mul = fmul double %a, 2.000000e+00			%mul = fmul double %a, 2.000000e+00
	%div = fdiv double %add, %mul			%div = fdiv double %add, %mul
	%sub = fsub double %fneg, %c			%sub = fsub double %fneg, %c
	%div3 = fdiv double %sub, %mul			%div3 = fdiv double %sub, %mul
	%cmp = fcmp uge double %div, 0x3EB0C6F7A0B5ED8D			%cmp = fcmp uge double %div, 0x3EB0C6F7A0B5ED8D
	%cmp4 = fcmp uge double %div3, 0x3EB0C6F7A0B5ED8D			%cmp4 = fcmp uge double %div3, 0x3EB0C6F7A0B5ED8D
	%not.or.cond = or i1 %cmp4, %cmp			%not.or.cond = or i1 %cmp4, %cmp
	ret i1 %not.or.cond			ret i1 %not.or.cond
	}			}
