This is an archive of the discontinued LLVM Phabricator instance.

Differential D59710

[SLP] remove lower limit for forming reduction patterns
Abandoned · Public

Authored by spatel on Mar 22 2019, 12:35 PM.

Details

Reviewers

ABataev
RKSimon
dtemirbulatov
echristo
vporpo
arsenm

Commits

rG7ff57705ba19: [SLP] allow forming 2-way reduction patterns

Summary

We have a vector compare reduction problem seen in PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2

Or slightly reduced here:

define i1 @cmp2(<2 x double> %a0) {
  %a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
  %b = extractelement <2 x i1> %a, i32 0
  %c = extractelement <2 x i1> %a, i32 1
  %d = and i1 %b, %c
  ret i1 %d
}

SLP does not attempt to turn this into a vector reduction because there is an (artificial?) lower limit on that transform. I don't think we should have that limit: if the target's cost model says a reduction is cheaper (and it probably would be on x86), then we should do the transform.

Trying to make up for disallowing the transform in the backend (D59669) is not going to work. We would need to duplicate large chunks of IR optimizations. And it is clear that we can't do this as a target-independent canonicalization in instcombine because it involves creating shuffles and vector ops.
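To make the proposed transform concrete, here is a small standalone C++ model of the two forms of @cmp2. It is only an illustration, not LLVM code: `Mask2` and all three function names are hypothetical stand-ins, with `Mask2` playing the role of `<2 x i1>`. The scalar form extracts both lanes and ands them; the reduction form shuffles lane 1 down to lane 0, does one vector and, and extracts lane 0.

```cpp
#include <array>

// Hypothetical stand-in for the IR type <2 x i1>.
using Mask2 = std::array<bool, 2>;

// Models: %a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
Mask2 fcmpOgtOne(const std::array<double, 2> &A0) {
  return {A0[0] > 1.0, A0[1] > 1.0};
}

// Scalar form from the summary: extract lane 0 and lane 1, then 'and' them.
bool scalarAnd(const Mask2 &A) { return A[0] && A[1]; }

// Reduction form: shuffle lane 1 into lane 0, do a vector 'and', extract lane 0.
bool reducedAnd(const Mask2 &A) {
  Mask2 Shuf = {A[1], false}; // lane 1 is undef in the IR; its value is never read
  Mask2 Bin = {A[0] && Shuf[0], A[1] && Shuf[1]};
  return Bin[0];
}
```

Both forms return the same value for every input; whether the second form is actually cheaper is the cost-model question this review is about.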

Diff Detail

Event Timeline

spatel created this revision. Mar 22 2019, 12:35 PM

Herald added a project: Restricted Project. Mar 22 2019, 12:35 PM

Herald added subscribers: jdoerfert, hiraditya, javed.absar and 3 others.

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions and, only if that is unsuccessful, try to vectorize the operands of the instruction itself.

In D59710#1440002, @ABataev wrote:

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions and, only if that is unsuccessful, try to vectorize the operands of the instruction itself.

@ABataev Please can you be more specific on what icmp issues you mean? All I'm mainly seeing is a lot of scalar leftovers that InstCombine/InstSimplify will clean up if we added them as a pass in the tests.

spatel mentioned this in D59669: [x86] use movmsk when extracting multiple lanes of a vector compare (PR39665). Mar 23 2019, 6:34 AM

dmgreen added a subscriber: dmgreen. Mar 25 2019, 3:31 AM

ABataev added inline comments. Mar 25 2019, 8:08 AM

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	Check this test and the next few. Previously, we had a reduction of 4 elements; now we have a reduction of only 2 elements. This patch makes it worse than it was before.

In D59710#1440002, @ABataev wrote:

It really requires some additional improvements, because I see a lot of regressions for the cmp instructions. I think you should first try to vectorize cmp instructions using horizontal reductions and, only if that is unsuccessful, try to vectorize the operands of the instruction itself.

It's not clear to me how to reorganize SLP to make this happen, so if anyone has suggestions, please let me know. I want to be clear though that this patch is not just about cmp instructions.

This should be turned into a vector reduction too if the cost model says it is profitable:

define double @fadd2(<2 x double> %a0) {
  %a = fadd fast <2 x double> %a0, <double 1.000000e+00, double 1.000000e+00>
  %b = extractelement <2 x double> %a, i32 0
  %c = extractelement <2 x double> %a, i32 1
  %d = fadd fast double %b, %c
  ret double %d
}

RKSimon added a reviewer: arsenm. Apr 1 2019, 6:15 AM

RKSimon added a subscriber: arsenm.

RKSimon added inline comments.

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	@arsenm may be able to confirm, but AFAICT the AMDGPU changes don't appear to be relevant: for anything but i16 types it will scalarize in the backend anyhow, and we're just seeing the side effects of mostly zero costs for min/max, shuffle and extract/insert operations. The i16 reduction tests in AMDGPU/reduction.ll are more relevant and are not affected by this patch.

Herald added a subscriber: wdng. Apr 1 2019, 6:15 AM

ABataev added inline comments. Apr 1 2019, 6:25 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗	(On Diff #191921)	What about this one? This also looks like a regression

RKSimon added inline comments. Apr 1 2019, 8:35 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗	(On Diff #191921)	Sanjay and I have checked with godbolt/llvm-mca and this looks like a definite win (checked on bdver2, haswell and btver2). Top is scalar, middle is trunk and bottom is patched IR: bdver2: https://godbolt.org/z/jwCPgI haswell: https://godbolt.org/z/R-h8o_

ABataev added inline comments. Apr 1 2019, 9:33 AM

llvm/test/Transforms/SLPVectorizer/X86/horizontal-list.ll
43 ↗	(On Diff #191921)	But it does not mean the patch is correct; it means that we are again not quite right with the cost calculation, and that the previous implementation is not quite optimal. But the number of vectorized operations is reduced, which means that the patch introduces some regressions in the vectorization result. In some cases, it will result in significantly worse code.

arsenm added inline comments. Jun 14 2019, 7:40 AM

llvm/test/Transforms/SLPVectorizer/AMDGPU/horizontal-store.ll
21 ↗	(On Diff #191921)	GFX9 has min/max for <2 x i16>. Just about every 32-bit op is scalarized, except a few that can be treated as i64. This also shrinks the load, which is worse (but SLP for some reason usually does this, and largely why there is LoadStoreVectorizer)

Please can you rebase?

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗	(On Diff #191921)	Isn't [[TMP11:%.*]] already defined at line 22?

spatel marked an inline comment as done. Nov 1 2019, 5:14 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll
49 ↗	(On Diff #191921)	Yes - update_test_checks.py has a bug with input IR that contains explicit "%tmp`" names for values. Those conflict with the script's naming that uses "TMP`" for regex matching of unnamed values. It's only by chance/luck that this test is passing even without this patch.

Patch updated:
Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

ABataev added inline comments. Nov 1 2019, 7:46 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
82–86 ↗	(On Diff #227446)	This one is worse than it was before for SSE

spatel mentioned this in D68819: [Utils] Allow update_test_checks to check function arguments. Nov 1 2019, 7:48 AM

spatel marked an inline comment as done. Nov 1 2019, 8:22 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
82–86 ↗	(On Diff #227446)

Here are the SSE alternatives:

Without SLP (original IR):

movq	%xmm0, %rax
pshufd	$78, %xmm0, %xmm0       ## xmm0 = xmm0[2,3,0,1]
movq	%xmm0, %rcx
addq	%rax, %rcx
movq	%xmm1, %rax
pshufd	$78, %xmm1, %xmm0       ## xmm0 = xmm1[2,3,0,1]
movq	%xmm0, %rdx
addq	%rax, %rdx
movq	%rdx, %xmm1
movq	%rcx, %xmm0
punpcklqdq	%xmm1, %xmm0    ## xmm0 = xmm0[0],xmm1[0]

With SLP currently:

movdqa	%xmm0, %xmm2
punpcklqdq	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0]
punpckhqdq	%xmm1, %xmm0    ## xmm0 = xmm0[1],xmm1[1]
paddq	%xmm2, %xmm0

With this SLP patch:

pshufd	$78, %xmm0, %xmm2       ## xmm2 = xmm0[2,3,0,1]
paddq	%xmm2, %xmm0
pshufd	$78, %xmm1, %xmm2       ## xmm2 = xmm1[2,3,0,1]
paddq	%xmm1, %xmm2
punpcklqdq	%xmm2, %xmm0    ## xmm0 = xmm0[0],xmm2[0]

Ideally, we can get SLP to choose the shorter sequence (bypass treating this as a reduction).

I don't think we can ask instcombine to create that sequence because it requires creating semi-arbitrary shuffle instructions.

Or we can view this as a backend opportunity to reduce a shuffle-binop-shuffle sequence:

    t6: v2i64 = vector_shuffle<1,u> t2, undef:v2i64
  t7: v2i64 = add t6, t2
    t8: v2i64 = vector_shuffle<1,u> t4, undef:v2i64
  t9: v2i64 = add t8, t4
t10: v2i64 = vector_shuffle<0,2> t7, t9
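The two vectorized listings above compute the same result, {a[0]+a[1], b[0]+b[1]}, with different numbers of shuffles and adds. A minimal C++ model of the lane movement may make the comparison easier to follow. All names here are hypothetical; `V2` stands in for a v2i64 register, and the helpers model only the lane semantics of punpcklqdq, punpckhqdq, pshufd $78 and paddq as annotated in the listings.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a v2i64 register.
using V2 = std::array<std::uint64_t, 2>;

// punpcklqdq: take the low lane of each input.
V2 unpackLo(const V2 &X, const V2 &Y) { return {X[0], Y[0]}; }
// punpckhqdq: take the high lane of each input.
V2 unpackHi(const V2 &X, const V2 &Y) { return {X[1], Y[1]}; }
// pshufd $78: swap the two 64-bit halves.
V2 swapHalves(const V2 &X) { return {X[1], X[0]}; }
// paddq: lane-wise 64-bit add.
V2 add(const V2 &X, const V2 &Y) { return {X[0] + Y[0], X[1] + Y[1]}; }

// "With SLP currently": two shuffles feeding one add.
V2 haddCurrent(const V2 &A, const V2 &B) {
  return add(unpackLo(A, B), unpackHi(A, B));
}

// "With this SLP patch": one 2-way reduction per input, then a combining shuffle.
V2 haddPatched(const V2 &A, const V2 &B) {
  V2 RdxA = add(swapHalves(A), A); // lane 0 holds a[0] + a[1]
  V2 RdxB = add(swapHalves(B), B); // lane 0 holds b[0] + b[1]
  return unpackLo(RdxA, RdxB);
}
```

In this model the current form needs two shuffles and one add, while the patched form needs three shuffles and two adds, which matches the longer instruction sequence shown for the patch.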

In D59710#1730146, @spatel wrote:

Patch updated:
Rebased - no code changes, just some regression test fixups. There's a lot of noise here from a script change - see D68819. I can re-run the script on the test files prior to this patch to remove that if it is too distracting.

Please rerun, D69719 landed and the churn should be gone. Sorry for the noise.

ABataev added inline comments. Nov 1 2019, 11:17 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
82–86 ↗	(On Diff #227446)	Maybe, try to reduce 2 elements only after regular reduction did not work somehow?

Patch updated:
Rebased after D69719 (no real diffs from previous, but does not include unrelated test changes from scripted FileCheck lines); so this should be very similar to 2 revs back.

Patch updated:
Carve out an exception for forming 2-way reductions by threading the minimum elements as a parameter from vectorizeChainsInBlock(). This is more restrictive than necessary (it doesn't get all of the motivating examples), but it does not introduce any obvious regressions either.

RKSimon added inline comments. Nov 6 2019, 2:55 AM

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	SLM has really poor v2i64 add costs - so I'm surprised this happened - we may need SLM special handling in getArithmeticReductionCost?

ABataev added inline comments. Nov 6 2019, 7:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7138	The only problem with this solution is that it may increase compile time. It would be good to limit it strictly to trying to vectorize 2-value reductions. Thoughts?
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	I think it is a problem with the cost model; maybe the SLM cost model is not aware of the very expensive v2i64 add cost?

spatel marked an inline comment as done. Nov 6 2019, 8:04 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)

Taking a look...debug output shows:

SLP: Calculating cost for tree of size 1.
SLP: Adding cost -2 for bundle that starts with   %a0 = extractelement <2 x i64> %a, i32 0.
SLP: Spill Cost = 0.
SLP: Extract Cost = 0.
SLP: Total Cost = -2.
SLP: Adding cost 1 for reduction that starts with   %a0 = extractelement <2 x i64> %a, i32 0 (It is a splitting reduction)
SLP: Vectorizing horizontal reduction at cost:-1. (HorRdx)
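The numbers in this debug output combine by simple addition: the tree cost of -2 plus the reduction cost of 1 gives the reported total of -1, and a negative total is what lets the reduction go ahead. A minimal sketch of that arithmetic follows. It assumes the acceptance test is "total cost below the negated threshold" with a threshold of 0; both the rule and the default are my reading rather than something stated in this review, and both function names are hypothetical, not LLVM functions.

```cpp
// Hypothetical helper: sum of the vectorized-tree cost and the reduction cost.
int totalReductionCost(int TreeCost, int ReductionCost) {
  return TreeCost + ReductionCost;
}

// Assumed acceptance rule: vectorize only when the total is below -Threshold.
bool isModeledProfitable(int TotalCost, int Threshold = 0) {
  return TotalCost < -Threshold;
}
```

With the values from the log, totalReductionCost(-2, 1) is -1, so the 2-way reduction is accepted even though its modeled benefit is a single cost unit.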

RKSimon mentioned this in rGa091f7061068: [CostModel][X86] Improve add vXi64 + fadd vXf64 reduction tests for SLM. Nov 6 2019, 9:59 AM

spatel marked an inline comment as done. Nov 6 2019, 10:15 AM

spatel added inline comments.

llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
93 ↗	(On Diff #227972)	@RKSimon improved the SLM costs with rGa091f7061068, so that will remove this test diff from this patch. Based on the x86 asm, we actually do want to vectorize this example, but that's yet another cost model problem.

Patch updated:
Rebased to eliminate SLM distraction.

RKSimon mentioned this in rG1786047b9105: [X86] Fix SLM v2i64 ADD/Sub/CMPEQ instruction schedules. Nov 6 2019, 11:13 AM

RKSimon mentioned this in rGad70d5f39ae9: [X86] Fix SLM v2f64 ADD/MUL + FP BLEND/HADD instruction schedules.

Patch updated:
Limit the extra analysis to 2-way reductions only for efficiency (to save compile-time). This converts the more general minimum-width parameter from the earlier rev to a boolean flag.
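The size checks that this revision adds to tryToReduce() can be modeled in isolation. The sketch below is a standalone approximation, not the LLVM code: `powerOf2Floor` and `log2Floor` are hypothetical local replacements for the `PowerOf2Floor` and `Log2_32` calls that appear in the diff, and the two predicate names are invented for illustration.

```cpp
// Largest power of 2 that is <= N. Assumes N > 0.
unsigned powerOf2Floor(unsigned N) {
  unsigned P = 1;
  while (P <= N / 2)
    P *= 2;
  return P;
}

// Floor of log2(N). Assumes N > 0.
unsigned log2Floor(unsigned N) {
  unsigned L = 0;
  while (N > 1) {
    N /= 2;
    ++L;
  }
  return L;
}

// Models the early exits: in 2-way mode only exactly 2 values qualify;
// otherwise the original lower limit of 4 values still applies.
bool passesSizeGate(unsigned NumReducedVals, bool Try2WayRdx) {
  if (Try2WayRdx && NumReducedVals != 2)
    return false;
  unsigned MinRdxVals = Try2WayRdx ? 2 : 4;
  return NumReducedVals >= MinRdxVals;
}

// Models the first evaluation of the loop condition (i == 0):
// ReduxWidth must exceed MinRdxWidth, which is 1 in 2-way mode and 2 otherwise.
bool entersReductionLoop(unsigned NumReducedVals, bool Try2WayRdx) {
  if (!passesSizeGate(NumReducedVals, Try2WayRdx))
    return false;
  unsigned ReduxWidth = powerOf2Floor(NumReducedVals);
  unsigned MinRdxWidth = log2Floor(Try2WayRdx ? 2 : 4);
  return ReduxWidth > MinRdxWidth;
}
```

Under this model a 2-value candidate is rejected on the default path and accepted only when the flag is set, and a 4-value candidate is rejected in 2-way mode, which is how the extra analysis stays limited to the final attempt.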

spatel marked an inline comment as done. Nov 6 2019, 12:09 PM

xbolva00 added a subscriber: xbolva00. Nov 6 2019, 12:34 PM

xbolva00 added inline comments.

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120–121	bool Try2WayRdx = false ?

spatel marked 2 inline comments as done. Nov 6 2019, 1:13 PM

spatel added inline comments.

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
120–121	I don't think it's a good thing in general to have bool args with defaults...it makes reading the code harder. But this is all a hack at this point anyway, so sure, let's reduce the diffs. :)

Patch updated:
Add default for new bool parameter to eliminate diffs for existing calls.

This revision is now accepted and ready to land. Nov 6 2019, 1:19 PM

Closed by commit rG7ff57705ba19: [SLP] allow forming 2-way reduction patterns (authored by spatel). Nov 7 2019, 4:11 AM

This revision was automatically updated to reflect the committed changes.

Reopening - this uncovered existing ways to miscompile, may have created new ways to miscompile, and caused major perf regressions. In the running for best patch ever. :)

Reverted here:
rG714aabacfb0f9b372cf230f1b7113e3ebd0e661d

This revision is now accepted and ready to land. Nov 21 2019, 5:29 AM

spatel added a reverting change: D70607: [x86] make SLM extract vector element more expensive than default. Nov 22 2019, 10:51 AM

spatel added a reverting change: rG5c166f1d1969: [x86] make SLM extract vector element more expensive than default. Nov 27 2019, 11:13 AM

Phab was tricked into saying that D70607 was a r3v3rt of this patch. It was not.

@echristo - any hope of getting tests that show the miscompile and/or perf problems raised by this patch?

spatel planned changes to this revision. Dec 1 2019, 8:02 AM

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

This revision is now accepted and ready to land. Dec 1 2019, 2:00 PM

spatel requested review of this revision. Dec 1 2019, 2:01 PM

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

In D59710#1765236, @ABataev wrote:

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.

But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

In D59710#1765299, @spatel wrote:

In D59710#1765236, @ABataev wrote:

In D59710#1764582, @spatel wrote:

Patch updated:
Try a different limitation on the 2-way reduction patterns that we consider as candidates. Here I've limited it to compare (boolean type) reductions to avoid regressions on math reductions. This catches the motivating cmp cases from PR39665 and doesn't seem to interfere with any existing cmp vectorization regression tests.

It looks mostly like a hack. And I assume in some cases it still may lead to problems with performance. Better to try to fix the cost model for x86, I think.

I agree it's a hack. If you go back to the very first draft of this patch using the 'History' tab, we have the ideal code patch.

But I still don't see how we would edit the cost model to get around the regressions seen in that first attempt. The reductions seen here are profitable. The extra reductions without the 'cmp' hack are also profitable, but they are maybe just not as profitable as some other vectorization strategy. We don't seem to have the mechanism to try multiple transforms and choose the best in SLP (IIUC, this is what VPlan will allow).

The reduction is profitable because of the cost model. It is the same problem, the scalar load+add combination in many cases has a too high cost, I think. And because of this problem, the reduction for 2 elements looks profitable in some cases.

arsenm resigned from this revision. Feb 13 2020, 2:51 PM

Abandoning. I don't see a way forward for SLP on this problem. Neither the theoretically correct patch nor practically-limited variations are acceptable.

spatel mentioned this in D82474: [VectorCombine] try to form vector compare and binop to eliminate scalar ops. Jun 24 2020, 9:07 AM

spatel mentioned this in D82602: [SelectionDAG] don't split branch on logic-of-vector-compares. Jun 26 2020, 5:14 AM

spatel mentioned this in rGb6315aee5b42: [VectorCombine] try to form vector compare and binop to eliminate scalar ops. Jun 29 2020, 8:04 AM

spatel mentioned this in D87772: [SLP] sort candidates to increase chance of optimal compare reduction. Sep 16 2020, 9:35 AM

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h	5 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	37 lines
llvm/test/Feature/weak_constant.ll	2 lines
llvm/test/Transforms/SLPVectorizer/X86/reduction2.ll	19 lines

Diff 228129

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

[107 lines not shown; enclosing context: private:]
   bool vectorizeStoreChains(slpvectorizer::BoUpSLP &R);

   /// Vectorize the index computations of the getelementptr instructions
   /// collected in GEPs.
   bool vectorizeGEPIndices(BasicBlock *BB, slpvectorizer::BoUpSLP &R);

   /// Try to find horizontal reduction or otherwise vectorize a chain of binary
   /// operators.
+  /// \p Try2WayRdx specializes the analysis to only attempt a 2-element
+  /// reduction.
   bool vectorizeRootInstruction(PHINode *P, Value *V, BasicBlock *BB,
                                 slpvectorizer::BoUpSLP &R,
-                                TargetTransformInfo *TTI);
+                                TargetTransformInfo *TTI,
+                                bool Try2WayRdx = false);

   /// Try to vectorize trees that start at insertvalue instructions.
   bool vectorizeInsertValueInst(InsertValueInst *IVI, BasicBlock *BB,
                                 slpvectorizer::BoUpSLP &R);

   /// Try to vectorize trees that start at insertelement instructions.
   bool vectorizeInsertElementInst(InsertElementInst *IEI, BasicBlock *BB,
                                   slpvectorizer::BoUpSLP &R);
[28 lines not shown]

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[6,469 lines not shown; enclosing context: while (!Stack.empty()) {]
       // NextV is an extra argument for TreeN (its parent operation).
       markExtraArg(Stack.back(), NextV);
     }
     return true;
   }

   /// Attempt to vectorize the tree found by
   /// matchAssociativeReduction.
-  bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
+  bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI, bool Try2WayRdx) {
     if (ReducedVals.empty())
       return false;

     // If there is a sufficient number of reduction values, reduce
     // to a nearby power-of-2. Can safely generate oversized
     // vectors and rely on the backend to split them to legal sizes.
     unsigned NumReducedVals = ReducedVals.size();
-    if (NumReducedVals < 4)
+    if (Try2WayRdx && NumReducedVals != 2)
+      return false;
+    unsigned MinRdxVals = Try2WayRdx ? 2 : 4;
+    if (NumReducedVals < MinRdxVals)
       return false;

     unsigned ReduxWidth = PowerOf2Floor(NumReducedVals);
+    unsigned MinRdxWidth = Log2_32(MinRdxVals);
     Value *VectorizedTree = nullptr;

     // FIXME: Fast-math-flags should be set based on the instructions in the
     // reduction (not all of 'fast' are required).
     IRBuilder<> Builder(cast<Instruction>(ReductionRoot));
     FastMathFlags Unsafe;
     Unsafe.setFast();
     Builder.setFastMathFlags(Unsafe);
     unsigned i = 0;

     BoUpSLP::ExtraValueToDebugLocsMap ExternallyUsedValues;
     // The same extra argument may be used several time, so log each attempt
     // to use it.
     for (auto &Pair : ExtraArgs) {
       assert(Pair.first && "DebugLoc must be set.");
       ExternallyUsedValues[Pair.second].push_back(Pair.first);
     }
     // The reduction root is used as the insertion point for new instructions,
     // so set it as externally used to prevent it from being deleted.
     ExternallyUsedValues[ReductionRoot];
     SmallVector<Value *, 16> IgnoreList;
     for (auto &V : ReductionOps)
       IgnoreList.append(V.begin(), V.end());
-    while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
+    while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > MinRdxWidth) {
       auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
       V.buildTree(VL, ExternallyUsedValues, IgnoreList);
       Optional<ArrayRef<unsigned>> Order = V.bestOrder();
       // TODO: Handle orders of size less than number of elements in the vector.
       if (Order && Order->size() == VL.size()) {
         // TODO: reorder tree nodes without tree rebuilding.
         SmallVector<Value *, 4> ReorderedOps(VL.size());
         llvm::transform(*Order, ReorderedOps.begin(),
[309 lines not shown]
 /// attempted.
 /// \returns true if a horizontal reduction was matched and reduced or operands
 /// of one of the binary instruction were vectorized.
 /// \returns false if a horizontal reduction was not matched (or not possible)
 /// or no vectorization of any binary operation feeding \a Root instruction was
 /// performed.
 static bool tryToVectorizeHorReductionOrInstOperands(
     PHINode *P, Instruction *Root, BasicBlock *BB, BoUpSLP &R,
-    TargetTransformInfo *TTI,
+    TargetTransformInfo *TTI, bool Try2WayRdx,
     const function_ref<bool(Instruction *, BoUpSLP &)> Vectorize) {
   if (!ShouldVectorizeHor)
     return false;

   if (!Root)
     return false;

   if (Root->getParent() != BB || isa<PHINode>(Root))
[14 lines not shown; enclosing context: while (!Stack.empty()) {]
     Instruction *Inst;
     unsigned Level;
     std::tie(Inst, Level) = Stack.pop_back_val();
     auto *BI = dyn_cast<BinaryOperator>(Inst);
     auto *SI = dyn_cast<SelectInst>(Inst);
     if (BI || SI) {
       HorizontalReduction HorRdx;
       if (HorRdx.matchAssociativeReduction(P, Inst)) {
-        if (HorRdx.tryToReduce(R, TTI)) {
+        if (HorRdx.tryToReduce(R, TTI, Try2WayRdx)) {
           Res = true;
           // Set P to nullptr to avoid re-analysis of phi node in
           // matchAssociativeReduction function unless this is the root node.
           P = nullptr;
           continue;
         }
       }
       if (P && BI) {
[26 lines not shown; enclosing context: if (++Level < RecursionMaxDepth)]
             if (!isa<PHINode>(I) && !R.isDeleted(I) && I->getParent() == BB)
               Stack.emplace_back(I, Level);
   }
   return Res;
 }

 bool SLPVectorizerPass::vectorizeRootInstruction(PHINode *P, Value *V,
                                                  BasicBlock *BB, BoUpSLP &R,
-                                                 TargetTransformInfo *TTI) {
+                                                 TargetTransformInfo *TTI,
+                                                 bool Try2WayRdx) {
   if (!V)
     return false;
   auto *I = dyn_cast<Instruction>(V);
   if (!I)
     return false;

   if (!isa<BinaryOperator>(I))
     P = nullptr;
   // Try to match and vectorize a horizontal reduction.
   auto &&ExtraVectorization = [this](Instruction *I, BoUpSLP &R) -> bool {
     return tryToVectorize(I, R);
   };
-  return tryToVectorizeHorReductionOrInstOperands(P, I, BB, R, TTI,
+  return tryToVectorizeHorReductionOrInstOperands(P, I, BB, R, TTI, Try2WayRdx,
                                                   ExtraVectorization);
 }

 bool SLPVectorizerPass::vectorizeInsertValueInst(InsertValueInst *IVI,
                                                  BasicBlock *BB, BoUpSLP &R) {
   const DataLayout &DL = BB->getModule()->getDataLayout();
   if (!R.canMapToVector(IVI->getType(), DL))
     return false;
[179 lines not shown; enclosing context: if (it->use_empty() && (it->getType()->isVoidTy() || isa<CallInst>(it) ||]
       }
     }

     if (isa<InsertElementInst>(it) || isa<CmpInst>(it) ||
         isa<InsertValueInst>(it))
       PostProcessInstructions.push_back(&*it);
   }

+  // Make a final attempt to match a 2-way reduction if nothing else worked.
+  // We do not try this above because it may interfere with other vectorization
+  // attempts.
+  // TODO: The constraints are copied from the above call to
+  // vectorizeRootInstruction(), but that might be too restrictive?
+  BasicBlock::iterator LastInst = --BB->end();
+  if (!Changed && LastInst->use_empty() &&
+      (LastInst->getType()->isVoidTy() || isa<CallInst>(LastInst) ||
+       isa<InvokeInst>(LastInst))) {
+    if (ShouldStartVectorizeHorAtStore || !isa<StoreInst>(LastInst)) {
+      for (auto *V : LastInst->operand_values()) {
+        Changed |= vectorizeRootInstruction(nullptr, V, BB, R, TTI,
+                                            /* Try2WayRdx */ true);
+      }
+    }
+  }
+
   return Changed;
 }

 bool SLPVectorizerPass::vectorizeGEPIndices(BasicBlock *BB, BoUpSLP &R) {
   auto Changed = false;
   for (auto &Entry : GEPs) {
     // If the getelementptr list has fewer than two elements, there's nothing
     // to do.
[115 lines not shown]

llvm/test/Feature/weak_constant.ll

 ; RUN: opt < %s -O3 -S > %t
-; RUN: grep undef %t | count 1
+; RUN: grep undef %t | count 2
 ; RUN: grep 5 %t | count 1
 ; RUN: grep 7 %t | count 1
 ; RUN: grep 9 %t | count 1

 %0 = type { i32, i32 } ; type %0
 @a = weak constant i32 undef ; <i32*> [#uses=1]
 @b = weak constant i32 5 ; <i32*> [#uses=1]
 @c = weak constant %0 { i32 7, i32 9 } ; <%0*> [#uses=1]
[28 lines not shown]

llvm/test/Transforms/SLPVectorizer/X86/reduction2.ll


	; <label>:11 ; preds = %1			; <label>:11 ; preds = %1
	ret double %9			ret double %9
	}			}

	define i1 @two_wide_fcmp_reduction(<2 x double> %a0) {			define i1 @two_wide_fcmp_reduction(<2 x double> %a0) {
	; CHECK-LABEL: @two_wide_fcmp_reduction(			; CHECK-LABEL: @two_wide_fcmp_reduction(
	; CHECK-NEXT: [[A:%.*]] = fcmp ogt <2 x double> [[A0:%.*]], <double 1.000000e+00, double 1.000000e+00>			; CHECK-NEXT: [[A:%.*]] = fcmp ogt <2 x double> [[A0:%.*]], <double 1.000000e+00, double 1.000000e+00>
	; CHECK-NEXT: [[B:%.*]] = extractelement <2 x i1> [[A]], i32 0			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[A]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[C:%.*]] = extractelement <2 x i1> [[A]], i32 1			; CHECK-NEXT: [[BIN_RDX:%.*]] = and <2 x i1> [[A]], [[RDX_SHUF]]
	; CHECK-NEXT: [[D:%.*]] = and i1 [[B]], [[C]]			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
	; CHECK-NEXT: ret i1 [[D]]			; CHECK-NEXT: ret i1 [[TMP1]]
	;			;
	%a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>			%a = fcmp ogt <2 x double> %a0, <double 1.0, double 1.0>
	%b = extractelement <2 x i1> %a, i32 0			%b = extractelement <2 x i1> %a, i32 0
	%c = extractelement <2 x i1> %a, i32 1			%c = extractelement <2 x i1> %a, i32 1
	%d = and i1 %b, %c			%d = and i1 %b, %c
	ret i1 %d			ret i1 %d
	}			}
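
The updated CHECK lines for @two_wide_fcmp_reduction replace two extracts and a scalar `and` with a shuffle, a vector `and`, and a single extract. As a rough illustration of why the two forms compute the same value, here is a minimal C++ model. `Mask2`, `scalarAnd`, `shuffleReduceAnd`, and `formsAgree` are hypothetical names invented for this sketch and are not part of LLVM; the undef shuffle lane is modeled as `false`, which is an assumption that does not affect the result because that lane is never read.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the IR type <2 x i1>; not an LLVM type.
using Mask2 = std::array<bool, 2>;

// Old form: extract both lanes, then combine them with a scalar 'and'.
bool scalarAnd(const Mask2 &A) {
  bool B = A[0]; // extractelement <2 x i1> %a, i32 0
  bool C = A[1]; // extractelement <2 x i1> %a, i32 1
  return B && C; // and i1 %b, %c
}

// New form: move lane 1 into lane 0, 'and' the vectors, extract lane 0.
bool shuffleReduceAnd(const Mask2 &A) {
  // shufflevector with mask <1, undef>; the undef lane is modeled as false
  // (an assumption of this sketch) and is never read below.
  Mask2 RdxShuf = {A[1], false};
  // and <2 x i1> %a, %rdx.shuf (lane-wise)
  Mask2 BinRdx = {A[0] && RdxShuf[0], A[1] && RdxShuf[1]};
  return BinRdx[0]; // extractelement <2 x i1> %bin.rdx, i32 0
}

// Exhaustively compare both forms over all four possible inputs.
bool formsAgree() {
  for (int I = 0; I < 4; ++I) {
    Mask2 A = {(I & 1) != 0, (I & 2) != 0};
    if (scalarAnd(A) != shuffleReduceAnd(A))
      return false;
  }
  return true;
}
```

Calling `formsAgree()` checks the equivalence for every input. Whether the shuffle form is actually cheaper is a separate question that the target's cost model has to answer.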

	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[FNEG]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> undef, double [[FNEG]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[C:%.*]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[C:%.*]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]			; CHECK-NEXT: [[TMP4:%.*]] = fsub <2 x double> [[TMP1]], [[TMP3]]
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[MUL]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[MUL]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[MUL]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]			; CHECK-NEXT: [[TMP7:%.*]] = fdiv <2 x double> [[TMP4]], [[TMP6]]
	; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x double> [[TMP7]], i32 1			; CHECK-NEXT: [[TMP8:%.*]] = fcmp olt <2 x double> [[TMP7]], <double 0x3EB0C6F7A0B5ED8D, double 0x3EB0C6F7A0B5ED8D>
	; CHECK-NEXT: [[CMP:%.*]] = fcmp olt double [[TMP8]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <2 x i1> [[TMP8]], <2 x i1> undef, <2 x i32> <i32 1, i32 undef>
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x double> [[TMP7]], i32 0			; CHECK-NEXT: [[BIN_RDX:%.*]] = and <2 x i1> [[TMP8]], [[RDX_SHUF]]
	; CHECK-NEXT: [[CMP4:%.*]] = fcmp olt double [[TMP9]], 0x3EB0C6F7A0B5ED8D			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <2 x i1> [[BIN_RDX]], i32 0
	; CHECK-NEXT: [[OR_COND:%.*]] = and i1 [[CMP]], [[CMP4]]			; CHECK-NEXT: br i1 [[TMP9]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]
	; CHECK-NEXT: br i1 [[OR_COND]], label [[CLEANUP:%.*]], label [[LOR_LHS_FALSE:%.*]]
	; CHECK: lor.lhs.false:			; CHECK: lor.lhs.false:
	; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>			; CHECK-NEXT: [[TMP10:%.*]] = fcmp ule <2 x double> [[TMP7]], <double 1.000000e+00, double 1.000000e+00>
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i1> [[TMP10]], i32 0
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i1> [[TMP10]], i32 1
	; CHECK-NEXT: [[NOT_OR_COND9:%.*]] = or i1 [[TMP11]], [[TMP12]]			; CHECK-NEXT: [[NOT_OR_COND9:%.*]] = or i1 [[TMP11]], [[TMP12]]
	; CHECK-NEXT: ret i1 [[NOT_OR_COND9]]			; CHECK-NEXT: ret i1 [[NOT_OR_COND9]]
	; CHECK: cleanup:			; CHECK: cleanup:
	; CHECK-NEXT: ret i1 false			; CHECK-NEXT: ret i1 false

Revision Contents

Diff 228129
