This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize bit-parallel operations with SWAR.
AbandonedPublic

Authored by courbet on Jun 28 2018, 8:12 AM.

Details

Reviewers
RKSimon
ABataev
Summary

Consider the following code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

The two load/stores do not vectorize:

mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
mov eax, dword ptr [rsi + 4]
mov dword ptr [rdi + 4], eax
mov rax, rdi
ret

This is because the SLP vectorizer only considers 4xi32=i128 as a candidate,
because there exists such a vector register. It never considers 2xi32=i64,
because the only register that exists for this is a GPR.
However, all operations that only manipulate values as arrays of
bits (e.g. Load, Store, Bitcast, and potentially Xor/And/Or) do not
strictly require vector registers. Let's call these bit-parallel
operations.

This change lets the SLP vectorizer vectorize trees composed of only bit-parallel operations using the native GPR size.

The example above will vectorize to:

mov rax, qword ptr [rsi]
mov qword ptr [rdi], rax
mov rax, rdi
ret

For now this only handles the most trivial bit-parallel instructions (Load, Store, Bitcast), and only homogeneous types (it will not vectorize <4xi8, 1xi32>), but this can be added later.

Diff Detail

Event Timeline

courbet created this revision.Jun 28 2018, 8:12 AM
courbet edited the summary of this revision. (Show Details)Jun 28 2018, 8:19 AM
courbet added a subscriber: chandlerc.

Thanks for the pointers, Sanjay! I'll use this terminology and add a link to it in the code.

courbet updated this revision to Diff 153439.Jun 28 2018, 11:22 PM

Add pointers to SWAR, rename bit-parallel.ll to swar.ll.

courbet retitled this revision from [RFC][SLP] Vectorize bit-parallel operations with GPR. to [SLP] Vectorize bit-parallel operations with SWAR..Jun 28 2018, 11:22 PM

If we're only ever going to be using load/store + and/or/xor ops, I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

My interest in SWAR patterns was mainly for bitfield arithmetic cases such as PR34526, which I figured we could handle in InstCombine with some suitable overflow/demanded-bits magic.

  1. Do all targets support this kind of transformation? Are they aware of the transformation of operations on small vectors into operations on a GPR?
  2. You need to update the cost model for this kind of transformation.
  3. I think this should not be part of the SLPVectorizer; it looks like it belongs in InstCombine.

About the division of labor:

  1. I don't think instcombine can handle any of the cases shown here because it doesn't have the machinery to combine multiple independent values. So SLP or DAG are the options AFAIK.
  2. Instcombine could handle something like the xor example from https://bugs.llvm.org/show_bug.cgi?id=32119 , but it's probably better suited to AggressiveInstCombine, because that's not a fixed pattern (the matcher has to grow as the width of the value grows).

If we're only ever going to be using load/store + and/or/xor ops, I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

There are other advantages that we get from reusing the infrastructure of the SLP vectorizer. Besides load/stores and logicals we also get shuffles for free. Consider this code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S copy_2xi32(const S& s) {
  S result;
  result.a = s.b;
  result.b = s.a;
  return result;
}

Without the change this lowers to:

copy_2xi32(S): # @copy_2xi32(S)
  mov eax, dword ptr [rsp + 12]
  mov dword ptr [rdi], eax
  mov eax, dword ptr [rsp + 8]
  mov dword ptr [rdi + 4], eax
  mov rax, rdi
  ret

With the change this lowers to:

0000000000000000 <_Z10copy_2xi32RK1S>:
   0:	f3 0f 7e 06          	movq   (%rsi),%xmm0
   4:	66 0f 70 c0 e1       	pshufd $0xe1,%xmm0,%xmm0
   9:	66 0f d6 07          	movq   %xmm0,(%rdi)
   d:	48 89 f8             	mov    %rdi,%rax
  10:	c3                   	retq
  1. Do all targets support this kind of transformation? Are they aware of the transformation of operations on small vectors into operations on a GPR?

For this change I've restricted the range of operations to load/store and bitcast, which I presume are guaranteed to work efficiently on all targets. Then, if the target can shuffle efficiently, the SLP vectorizer might decide to also do shuffles.

  2. You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer; do you have any pointers to documentation on how to do this?

  3. I think this should not be part of the SLPVectorizer; it looks like it belongs in InstCombine.

Would InstCombine also be able to do things like shuffles? I like how we can leverage everything that's been done in SLP and get all that functionality for free.

Currently, SLPVectorizer does not generate vector types smaller than TargetTransformInfo::getMinVectorRegisterBitWidth(). This is 128 on many targets, including x86. But that doesn't really make sense, in general; even if a target doesn't have 64-bit vector registers, it can emulate them using 128-bit vector registers. The loop vectorizer frequently takes advantage of this; the SLP vectorizer should also take advantage of this, independent of anything else.

There's also the possibility of emitting "vector" operations using GPRs. This generally makes sense; it's basically the same transform even if the available instructions are more limited. But this patch doesn't really do that: it emits IR operations using vector types. SelectionDAG legalization will generally prefer to emit vector operations to vector registers, if they're available, or just scalarize if there aren't any vector registers. There's basically one exception to that rule, which you've stumbled across; DAGCombine will transform a vector or float load+store into an integer load+store, if the loaded value doesn't have any other uses. But we shouldn't rely on that, I think; if we're doing cost modeling based on the cost of integer operations, we should explicitly emit integer operations in IR.

Thanks for the explanation; this is very useful.

courbet planned changes to this revision.Jul 2 2018, 1:40 AM

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.
B - Do this in DAG, inside or near to LoadCombine.

I'll explore both solutions and create a patch with (B) so that we can compare.

  2. You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer; do you have any pointers to documentation on how to do this?

Actually, since I only do load/store now, I think X86TTIImpl::getMemoryOpCost already covers it. Am I right?

courbet updated this revision to Diff 153721.Jul 2 2018, 8:33 AM

Emit scalar values instead of vector values for SWAR.

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.

I've updated the change with a crude implementation.

Shuffles and extracts are disabled because we can no longer rely on the
DAG to transform shuffles and extracts into the appropriate operations.
If we want to support them, we will have to reimplement them as integer
operations.

So now that I've done this, I think I understand what @efriedma was saying: another way to put it is that (taking X86 as an example) 128 is not the smallest vector, because we can do partial loads/stores.

B - Do this in DAG, inside or near to LoadCombine.

Actually, I don't think the current case can be handled in the same way as MatchLoadCombine: in MatchLoadCombine, the "or" instruction provides a way to link the stores together. In the case of two completely independent load/stores with nothing in between, as in the two_i32 test, there is nothing linking the instructions that could provide an entry point to try to merge them.

128 is not the smallest vector, because we can do partial load/stores

Essentially, yes.

Actually, I don't think the current case can be handled in the same way as MatchLoadCombine: in MatchLoadCombine, the "or" instruction provides a way to link the stores together.

We have code to do this sort of merging in DAGCombiner::MergeConsecutiveStores. But it misses cases like the ones in your patch because combiner-global-alias-analysis is off by default. (I don't remember the full history of that, but IIRC the compile-time penalty was too large.)

Hm, actually I had a look at MergeConsecutiveStores, and it can merge non-vector and/or heterogeneously-sized values (D52643). It won't handle my case, though, because it visits loads/stores in chain order and considers any store to potentially alias the following loads:

This gets merged (the chain is load-load-store-store):

S PartialCopy(const S& s) {
  S result;
  const auto ta = s.a;
  const auto tb = s.b;
  result.a = ta;
  result.b = tb;
  return result;
}

But not this (the chain is load-store-load-store):

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

Or did I miss something?

Yes, like I said, your original testcase doesn't get merged by DAGCombine unless "-combiner-global-alias-analysis" is enabled (which it isn't, by default).

Oh, I see, thanks. I missed the fact that using this flag will actually reorder the load/stores in the chain before entering CombineLoadStores (I was looking for analysis usage from CombineLoadStores).

courbet abandoned this revision.Oct 3 2018, 5:01 AM

The flag Eli pointed to is sufficient for my needs, so I'm going to abandon this revision for now.