This is an archive of the discontinued LLVM Phabricator instance.

Differential D48725

[SLP] Vectorize bit-parallel operations with SWAR.
Abandoned, Public

Authored by courbet on Jun 28 2018, 8:12 AM.

Details

Reviewers

RKSimon
ABataev

Summary

Consider the following code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

The two load/stores do not vectorize:

mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
mov eax, dword ptr [rsi + 4]
mov dword ptr [rdi + 4], eax
mov rax, rdi
ret

This is because the SLP vectorizer only considers 4xi32=i128 as a candidate,
because there exists such a vector register. It never considers 2xi32=i64,
because the only register that exists for this is a GPR.
However, all operations that only manipulate values as arrays of
bits (e.g. Load, Store, Bitcast, and potentially Xor/And/Or) do not
strictly require vector registers. Let's call these bit-parallel
operations.

This change lets the SLP vectorizer vectorize trees composed of only bit-parallel operations using the native GPR size.

The example above will vectorize to:

mov rax, qword ptr [rsi]
mov qword ptr [rdi], rax
mov rax, rdi
ret

For now this only handles the most trivial bit-parallel instructions (Load, Store, Bitcast) and only homogeneous types (it will not vectorize <4xi8, 1xi32>); support for the remaining cases can be added later.
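The equivalence the summary relies on can be checked at the source level. The following is a minimal sketch, not code from the patch: the names PartialCopyScalar and PartialCopySwar are hypothetical, and it assumes the usual layout in which a and b occupy the first 8 bytes of S with no padding between them (enforced by the static_assert).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

// The SWAR form below relies on a and b being adjacent at the start of S.
static_assert(offsetof(S, a) == 0 && offsetof(S, b) == 4,
              "a and b must be contiguous");

// Two independent 32-bit load/store pairs, as in the original example.
// The remaining fields are zero-initialized so the result is fully defined.
S PartialCopyScalar(const S &s) {
  S result{};
  result.a = s.a;
  result.b = s.b;
  return result;
}

// The same copy expressed as one 64-bit load and one 64-bit store.
// Only bits are moved, so no vector register is required.
S PartialCopySwar(const S &s) {
  S result{};
  uint64_t bits;
  std::memcpy(&bits, &s, sizeof(bits));      // one 64-bit load
  std::memcpy(&result, &bits, sizeof(bits)); // one 64-bit store
  return result;
}
```

Both functions produce the same a and b for any input, which is what makes the transform legal for pure load/store trees.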

Diff Detail

Repository

rL LLVM

Build Status

Buildable 19938
Build 19938: arc lint + arc unit

Event Timeline

courbet created this revision. Jun 28 2018, 8:12 AM

Herald added a subscriber: llvm-commits. Jun 28 2018, 8:12 AM

courbet edited the summary of this revision. Jun 28 2018, 8:19 AM

courbet added a subscriber: chandlerc.

lebedev.ri added a subscriber: lebedev.ri. Jun 28 2018, 8:37 AM

This topic has come up in bugzilla as 'SWAR':
https://bugs.llvm.org/show_bug.cgi?id=32119
https://bugs.llvm.org/show_bug.cgi?id=34526

rkruppe added a subscriber: rkruppe. Jun 28 2018, 9:51 AM

In D48725#1146838, @spatel wrote:

This topic has come up in bugzilla as 'SWAR':
https://bugs.llvm.org/show_bug.cgi?id=32119
https://bugs.llvm.org/show_bug.cgi?id=34526

Thanks for the pointers, Sanjay! I'll use this terminology and add a link to this in the code.

Add pointers to SWAR, rename bit-parallel.ll to swar.ll.

courbet retitled this revision from "[RFC][SLP] Vectorize bit-parallel operations with GPR." to "[SLP] Vectorize bit-parallel operations with SWAR." Jun 28 2018, 11:22 PM

Harbormaster completed remote builds in B19872: Diff 153439. Jun 28 2018, 11:24 PM

If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

My interest in SWAR patterns was mainly in bitfield arithmetic cases such as PR34526, which I figured we could perform in InstCombine with some suitable overflow/demandedbits magic.
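To make the restriction to and/or/xor concrete: bitwise operations never move information between bit positions, so they act lane-wise on a packed integer for free, whereas arithmetic does not. The sketch below is illustrative only; Pack, XorLanes and AddWide are hypothetical helper names, and the lane layout (high lane in bits [63-32], low lane in bits [31-0]) is an assumption made for the example.

```cpp
#include <cstdint>

// Pack two 32-bit lanes into one 64-bit value: hi in bits [63-32],
// lo in bits [31-0].
constexpr uint64_t Pack(uint32_t hi, uint32_t lo) {
  return (static_cast<uint64_t>(hi) << 32) | lo;
}

// One 64-bit XOR computes both 32-bit XORs at once.
constexpr uint64_t XorLanes(uint64_t x, uint64_t y) { return x ^ y; }

// A plain 64-bit add is NOT lane-wise: a carry out of the low lane
// leaks into the high lane.
constexpr uint64_t AddWide(uint64_t x, uint64_t y) { return x + y; }

// XOR matches the per-lane result.
static_assert(XorLanes(Pack(0xF0F0F0F0u, 0x12345678u),
                       Pack(0x0F0F0F0Fu, 0x12345678u)) ==
                  Pack(0xFFFFFFFFu, 0u),
              "xor is lane-wise");

// Per-lane addition would give Pack(2, 0); the wide add gives Pack(3, 0).
static_assert(AddWide(Pack(1u, 0xFFFFFFFFu), Pack(1u, 1u)) == Pack(3u, 0u),
              "the carry crosses the lane boundary");
```

This is why arithmetic cases need overflow or demanded-bits reasoning before they can be treated the same way.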

Do all targets support this kind of transformation? Are they aware of the transformation of small-vector operations into GPR operations?
You need to update the cost model for this kind of transformation.
I think this should not be part of SLPVectorizer; it looks like it belongs in InstCombiner.

About the division of labor:

I don't think instcombine can handle any of the cases shown here because it doesn't have the machinery to combine multiple independent values. So SLP or DAG are the options AFAIK.
Instcombine could handle something like the xor example from https://bugs.llvm.org/show_bug.cgi?id=32119 , but it's probably better suited for AggressiveInstCombine because that's not a fixed pattern (we have to increase the matcher as the width of the value grows).

In D48725#1147883, @RKSimon wrote:

If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.

There are other advantages that we get from reusing the infrastructure of the SLP vectorizer. Besides load/stores and logicals we also get shuffles for free. Consider this code:

struct S {
  int32_t a;
  int32_t b;
  int64_t c;
  int32_t d;
};

S copy_2xi32(const S& s) {
  S result;
  result.a = s.b;
  result.b = s.a;
  return result;
}

Without the change this lowers to:

copy_2xi32(S): # @copy_2xi32(S)
  mov eax, dword ptr [rsp + 12]
  mov dword ptr [rdi], eax
  mov eax, dword ptr [rsp + 8]
  mov dword ptr [rdi + 4], eax
  mov rax, rdi
  ret

With the change this lowers to:

0000000000000000 <_Z10copy_2xi32RK1S>:
   0:	f3 0f 7e 06          	movq   (%rsi),%xmm0
   4:	66 0f 70 c0 e1       	pshufd $0xe1,%xmm0,%xmm0
   9:	66 0f d6 07          	movq   %xmm0,(%rdi)
   d:	48 89 f8             	mov    %rdi,%rax
  10:	c3                   	retq

In D48725#1147898, @ABataev wrote:

Do all targets support this kind of transformation? Are they aware of the transformation of small-vector operations into GPR operations?

For this change I've restricted the range of operations to load/store and bitcast, which I presume is guaranteed to work efficiently on all targets. Then if the target can shuffle efficiently, the SLP vectorizer might decide to also do shuffles.

You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer; do you have any pointers to documentation on how to do this?

I think this should not be part of SLPVectorizer; it looks like it belongs in InstCombiner.

Would InstCombine also be able to do things like shuffles? I like how we can leverage everything that has been done in SLP to get all that functionality for free.

Currently, SLPVectorizer does not generate vector types smaller than TargetTransformInfo::getMinVectorRegisterBitWidth(). This is 128 on many targets, including x86. But that doesn't really make sense, in general; even if a target doesn't have 64-bit vector registers, it can emulate them using 128-bit vector registers. The loop vectorizer frequently takes advantage of this; the SLP vectorizer should also take advantage of this, independent of anything else.

There's also the possibility of emitting "vector" operations using GPRs. This generally makes sense; it's basically the same transform even if the available instructions are more limited. But this patch doesn't really do that: it emits IR operations using vector types. SelectionDAG legalization will generally prefer to emit vector operations to vector registers, if they're available, or just scalarize if there aren't any vector registers. There's basically one exception to that rule, which you've stumbled across; DAGCombine will transform a vector or float load+store into an integer load+store, if the loaded value doesn't have any other uses. But we shouldn't rely on that, I think; if we're doing cost modeling based on the cost of integer operations, we should explicitly emit integer operations in IR.

In D48725#1148407, @efriedma wrote:

Currently, SLPVectorizer does not generate vector types smaller than TargetTransformInfo::getMinVectorRegisterBitWidth(). This is 128 on many targets, including x86. But that doesn't really make sense, in general; even if a target doesn't have 64-bit vector registers, it can emulate them using 128-bit vector registers. The loop vectorizer frequently takes advantage of this; the SLP vectorizer should also take advantage of this, independent of anything else.

There's also the possibility of emitting "vector" operations using GPRs. This generally makes sense; it's basically the same transform even if the available instructions are more limited. But this patch doesn't really do that: it emits IR operations using vector types. SelectionDAG legalization will generally prefer to emit vector operations to vector registers, if they're available, or just scalarize if there aren't any vector registers. There's basically one exception to that rule, which you've stumbled across; DAGCombine will transform a vector or float load+store into an integer load+store, if the loaded value doesn't have any other uses. But we shouldn't rely on that, I think; if we're doing cost modeling based on the cost of integer operations, we should explicitly emit integer operations in IR.

Thanks for the explanation; this is very useful.

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.
B - Do this in DAG, inside or near to LoadCombine.

I'll explore both solutions and create a patch with (B) so that we can compare.

In D48725#1147990, @courbet wrote:

In D48725#1147898, @ABataev wrote:

You need to update the cost model for this kind of transformation.

Thanks. This is my first change to the SLP vectorizer; do you have any pointers to documentation on how to do this?

Actually, since I only do load/store now, I think this is already covered by X86TTIImpl::getMemoryOpCost. Am I right?

Emit scalar values instead of vector values for SWAR.

Harbormaster completed remote builds in B19938: Diff 153721. Jul 2 2018, 8:33 AM

In D48725#1149214, @courbet wrote:

Thank you all for your comments.

So let me sum up the options from the various comments here:
A - Keep this change in the SLP vectorizer. This requires emitting GPR operations instead of vector operations, and updating the cost model.

I've updated the change with a crude implementation.

Shuffles and extracts are disabled because we can no longer rely on the
DAG to transform shuffles and extracts into the appropriate operations.
If we want to support them, we will have to reimplement them as integer
operations.
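As a rough idea of what reimplementing shuffles and extracts as integer operations would involve for the 2xi32-in-i64 case, here is a small sketch. It is not taken from the patch: SwapLanes and ExtractLane are hypothetical names, and the lane numbering (lane 0 in bits [31-0], lane 1 in bits [63-32]) is an assumption of the example.

```cpp
#include <cstdint>

// Shuffle <1, 0>: swapping the two 32-bit lanes is a rotate by 32 bits.
constexpr uint64_t SwapLanes(uint64_t v) { return (v << 32) | (v >> 32); }

// Extract: shift the requested lane down to bit 0 and truncate.
// Only lanes 0 and 1 exist, so the index is masked to one bit.
constexpr uint32_t ExtractLane(uint64_t v, unsigned lane) {
  return static_cast<uint32_t>(v >> (32 * (lane & 1u)));
}

static_assert(SwapLanes(0x1111111122222222ull) == 0x2222222211111111ull,
              "lanes are exchanged");
static_assert(ExtractLane(0x1111111122222222ull, 0) == 0x22222222u,
              "lane 0 is the low half");
static_assert(ExtractLane(0x1111111122222222ull, 1) == 0x11111111u,
              "lane 1 is the high half");
```

Wider or heterogeneous lane layouts would need masks in addition to shifts, which is part of why this was left out of the crude implementation.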

So now that I've done this I think I understand what @efriedma was saying: another take on it is to say that (taking X86 as an example), 128 is not the smallest vector, because we can do partial load/stores.

B - Do this in DAG, inside or near to LoadCombine.

Actually, I don't think the current case can be handled in the same way as MatchLoadCombine: in the case of MatchLoadCombine, the "or" instruction provides a way to link the stores together. In the case of two completely independent load/stores with nothing in between, as in the two_i32 test, there is nothing linking the instructions that could provide an entry point for merging them.

128 is not the smallest vector, because we can do partial load/stores

Essentially, yes.

Actually, I don't think the current case can be handled in the same way as MatchLoadCombine: in the case of MatchLoadCombine, the "or" instruction provides a way to link the stores together.

We have code to do this sort of merging in DAGCombiner::MergeConsecutiveStores. But it misses cases like the ones in your patch because combiner-global-alias-analysis is off by default. (I don't remember the full history of that, but IIRC the compile-time penalty was too large.)

In D48725#1150047, @efriedma wrote:

128 is not the smallest vector, because we can do partial load/stores

Essentially, yes.

Actually, I don't think the current case can be handled in the same way as MatchLoadCombine: in the case of MatchLoadCombine, the "or" instruction provides a way to link the stores together.

We have code to do this sort of merging in DAGCombiner::MergeConsecutiveStores. But it misses cases like the ones in your patch because combiner-global-alias-analysis is off by default. (I don't remember the full history of that, but IIRC the compile-time penalty was too large.)

Hm actually I had a look at MergeConsecutiveStores and it can actually merge non-vector and/or heterogeneous-sized values (D52643). It won't handle my case though because it considers load/store in chain order and considers any store to be potentially aliasing the following loads:

This gets merged (the chain is load-load-store-store):

S PartialCopy(const S& s) {
  S result;
  const auto ta = s.a;
  const auto tb = s.b;
  result.a = ta;
  result.b = tb;
  return result;
}

But not this (the chain is load-store-load-store):

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}

Or did I miss something?

Yes, like I said, your original testcase doesn't get merged by DAGCombine unless "-combiner-global-alias-analysis" is enabled (which it isn't, by default).
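The reason chain order matters can be shown with a small source-level example. This is an illustration of the general hazard, not of PartialCopy specifically: CopyInterleaved and CopyHoisted are hypothetical names, and the overlapping call in the usage note is constructed to show what a combiner must assume possible when it has no alias information.

```cpp
#include <cstdint>

// Chain order load-store-load-store: the second load happens after the
// first store, so it observes that store if dst and src overlap.
void CopyInterleaved(int32_t *dst, const int32_t *src) {
  dst[0] = src[0];
  dst[1] = src[1];
}

// Chain order load-load-store-store: both loads happen before any store,
// so the pair of loads and the pair of stores can each be merged.
void CopyHoisted(int32_t *dst, const int32_t *src) {
  const int32_t a = src[0];
  const int32_t b = src[1];
  dst[0] = a;
  dst[1] = b;
}
```

Called on an array {1, 2, 3} with dst pointing one element past src, CopyInterleaved leaves {1, 1, 1} while CopyHoisted leaves {1, 1, 2}. Rewriting the first form into the second is therefore only valid once the accesses are known not to overlap, which is consistent with the merge only happening when alias analysis is enabled in the combiner.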

In D48725#1249545, @efriedma wrote:

Yes, like I said, your original testcase doesn't get merged by DAGCombine unless "-combiner-global-alias-analysis" is enabled (which it isn't, by default).

Oh, I see, thanks. I missed the fact that using this flag will actually reorder the load/stores in the chain before entering CombineLoadStores (I was looking for analysis usage from CombineLoadStores).

The flag Eli pointed to is sufficient for my needs, so I'm going to abandon this revision for now.

Revision Contents

Path                                               Size
include/llvm/Transforms/Vectorize/SLPVectorizer.h  2 lines
lib/Transforms/Vectorize/SLPVectorizer.cpp         78 lines
test/Transforms/SLPVectorizer/X86/swar.ll          228 lines
test/Transforms/SLPVectorizer/X86/tiny-tree.ll     8 lines

Diff 153721

include/llvm/Transforms/Vectorize/SLPVectorizer.h

[132 unchanged lines not shown; context: private:]
   bool vectorizeSimpleInstructions(SmallVectorImpl<WeakVH> &Instructions,
                                    BasicBlock *BB, slpvectorizer::BoUpSLP &R);

   /// Scan the basic block and look for patterns that are likely to start
   /// a vectorization chain.
   bool vectorizeChainsInBlock(BasicBlock *BB, slpvectorizer::BoUpSLP &R);

   bool vectorizeStoreChain(ArrayRef<Value *> Chain, slpvectorizer::BoUpSLP &R,
-                           unsigned VecRegSize);
+                           unsigned VecRegSize, bool OnlyBitParallel);

   bool vectorizeStores(ArrayRef<StoreInst *> Stores, slpvectorizer::BoUpSLP &R);

   /// The store instructions in a basic block organized by base pointer.
   StoreListMap Stores;

   /// The getelementptr instructions in a basic block organized by base pointer.
   WeakTrackingVHListMap GEPs;
 };

 } // end namespace llvm

 #endif // LLVM_TRANSFORMS_VECTORIZE_SLPVECTORIZER_H

lib/Transforms/Vectorize/SLPVectorizer.cpp

[330 unchanged lines not shown]
 static Value *isOneOf(const InstructionsState &S, Value *Op) {
   auto *I = dyn_cast<Instruction>(Op);
   if (I && S.isOpcodeOrAlt(I))
     return Op;
   return S.OpValue;
 }

 /// \returns analysis of the Instructions in \p VL described in
 /// InstructionsState, the Opcode that we suppose the whole list
 /// could be vectorized even if its structure is diverse.
 static InstructionsState getSameOpcode(ArrayRef<Value *> VL,
                                        unsigned BaseIndex = 0) {
   // Make sure these are all Instructions.
   if (llvm::any_of(VL, [](Value *V) { return !isa<Instruction>(V); }))
     return InstructionsState(VL[BaseIndex], 0, 0);

   bool IsBinOp = isa<BinaryOperator>(VL[BaseIndex]);
[145 unchanged lines not shown; context: public:]

   /// \returns the vectorization cost of the subtree that starts at \p VL.
   /// A negative number means that this is profitable.
   int getTreeCost();

   /// Construct a vectorizable tree that starts at \p Roots, ignoring users for
   /// the purpose of scheduling and extraction in the \p UserIgnoreLst.
   void buildTree(ArrayRef<Value *> Roots,
+                 bool IsSwar,
                  ArrayRef<Value *> UserIgnoreLst = None);

   /// Construct a vectorizable tree that starts at \p Roots, ignoring users for
   /// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
   /// into account (anf updating it, if required) list of externally used
   /// values stored in \p ExternallyUsedValues.
   void buildTree(ArrayRef<Value *> Roots,
+                 bool IsSwar,
                  ExtraValueToDebugLocsMap &ExternallyUsedValues,
                  ArrayRef<Value *> UserIgnoreLst = None);

   /// Clear the internal data structures that are created by 'buildTree'.
   void deleteTree() {
     VectorizableTree.clear();
     ScalarToTreeEntry.clear();
     MustGather.clear();
     ExternalUses.clear();
     NumOpsWantToKeepOrder.clear();
     NumOpsWantToKeepOriginalOrder = 0;
     for (auto &Iter : BlocksSchedules) {
       BlockScheduling *BS = Iter.second.get();
       BS->clear();
     }
     MinBWs.clear();
+    IsSwar = false;
   }

   unsigned getTreeSize() const { return VectorizableTree.size(); }

   /// Perform LICM and CSE on the newly generated gather sequences.
   void optimizeGatherSequence();

   /// \returns The best order of instructions for vectorization.
[36 unchanged lines not shown; context: public:]
   ///
   /// \returns number of elements in vector if isomorphism exists, 0 otherwise.
   unsigned canMapToVector(Type *T, const DataLayout &DL) const;

   /// \returns True if the VectorizableTree is both tiny and not fully
   /// vectorizable. We do not vectorize such trees.
   bool isTreeTinyAndNotFullyVectorizable();

+  /// \returns whether the VectorizableTree has external uses.
+  bool hasExternalUses() const { return !ExternalUses.empty(); }
+
   OptimizationRemarkEmitter *getORE() { return ORE; }

 private:
   struct TreeEntry;

   /// Checks if all users of \p I are the part of the vectorization tree.
   bool areAllUsersVectorized(Instruction *I) const;

[619 unchanged lines not shown; context: #endif]
   IRBuilder<> Builder;

   /// A map of scalar integer values to the smallest bit width with which they
   /// can legally be represented. The values map to (width, signed) pairs,
   /// where "width" indicates the minimum bit width and "signed" is True if the
   /// value must be signed-extended, rather than zero-extended, back to its
   /// original width.
   MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;
+
+  /// Is this a SWAR vectorization ? If true, the result type is a scalar type
+  /// and not a vector type. The "lanes" of the vector are contiguous bit
+  /// intervals (e.g. i64 is split into bits [63-32] and [31-0]).
+  bool IsSwar = false;
 };

 } // end namespace slpvectorizer

 template <> struct GraphTraits<BoUpSLP *> {
   using TreeEntry = BoUpSLP::TreeEntry;

   /// NodeRef has to be a pointer per the GraphWriter.
[67 unchanged lines not shown; context: if (Entry->NeedToGather)]
       return "color=red";
     return "";
   }
 };

 } // end namespace llvm

 void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
+                        bool IsSwar,
                         ArrayRef<Value *> UserIgnoreLst) {
   ExtraValueToDebugLocsMap ExternallyUsedValues;
-  buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
+  buildTree(Roots, IsSwar, ExternallyUsedValues, UserIgnoreLst);
 }

 void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
+                        bool IsSwar,
                         ExtraValueToDebugLocsMap &ExternallyUsedValues,
                         ArrayRef<Value *> UserIgnoreLst) {
   deleteTree();
+  this->IsSwar = IsSwar;
   UserIgnoreList = UserIgnoreLst;
   if (!allSameType(Roots))
     return;
   buildTree_rec(Roots, 0, -1);

   // Collect the values that we need to extract from the tree.
   for (TreeEntry &EIdx : VectorizableTree) {
     TreeEntry *Entry = &EIdx;
[48 unchanged lines not shown; context: for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {]
         LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
                           << Lane << " from " << *Scalar << ".\n");
         ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));
       }
     }
   }
 }

+static bool isBitParallel(unsigned Op) {
+  // FIXME: Handle ICmp, And, Or, Xor, BitCast.
+  return Op == Instruction::Load || Op == Instruction::Store;
+}
+
 void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
                             int UserTreeIdx) {
   assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

   InstructionsState S = getSameOpcode(VL);
   if (Depth == RecursionMaxDepth) {
     LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
     newTreeEntry(VL, false, UserTreeIdx);
[121 unchanged lines not shown; context: assert((!BS.getScheduleData(VL0) ||]
            "tryScheduleBundle should cancelScheduling on failure");
     newTreeEntry(VL, false, UserTreeIdx, ReuseShuffleIndicies);
     return;
   }
   LLVM_DEBUG(dbgs() << "SLP: We are able to schedule this bundle.\n");

   unsigned ShuffleOrOp = S.isAltShuffle() ?
                 (unsigned) Instruction::ShuffleVector : S.Opcode;

+  if (IsSwar && !isBitParallel(ShuffleOrOp)) {
+    LLVM_DEBUG(dbgs() << "SLP: Gathering due to non bit-parallel SWAR.\n");
+    newTreeEntry(VL, false, UserTreeIdx);
+    return;
+  }
+
   switch (ShuffleOrOp) {
     case Instruction::PHI: {
       PHINode *PH = dyn_cast<PHINode>(VL0);

       // Check for terminator values (e.g. invoke).
       for (unsigned j = 0; j < VL.size(); ++j)
         for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
           TerminatorInst *Term = dyn_cast<TerminatorInst>(
[110 unchanged lines not shown; context: case Instruction::Load: {]
         if (CurrentOrder.empty()) {
           // Original loads are consecutive and does not require reordering.
           ++NumOpsWantToKeepOriginalOrder;
           newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
                        ReuseShuffleIndicies);
           LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");
         } else {
           // Need to reorder.
+          if (IsSwar) {
+            LLVM_DEBUG(dbgs() << "SLP: shuffle in SWAR.\n");
+            newTreeEntry(VL, false, UserTreeIdx);
+            return;
+          }
           auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
           ++I->getSecond();
           newTreeEntry(VL, /*Vectorized=*/true, UserTreeIdx,
                        ReuseShuffleIndicies, I->getFirst());
           LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
         }
         return;
       }
[1,367 unchanged lines not shown; context: if (E->VectorizedValue) {]
     return E->VectorizedValue;
   }

   InstructionsState S = getSameOpcode(E->Scalars);
   Instruction *VL0 = cast<Instruction>(S.OpValue);
   Type *ScalarTy = VL0->getType();
   if (StoreInst *SI = dyn_cast<StoreInst>(VL0))
     ScalarTy = SI->getValueOperand()->getType();
-  VectorType *VecTy = VectorType::get(ScalarTy, E->Scalars.size());
+  VectorType *const VecTy = IsSwar ? nullptr : VectorType::get(ScalarTy, E->Scalars.size());
+  IntegerType *const SwarTy = IsSwar ? IntegerType::get(F->getContext(), ScalarTy->getIntegerBitWidth() * E->Scalars.size()) : nullptr;
+  Type* const VecOrSwarTy = IsSwar ? static_cast<Type*>(SwarTy) : static_cast<Type*>(VecTy);

   bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();

   if (E->NeedToGather) {
     setInsertPointAfterBundle(E->Scalars, S);
     auto *V = Gather(E->Scalars, VecTy);
     if (NeedToShuffleReuses) {
       V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
[279 unchanged lines not shown; context: case Instruction::Load: {]
       }
       setInsertPointAfterBundle(E->Scalars, S);

       LoadInst *LI = cast<LoadInst>(VL0);
       Type *ScalarLoadTy = LI->getType();
       unsigned AS = LI->getPointerAddressSpace();

       Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
-                                            VecTy->getPointerTo(AS));
+                                            VecOrSwarTy->getPointerTo(AS));

       // The pointer operand uses an in-tree scalar so we add the new BitCast to
       // ExternalUses list to make sure that an extract will be generated in the
       // future.
       Value *PO = LI->getPointerOperand();
       if (getTreeEntry(PO))
         ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));

       unsigned Alignment = LI->getAlignment();
       LI = Builder.CreateLoad(VecPtr);
       if (!Alignment) {
         Alignment = DL->getABITypeAlignment(ScalarLoadTy);
       }
       LI->setAlignment(Alignment);
       Value *V = propagateMetadata(LI, E->Scalars);
       if (IsReorder) {
+        assert(!IsSwar);
         OrdersType Mask;
         inversePermutation(E->ReorderIndices, Mask);
         V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),
                                         Mask, "reorder_shuffle");
       }
       if (NeedToShuffleReuses) {
+        assert(!IsSwar);
         // TODO: Merge this shuffle with the ReorderShuffleMask.
         V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
                                         E->ReuseShuffleIndices, "shuffle");
       }
       E->VectorizedValue = V;
       ++NumVectorInstructions;
       return V;
     }
     case Instruction::Store: {
       StoreInst *SI = cast<StoreInst>(VL0);
       unsigned Alignment = SI->getAlignment();
       unsigned AS = SI->getPointerAddressSpace();

       ValueList ScalarStoreValues;
       for (Value *V : E->Scalars)
         ScalarStoreValues.push_back(cast<StoreInst>(V)->getValueOperand());

       setInsertPointAfterBundle(E->Scalars, S);

       Value *VecValue = vectorizeTree(ScalarStoreValues);
       Value *ScalarPtr = SI->getPointerOperand();
-      Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecTy->getPointerTo(AS));
+      Value *VecPtr = Builder.CreateBitCast(ScalarPtr, VecOrSwarTy->getPointerTo(AS));
       StoreInst *ST = Builder.CreateStore(VecValue, VecPtr);

       // The pointer operand uses an in-tree scalar, so add the new BitCast to
       // ExternalUses to make sure that an extract will be generated in the
       // future.
       if (getTreeEntry(ScalarPtr))
         ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));

       if (!Alignment)
         Alignment = DL->getABITypeAlignment(SI->getValueOperand()->getType());

       ST->setAlignment(Alignment);
       Value *V = propagateMetadata(ST, E->Scalars);
       if (NeedToShuffleReuses) {
+        assert(!IsSwar);
         V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
                                         E->ReuseShuffleIndices, "shuffle");
       }
       E->VectorizedValue = V;
       ++NumVectorInstructions;
       return V;
     }
     case Instruction::GetElementPtr: {
[... 185 lines not shown ...]
if (!MinBWs.count(ScalarRoot))
return Ex;		return Ex;
if (MinBWs[ScalarRoot].second)		if (MinBWs[ScalarRoot].second)
return Builder.CreateSExt(Ex, ScalarType);		return Builder.CreateSExt(Ex, ScalarType);
return Builder.CreateZExt(Ex, ScalarType);		return Builder.CreateZExt(Ex, ScalarType);
};		};

// Extract all of the elements with the external uses.		// Extract all of the elements with the external uses.
for (const auto &ExternalUse : ExternalUses) {		for (const auto &ExternalUse : ExternalUses) {
		assert(!IsSwar && "not implemented: extract in SWAR");
Value *Scalar = ExternalUse.Scalar;		Value *Scalar = ExternalUse.Scalar;
llvm::User *User = ExternalUse.User;		llvm::User *User = ExternalUse.User;

// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
[... 1,051 lines not shown ...]
static bool hasValueBeenRAUWed(ArrayRef<Value *> VL,
ArrayRef<WeakTrackingVH> VH, unsigned SliceBegin,		ArrayRef<WeakTrackingVH> VH, unsigned SliceBegin,
unsigned SliceSize) {		unsigned SliceSize) {
VL = VL.slice(SliceBegin, SliceSize);		VL = VL.slice(SliceBegin, SliceSize);
VH = VH.slice(SliceBegin, SliceSize);		VH = VH.slice(SliceBegin, SliceSize);
return !std::equal(VL.begin(), VL.end(), VH.begin());		return !std::equal(VL.begin(), VL.end(), VH.begin());
}		}

bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,		bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
unsigned VecRegSize) {		unsigned VecRegSize, const bool IsSwar) {
const unsigned ChainLen = Chain.size();		const unsigned ChainLen = Chain.size();
LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << ChainLen		LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << ChainLen
<< "\n");		<< "\n");
const unsigned Sz = R.getVectorElementSize(Chain[0]);		const unsigned Sz = R.getVectorElementSize(Chain[0]);
const unsigned VF = VecRegSize / Sz;		const unsigned VF = VecRegSize / Sz;

if (!isPowerOf2_32(Sz) || VF < 2)		if (!isPowerOf2_32(Sz) || VF < 2)
return false;		return false;

// Keep track of values that were deleted by vectorizing in the loop below.		// Keep track of values that were deleted by vectorizing in the loop below.
const SmallVector<WeakTrackingVH, 8> TrackValues(Chain.begin(), Chain.end());		const SmallVector<WeakTrackingVH, 8> TrackValues(Chain.begin(), Chain.end());

bool Changed = false;		bool Changed = false;
// Look for profitable vectorizable trees at all offsets, starting at zero.		// Look for profitable vectorizable trees at all offsets, starting at zero.
for (unsigned i = 0, e = ChainLen; i + VF <= e; ++i) {		for (unsigned i = 0, e = ChainLen; i + VF <= e; ++i) {

// Check that a previous iteration of this loop did not delete the Value.		// Check that a previous iteration of this loop did not delete the Value.
if (hasValueBeenRAUWed(Chain, TrackValues, i, VF))		if (hasValueBeenRAUWed(Chain, TrackValues, i, VF))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << i		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << i
<< "\n");		<< "\n");
ArrayRef<Value *> Operands = Chain.slice(i, VF);		ArrayRef<Value *> Operands = Chain.slice(i, VF);

R.buildTree(Operands);		R.buildTree(Operands, IsSwar);
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;
		if (IsSwar && R.hasExternalUses()) {
		LLVM_DEBUG(dbgs() << "SLP: Ignoring SWAR tree with external uses\n");
		continue;
		}

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();

int Cost = R.getTreeCost();		int Cost = R.getTreeCost();

LLVM_DEBUG(dbgs() << "SLP: Found cost=" << Cost << " for VF=" << VF		LLVM_DEBUG(dbgs() << "SLP: Found cost=" << Cost << " for VF=" << VF
<< "\n");		<< "\n");
if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
[... 78 lines not shown ...]
while ((Tails.count(I) || Heads.count(I)) && !VectorizedStores.count(I)) {
// Move to the next value in the chain.		// Move to the next value in the chain.
I = ConsecutiveChain[I];		I = ConsecutiveChain[I];
}		}

// FIXME: Is division-by-2 the correct step? Should we assert that the		// FIXME: Is division-by-2 the correct step? Should we assert that the
// register size is a power-of-2?		// register size is a power-of-2?
for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize();		for (unsigned Size = R.getMaxVecRegSize(); Size >= R.getMinVecRegSize();
Size /= 2) {		Size /= 2) {
if (vectorizeStoreChain(Operands, R, Size)) {		if (vectorizeStoreChain(Operands, R, Size, false)) {
// Mark the vectorized stores so that we don't vectorize them again.		// Mark the vectorized stores so that we don't vectorize them again.
VectorizedStores.insert(Operands.begin(), Operands.end());		VectorizedStores.insert(Operands.begin(), Operands.end());
Changed = true;		Changed = true;
break;		break;
}		}
}		}
		// Now try to vectorize using SWAR (https://en.wikipedia.org/wiki/SWAR).
		// Only allow operations that are intrinsically bit-parallel.
		// FIXME: Extend to logical bitwise operations (e.g. XOR/OR/AND). We will
		// need to check flags.
		// FIXME: Extend to heterogeneous sizes (< 2xi8, 1xi16, 1xi32>). This is
		// easy for copies but requires careful handling of shuffles to avoid
		// generating inefficient code.
		if (!Changed && vectorizeStoreChain(Operands, R, TTI->getRegisterBitWidth(false), true)) {
		// Mark the vectorized stores so that we don't vectorize them again.
		VectorizedStores.insert(Operands.begin(), Operands.end());
		Changed = true;
		break;
		}
}		}

return Changed;		return Changed;
}		}

void SLPVectorizerPass::collectSeedInstructions(BasicBlock *BB) {		void SLPVectorizerPass::collectSeedInstructions(BasicBlock *BB) {
// Initialize the collections. We will make a single pass over the block.		// Initialize the collections. We will make a single pass over the block.
Stores.clear();		Stores.clear();
[... 111 lines not shown ...]
for (unsigned I = NextInst; I < MaxInst; ++I) {
// Check that a previous iteration of this loop did not delete the Value.		// Check that a previous iteration of this loop did not delete the Value.
if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))		if (hasValueBeenRAUWed(VL, TrackValues, I, OpsWidth))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");
ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);		ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);

R.buildTree(Ops);		R.buildTree(Ops, false);
Optional<ArrayRef<unsigned>> Order = R.bestOrder();		Optional<ArrayRef<unsigned>> Order = R.bestOrder();
// TODO: check if we can allow reordering for more cases.		// TODO: check if we can allow reordering for more cases.
if (AllowReorder && Order) {		if (AllowReorder && Order) {
// TODO: reorder tree nodes without tree rebuilding.		// TODO: reorder tree nodes without tree rebuilding.
// Conceptually, there is nothing actually preventing us from trying to		// Conceptually, there is nothing actually preventing us from trying to
// reorder a larger list. In fact, we do exactly this when vectorizing		// reorder a larger list. In fact, we do exactly this when vectorizing
// reductions. However, at this point, we only expect to get here when		// reductions. However, at this point, we only expect to get here when
// there are exactly two operations.		// there are exactly two operations.
assert(Ops.size() == 2);		assert(Ops.size() == 2);
Value *ReorderedOps[] = {Ops[1], Ops[0]};		Value *ReorderedOps[] = {Ops[1], Ops[0]};
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, false, None);
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
int Cost = R.getTreeCost() - UserCost;		int Cost = R.getTreeCost() - UserCost;
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);
[... 721 lines not shown ...]
bool tryToReduce(BoUpSLP &V, TargetTransformInfo *TTI) {
// to use it.		// to use it.
for (auto &Pair : ExtraArgs)		for (auto &Pair : ExtraArgs)
ExternallyUsedValues[Pair.second].push_back(Pair.first);		ExternallyUsedValues[Pair.second].push_back(Pair.first);
SmallVector<Value *, 16> IgnoreList;		SmallVector<Value *, 16> IgnoreList;
for (auto &V : ReductionOps)		for (auto &V : ReductionOps)
IgnoreList.append(V.begin(), V.end());		IgnoreList.append(V.begin(), V.end());
while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {		while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);		auto VL = makeArrayRef(&ReducedVals[i], ReduxWidth);
V.buildTree(VL, ExternallyUsedValues, IgnoreList);		V.buildTree(VL, false, ExternallyUsedValues, IgnoreList);
Optional<ArrayRef<unsigned>> Order = V.bestOrder();		Optional<ArrayRef<unsigned>> Order = V.bestOrder();
// TODO: Handle orders of size less than number of elements in the vector.		// TODO: Handle orders of size less than number of elements in the vector.
if (Order && Order->size() == VL.size()) {		if (Order && Order->size() == VL.size()) {
// TODO: reorder tree nodes without tree rebuilding.		// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(VL.size());		SmallVector<Value *, 4> ReorderedOps(VL.size());
llvm::transform(*Order, ReorderedOps.begin(),		llvm::transform(*Order, ReorderedOps.begin(),
[VL](const unsigned Idx) { return VL[Idx]; });		[VL](const unsigned Idx) { return VL[Idx]; });
V.buildTree(ReorderedOps, ExternallyUsedValues, IgnoreList);		V.buildTree(ReorderedOps, false, ExternallyUsedValues, IgnoreList);
}		}
if (V.isTreeTinyAndNotFullyVectorizable())		if (V.isTreeTinyAndNotFullyVectorizable())
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int TreeCost = V.getTreeCost();		int TreeCost = V.getTreeCost();
[... 711 lines not shown ...]
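
The width selection that the vectorizeStores change above performs can be sketched outside LLVM as follows. This is only an illustration under stated assumptions: `lanesFor`, `pickStoreWidth`, and `WidthChoice` are hypothetical names that do not appear in the patch, the cost model and tree building are ignored entirely, and a width is considered usable as soon as enough lanes fit into the store chain.

```cpp
// Hypothetical helpers for illustration only; none of these names are in the patch.
static bool isPowerOf2(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Mirrors the VF computation in vectorizeStoreChain: VF = register width /
// element width, rejected when the element width is not a power of two or
// when fewer than two lanes fit. Returns 0 for "rejected".
static unsigned lanesFor(unsigned RegBits, unsigned ElemBits) {
  if (!isPowerOf2(ElemBits))
    return 0;
  const unsigned VF = RegBits / ElemBits;
  return VF >= 2 ? VF : 0;
}

struct WidthChoice {
  unsigned RegBits; // 0 means no width was found.
  bool IsSwar;      // true when the GPR fallback was used.
};

// Try vector register widths from MaxVecBits down to MinVecBits (halving each
// time), then fall back to the GPR width, as the patched loop does.
// Assumption: a width "succeeds" whenever VF lanes fit into the chain.
static WidthChoice pickStoreWidth(unsigned ChainLen, unsigned ElemBits,
                                  unsigned MaxVecBits, unsigned MinVecBits,
                                  unsigned GprBits) {
  for (unsigned Size = MaxVecBits; Size != 0 && Size >= MinVecBits; Size /= 2) {
    const unsigned VF = lanesFor(Size, ElemBits);
    if (VF != 0 && VF <= ChainLen)
      return {Size, false};
  }
  const unsigned VF = lanesFor(GprBits, ElemBits);
  if (VF != 0 && VF <= ChainLen)
    return {GprBits, true};
  return {0, false};
}
```

With assumed widths of 128 bits for the only vector register size and 64 bits for the GPR, a chain of two i32 stores does not fit the vector width (VF = 4) but does fit the GPR width (VF = 2), which corresponds to the two_i32 case in swar.ll. A chain of four i32 stores is taken by the vector width first, as in four_i32.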

test/Transforms/SLPVectorizer/X86/swar.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mcpu=corei7 | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux"

				; This tests vectorization of bit-parallel operations (e.g. COPY) using SWAR.
				;
				; four_i32 tests vectorization of 4xi32 copy. This is vectorized using a vector
				; register.
				;
				; two_i32 tests vectorization of 2xi32 copy. Copying (load/store without
				; modifications) is trivially bit-parallel and can be vectorized using SWAR.
				;
				; two_i32_swap tests vectorization of 2xi32 copy with swapping.
				;
				; two_i32_add negative-tests vectorization of 2xi32 ADD. This should NOT be
				; vectorized as ADD is not bit-parallel.


				; four_i32
				;
				;struct S {
				; int32_t a;
				; int32_t b;
				; int32_t c;
				; int32_t d;
				; int64_t e;
				; int32_t f;
				;};
				;
				;S copy_4xi32(const S& s) {
				; S result;
				; result.a = s.a;
				; result.b = s.b;
				; result.c = s.c;
				; result.d = s.d;
				; return result;
				;}

				%struct.S4x32 = type { i32, i32, i32, i32, i64, i32 }

				define void @four_i32(%struct.S4x32* noalias nocapture sret, %struct.S4x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @four_i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32:%.*]], %struct.S4x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP0:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP0]], i64 0, i32 1
				; CHECK-NEXT: [[C_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP1]], i64 0, i32 2
				; CHECK-NEXT: [[C_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP0]], i64 0, i32 2
				; CHECK-NEXT: [[D_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP1]], i64 0, i32 3
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A_SRC_PTR]] to <4 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 8
				; CHECK-NEXT: [[D_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S4X32]], %struct.S4x32* [[TMP0]], i64 0, i32 3
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[A_DST_PTR]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_dst_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %0, i64 0, i32 0
				store i32 %a, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_dst_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %0, i64 0, i32 1
				store i32 %b, i32* %b_dst_ptr, align 8
				%c_src_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %1, i64 0, i32 2
				%c = load i32, i32* %c_src_ptr, align 8
				%c_dst_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %0, i64 0, i32 2
				store i32 %c, i32* %c_dst_ptr, align 8
				%d_src_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %1, i64 0, i32 3
				%d = load i32, i32* %d_src_ptr, align 8
				%d_dst_ptr = getelementptr inbounds %struct.S4x32, %struct.S4x32* %0, i64 0, i32 3
				store i32 %d, i32* %d_dst_ptr, align 8
				ret void
				}

				; two_i32
				;
				;struct S {
				; int32_t a;
				; int32_t b;
				; int64_t c;
				; int32_t d;
				;};
				;
				;S copy_2xi32(const S& s) {
				; S result;
				; result.a = s.a;
				; result.b = s.b;
				; return result;
				;}

				%struct.S2x32 = type { i32, i32, i64, i32 }

				define void @two_i32(%struct.S2x32* noalias nocapture sret, %struct.S2x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @two_i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32:%.*]], %struct.S2x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A_SRC_PTR]] to <2 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 8
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0]], i64 0, i32 1
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[A_DST_PTR]] to <2 x i32>*
				; CHECK-NEXT: store <2 x i32> [[TMP3]], <2 x i32>* [[TMP4]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 0
				store i32 %a, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 1
				store i32 %b, i32* %b_dst_ptr, align 8
				ret void
				}

				define void @two_i32_swap(%struct.S2x32* noalias nocapture sret, %struct.S2x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @two_i32_swap(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32:%.*]], %struct.S2x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0:%.*]], i64 0, i32 1
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A_SRC_PTR]] to <2 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 8
				; CHECK-NEXT: [[REORDER_SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0]], i64 0, i32 0
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[B_DST_PTR]] to <2 x i32>*
				; CHECK-NEXT: store <2 x i32> [[REORDER_SHUFFLE]], <2 x i32>* [[TMP4]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 1
				store i32 %a, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 0
				store i32 %b, i32* %b_dst_ptr, align 8
				ret void
				}

				define void @two_i32_add(%struct.S2x32* noalias nocapture sret, %struct.S2x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @two_i32_add(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32:%.*]], %struct.S2x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A:%.*]] = load i32, i32* [[A_SRC_PTR]], align 8
				; CHECK-NEXT: [[A_PLUS_1:%.*]] = add nsw i32 [[A]], 1
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0:%.*]], i64 0, i32 0
				; CHECK-NEXT: store i32 [[A_PLUS_1]], i32* [[A_DST_PTR]], align 8
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[B:%.*]] = load i32, i32* [[B_SRC_PTR]], align 8
				; CHECK-NEXT: [[B_PLUS_1:%.*]] = add nsw i32 [[B]], 1
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0]], i64 0, i32 1
				; CHECK-NEXT: store i32 [[B_PLUS_1]], i32* [[B_DST_PTR]], align 8
				; CHECK-NEXT: ret void
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_plus_1 = add nsw i32 %a, 1
				%a_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 0
				store i32 %a_plus_1, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_plus_1 = add nsw i32 %b, 1
				%b_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 1
				store i32 %b_plus_1, i32* %b_dst_ptr, align 8
				ret void
				}

				define i32 @two_i32_extract(%struct.S2x32* noalias nocapture, %struct.S2x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @two_i32_extract(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32:%.*]], %struct.S2x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A_SRC_PTR]] to <2 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 8
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0]], i64 0, i32 1
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[A_DST_PTR]] to <2 x i32>*
				; CHECK-NEXT: store <2 x i32> [[TMP3]], <2 x i32>* [[TMP4]], align 8
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP3]], i32 1
				; CHECK-NEXT: ret i32 [[TMP5]]
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 0
				store i32 %a, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 1
				store i32 %b, i32* %b_dst_ptr, align 8
				ret i32 %b
				}

				define i32 @two_i32_insert(%struct.S2x32* noalias nocapture, %struct.S2x32* nocapture readonly dereferenceable(24)) {
				; CHECK-LABEL: @two_i32_insert(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32:%.*]], %struct.S2x32* [[TMP1:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[A_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0:%.*]], i64 0, i32 0
				; CHECK-NEXT: [[B_SRC_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP1]], i64 0, i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A_SRC_PTR]] to <2 x i32>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <2 x i32>, <2 x i32>* [[TMP2]], align 8
				; CHECK-NEXT: [[B_DST_PTR:%.*]] = getelementptr inbounds [[STRUCT_S2X32]], %struct.S2x32* [[TMP0]], i64 0, i32 1
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[A_DST_PTR]] to <2 x i32>*
				; CHECK-NEXT: store <2 x i32> [[TMP3]], <2 x i32>* [[TMP4]], align 8
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i32> [[TMP3]], i32 1
				; CHECK-NEXT: ret i32 [[TMP5]]
				;
				entry:
				%a_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 0
				%a = load i32, i32* %a_src_ptr, align 8
				%a_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 0
				store i32 %a, i32* %a_dst_ptr, align 8
				%b_src_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %1, i64 0, i32 1
				%b = load i32, i32* %b_src_ptr, align 8
				%b_dst_ptr = getelementptr inbounds %struct.S2x32, %struct.S2x32* %0, i64 0, i32 1
				store i32 %b, i32* %b_dst_ptr, align 8
				ret i32 %b
				}
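
The distinction these tests draw between bit-parallel and non-bit-parallel operations can be made concrete with a small plain C++ sketch, independent of LLVM. All names here (`Pair`, `loadPacked`, `storePacked`, `copySwar`, `xorSwar`, `addNaive`) are illustrative and not part of the patch, and unsigned lanes are used so that overflow is well defined. A copy or an XOR applied to two packed 32-bit lanes through one 64-bit word gives the same result as doing it lane by lane, while a single 64-bit ADD does not, because a carry out of one lane leaks into the other.

```cpp
#include <cstdint>
#include <cstring>

// Two adjacent 32-bit lanes, laid out like fields a and b of the struct S
// used in the tests (illustrative type, not from the patch).
struct Pair {
  uint32_t a;
  uint32_t b;
};
static_assert(sizeof(Pair) == sizeof(uint64_t), "lanes must fill one 64-bit word");

// Read both lanes with one GPR-sized access.
static uint64_t loadPacked(const Pair &P) {
  uint64_t W;
  std::memcpy(&W, &P, sizeof W);
  return W;
}

// Write both lanes with one GPR-sized access.
static Pair storePacked(uint64_t W) {
  Pair P;
  std::memcpy(&P, &W, sizeof P);
  return P;
}

// Bit-parallel: the bits are moved unchanged, so lane boundaries do not matter.
static Pair copySwar(const Pair &Src) { return storePacked(loadPacked(Src)); }

// Bit-parallel: each result bit depends only on the same bit of the inputs.
static Pair xorSwar(const Pair &X, const Pair &Y) {
  return storePacked(loadPacked(X) ^ loadPacked(Y));
}

// NOT bit-parallel: a plain 64-bit add lets a carry cross the lane boundary.
static Pair addNaive(const Pair &X, const Pair &Y) {
  return storePacked(loadPacked(X) + loadPacked(Y));
}
```

For example, adding `{1, 1}` to `{0xFFFFFFFF, 0xFFFFFFFF}` lane by lane yields `{0, 0}`, whereas `addNaive` leaves a stray 1 in whichever lane holds the upper half of the word. This is the reason two_i32_add is expected to stay scalar.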

test/Transforms/SLPVectorizer/X86/tiny-tree.ll

[... 166 lines not shown ...]
	; CHECK-NEXT: [[SRC_ADDR_021:%.*]] = phi float* [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[SRC:%.*]], [[ENTRY]] ]			; CHECK-NEXT: [[SRC_ADDR_021:%.*]] = phi float* [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[SRC:%.*]], [[ENTRY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = load float, float* [[SRC_ADDR_021]], align 4			; CHECK-NEXT: [[TMP0:%.*]] = load float, float* [[SRC_ADDR_021]], align 4
	; CHECK-NEXT: store float [[TMP0]], float* [[DST_ADDR_022]], align 4			; CHECK-NEXT: store float [[TMP0]], float* [[DST_ADDR_022]], align 4
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 4			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 4
	; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX2]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX2]], align 4
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 1			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 1
	; CHECK-NEXT: store float [[TMP1]], float* [[ARRAYIDX3]], align 4			; CHECK-NEXT: store float [[TMP1]], float* [[ARRAYIDX3]], align 4
	; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 2			; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 2
	; CHECK-NEXT: [[TMP2:%.*]] = load float, float* [[ARRAYIDX4]], align 4
	; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 2			; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 2
	; CHECK-NEXT: store float [[TMP2]], float* [[ARRAYIDX5]], align 4
	; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 3			; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 3
	; CHECK-NEXT: [[TMP3:%.*]] = load float, float* [[ARRAYIDX6]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = bitcast float* [[ARRAYIDX4]] to <2 x float>*
				; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* [[TMP2]], align 4
	; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 3			; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 3
	; CHECK-NEXT: store float [[TMP3]], float* [[ARRAYIDX7]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = bitcast float* [[ARRAYIDX5]] to <2 x float>*
				; CHECK-NEXT: store <2 x float> [[TMP3]], <2 x float>* [[TMP4]], align 4
	; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 [[I_023]]			; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds float, float* [[SRC_ADDR_021]], i64 [[I_023]]
	; CHECK-NEXT: [[ADD_PTR8]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 [[I_023]]			; CHECK-NEXT: [[ADD_PTR8]] = getelementptr inbounds float, float* [[DST_ADDR_022]], i64 [[I_023]]
	; CHECK-NEXT: [[INC]] = add i64 [[I_023]], 1			; CHECK-NEXT: [[INC]] = add i64 [[I_023]], 1
	; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INC]], [[COUNT]]			; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INC]], [[COUNT]]
	; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]]			; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY]]
	; CHECK: for.end:			; CHECK: for.end:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
[... 79 lines not shown ...]

Revision Contents

Diff 153721