This builds on the or-reduction bailout that was added with D67841. We still do not have IR-level load combining, although that could be a target-specific enhancement for -vector-combine.
The heuristic is narrowly defined to catch the motivating case from PR39538:
https://bugs.llvm.org/show_bug.cgi?id=39538
...while preserving existing functionality.
That is, there's an existing test of a pure load/zext/store sequence at llvm/test/Transforms/SLPVectorizer/X86/cast.ll that is unmodified by (and therefore not visible in) this patch. That's the reason for the difference in logic: this check requires the 'or' instructions, so the pure load/zext/store pattern is still vectorized. The chances that vectorization actually helps a memory-bound sequence like that seem small, but the vectorized form looks nicer:
  vpmovzxwd (%rsi), %xmm0
  vmovdqu   %xmm0, (%rdi)
rather than:
  movzwl (%rsi), %eax
  movl   %eax, (%rdi)
  ...
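For reference, the pattern covered by cast.ll is roughly a widening copy like the following C sketch (the function name and exact types are assumptions for illustration, not the actual test):

  // Hypothetical sketch of a pure load/zext/store pattern with no 'or'
  // instructions: each u16 is zero-extended to u32 and stored. Because
  // there is no shift/or chain, the new bailout does not apply and SLP
  // can still vectorize this into the vpmovzxwd/vmovdqu pair shown above.
  void widen_copy(const unsigned short *src, unsigned int *dst) {
    for (int i = 0; i < 4; ++i)
      dst[i] = src[i];
  }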
In the motivating test, we avoid creating a vector mess that is unrecoverable in the backend, and SDAG forms the expected bswap instructions after load combining. Instead of this:
  movzbl    (%rdi), %eax
  vmovd     %eax, %xmm0
  movzbl    1(%rdi), %eax
  vmovd     %eax, %xmm1
  movzbl    2(%rdi), %eax
  vpinsrb   $4, 4(%rdi), %xmm0, %xmm0
  vpinsrb   $8, 8(%rdi), %xmm0, %xmm0
  vpinsrb   $12, 12(%rdi), %xmm0, %xmm0
  vmovd     %eax, %xmm2
  movzbl    3(%rdi), %eax
  vpinsrb   $1, 5(%rdi), %xmm1, %xmm1
  vpinsrb   $2, 9(%rdi), %xmm1, %xmm1
  vpinsrb   $3, 13(%rdi), %xmm1, %xmm1
  vpslld    $24, %xmm0, %xmm0
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpslld    $16, %xmm1, %xmm1
  vpor      %xmm0, %xmm1, %xmm0
  vpinsrb   $1, 6(%rdi), %xmm2, %xmm1
  vmovd     %eax, %xmm2
  vpinsrb   $2, 10(%rdi), %xmm1, %xmm1
  vpinsrb   $3, 14(%rdi), %xmm1, %xmm1
  vpinsrb   $1, 7(%rdi), %xmm2, %xmm2
  vpinsrb   $2, 11(%rdi), %xmm2, %xmm2
  vpmovzxbd %xmm1, %xmm1 # xmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero
  vpinsrb   $3, 15(%rdi), %xmm2, %xmm2
  vpslld    $8, %xmm1, %xmm1
  vpmovzxbd %xmm2, %xmm2 # xmm2 = xmm2[0],zero,zero,zero,xmm2[1],zero,zero,zero,xmm2[2],zero,zero,zero,xmm2[3],zero,zero,zero
  vpor      %xmm2, %xmm1, %xmm1
  vpor      %xmm1, %xmm0, %xmm0
  vmovdqu   %xmm0, (%rsi)
...we now get:

  movl   (%rdi), %eax
  movl   4(%rdi), %ecx
  movl   8(%rdi), %edx
  movbel %eax, (%rsi)
  movbel %ecx, 4(%rsi)
  movl   12(%rdi), %ecx
  movbel %edx, 8(%rsi)
  movbel %ecx, 12(%rsi)
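For context, the motivating pattern is roughly a big-endian 32-bit value assembled from bytes, repeated four times. The following C sketch is a reconstruction based on the assembly above (the function name and types are assumptions), not the exact test case from the bug report:

  // Hypothetical reconstruction of the PR39538-style pattern: each u32 is
  // assembled from four bytes via zext/shl/or. If SLP leaves this scalar,
  // SDAG's load combining turns each element into a byte-swapped load and
  // store (the movbe sequence above).
  void be32_copy(const unsigned char *src, unsigned int *dst) {
    for (int i = 0; i < 4; ++i)
      dst[i] = ((unsigned int)src[4 * i + 0] << 24) |
               ((unsigned int)src[4 * i + 1] << 16) |
               ((unsigned int)src[4 * i + 2] << 8) |
               ((unsigned int)src[4 * i + 3]);
  }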
Maybe make it part of the isTreeTinyAndNotFullyVectorizable() function?