This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into isLoadBitCastBeneficial/isStoreBitCastBeneficial to allow X86 to bypass it
Closed · Public

Authored by craig.topper on Jul 6 2019, 11:45 PM.

Details

Reviewers

RKSimon
spatel
arsenm

Commits

rG84a1f0736340: [X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into…
rL365549: [X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into…

Summary

On certain CPUs, X86 doesn't set the Fast flag from allowsMemoryAccess
because of the slow unaligned memory subtarget features. This prevents
bitcasts from being folded into loads and stores, even though all vector
loads and stores of the same width cost the same on X86.

This patch moves the allowsMemoryAccess call into isLoadBitCastBeneficial/isStoreBitCastBeneficial so that X86 can skip it.
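
As a rough illustration of the idea in the summary, here is a self-contained toy model, not the real LLVM classes. `MemAccess`, `ToyTargetLowering`, `ToyX86Lowering`, and the `BothTypesLegalVectors` flag are hypothetical names invented for this sketch, and the rule that only naturally aligned accesses are "fast" is an assumption standing in for a slow-unaligned subtarget.

```cpp
#include <cassert>

// Hypothetical stand-in for a memory operand; not an LLVM type.
struct MemAccess {
  unsigned SizeInBytes; // width of the access
  unsigned Alignment;   // known alignment in bytes
};

struct ToyTargetLowering {
  virtual ~ToyTargetLowering() = default;

  // Toy version of an alignment query: the access is always permitted, but
  // Fast is set only when the access is naturally aligned (assumption).
  virtual bool allowsMemoryAccess(const MemAccess &MA, bool *Fast) const {
    if (Fast)
      *Fast = MA.Alignment >= MA.SizeInBytes;
    return true;
  }

  // Default hook: the hook itself performs the alignment query and only
  // reports the fold as beneficial when the access is allowed and fast.
  virtual bool isLoadBitCastBeneficial(bool BothTypesLegalVectors,
                                       const MemAccess &MA) const {
    (void)BothTypesLegalVectors; // unused by the default policy
    bool Fast = false;
    return allowsMemoryAccess(MA, &Fast) && Fast;
  }
};

struct ToyX86Lowering : ToyTargetLowering {
  // X86-style override: for two legal vector types the alignment query is
  // skipped entirely, since both accesses are assumed to cost the same.
  bool isLoadBitCastBeneficial(bool BothTypesLegalVectors,
                               const MemAccess &MA) const override {
    if (BothTypesLegalVectors)
      return true;
    return ToyTargetLowering::isLoadBitCastBeneficial(BothTypesLegalVectors,
                                                      MA);
  }
};
```

With a 16-byte access at alignment 1, the default hook in this model rejects the fold while the X86-style override accepts it when both types are legal vectors. Because the hook owns the query, the caller needs no extra flag to learn whether alignment was checked.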

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 34467
Build 34466: arc lint + arc unit

Event Timeline

craig.topper created this revision. · Jul 6 2019, 11:45 PM

Herald added a project: Restricted Project. · Jul 6 2019, 11:45 PM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others.

Harbormaster completed remote builds in B34467: Diff 208290. · Jul 6 2019, 11:47 PM

arsenm added inline comments. · Jul 7 2019, 3:17 PM

llvm/include/llvm/CodeGen/TargetLowering.h
405	It isn't clear to me from the name what this does, so it needs a better name. It also needs a documentation comment.

craig.topper marked an inline comment as done. · Jul 7 2019, 3:45 PM

craig.topper added inline comments.

llvm/include/llvm/CodeGen/TargetLowering.h
405	Do you have a better suggestion for a name? 'CallAllowsMemoryAccess'? Or maybe I should just make isLoadBitCastBeneficial call allowsMemoryAccess itself instead of DAGCombiner? Then X86 can control the behavior directly.

arsenm added inline comments. · Jul 8 2019, 9:38 AM

llvm/include/llvm/CodeGen/TargetLowering.h
405	Calling allowsMemoryAccess directly would probably be better

Move the allowsMemoryAccess call into isLoadBitCastBeneficial

Harbormaster completed remote builds in B34563: Diff 208595. · Jul 8 2019, 10:06 PM

craig.topper retitled this revision from [X86][AMDGPU] Add an out parameter to isLoadBitCastBeneficial/isStoreBitCastBeneficial to indicate we shouldn't bother checking the alignment. to [X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into isLoadBitCastBeneficial/isStoreBitCastBeneficial to allow X86 to bypass it. · Jul 8 2019, 10:06 PM

craig.topper edited the summary of this revision.

arsenm accepted this revision. · Jul 9 2019, 7:19 AM

This revision is now accepted and ready to land. · Jul 9 2019, 7:19 AM

Closed by commit rL365549: [X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into… (authored by ctopper). · Jul 9 2019, 12:56 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/TargetLowering.h	11 lines
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	23 lines
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h	2 lines
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp	5 lines
llvm/lib/Target/X86/X86ISelLowering.h	3 lines
llvm/lib/Target/X86/X86ISelLowering.cpp	13 lines
llvm/test/CodeGen/X86/merge-consecutive-stores-nt.ll	24 lines
llvm/test/CodeGen/X86/vector-shuffle-128-v4.ll	12 lines

Diff 208290

llvm/include/llvm/CodeGen/TargetLowering.h

public:
virtual BranchProbability getPredictableBranchThreshold() const;		virtual BranchProbability getPredictableBranchThreshold() const;

/// Return true if the following transform is beneficial:		/// Return true if the following transform is beneficial:
/// fold (conv (load x)) -> (load (conv*)x)		/// fold (conv (load x)) -> (load (conv*)x)
/// On architectures that don't natively support some vector loads		/// On architectures that don't natively support some vector loads
/// efficiently, casting the load to a smaller vector of larger types and		/// efficiently, casting the load to a smaller vector of larger types and
/// loading is more efficient, however, this can be undone by optimizations in		/// loading is more efficient, however, this can be undone by optimizations in
/// dag combiner.		/// dag combiner.
virtual bool isLoadBitCastBeneficial(EVT LoadVT,		virtual bool isLoadBitCastBeneficial(EVT LoadVT, EVT BitcastVT,
EVT BitcastVT) const {		bool &CheckAlignment) const {
		CheckAlignment = true;

// Don't do if we could do an indexed load on the original type, but not on		// Don't do if we could do an indexed load on the original type, but not on
// the new one.		// the new one.
if (!LoadVT.isSimple() || !BitcastVT.isSimple())		if (!LoadVT.isSimple() || !BitcastVT.isSimple())
return true;		return true;

MVT LoadMVT = LoadVT.getSimpleVT();		MVT LoadMVT = LoadVT.getSimpleVT();

// Don't bother doing this if it's just going to be promoted again later, as		// Don't bother doing this if it's just going to be promoted again later, as
// doing so might interfere with other combines.		// doing so might interfere with other combines.
if (getOperationAction(ISD::LOAD, LoadMVT) == Promote &&		if (getOperationAction(ISD::LOAD, LoadMVT) == Promote &&
getTypeToPromoteTo(ISD::LOAD, LoadMVT) == BitcastVT.getSimpleVT())		getTypeToPromoteTo(ISD::LOAD, LoadMVT) == BitcastVT.getSimpleVT())
return false;		return false;

return true;		return true;
}		}

/// Return true if the following transform is beneficial:		/// Return true if the following transform is beneficial:
/// (store (y (conv x)), y)) -> (store x, (x))		/// (store (y (conv x)), y)) -> (store x, (x))
virtual bool isStoreBitCastBeneficial(EVT StoreVT, EVT BitcastVT) const {		virtual bool isStoreBitCastBeneficial(EVT StoreVT, EVT BitcastVT,
		bool &CheckAlignment) const {
// Default to the same logic as loads.		// Default to the same logic as loads.
return isLoadBitCastBeneficial(StoreVT, BitcastVT);		return isLoadBitCastBeneficial(StoreVT, BitcastVT, CheckAlignment);
}		}

/// Return true if it is expected to be cheaper to do a store of a non-zero		/// Return true if it is expected to be cheaper to do a store of a non-zero
/// vector constant with the given size and type for the address space than to		/// vector constant with the given size and type for the address space than to
/// store the individual scalar element constants.		/// store the individual scalar element constants.
virtual bool storeOfVectorConstantIsCheap(EVT MemVT,		virtual bool storeOfVectorConstantIsCheap(EVT MemVT,
unsigned NumElem,		unsigned NumElem,
unsigned AddrSpace) const {		unsigned AddrSpace) const {

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

if (ISD::isNormalLoad(N0.getNode()) && N0.hasOneUse() &&
TLI.hasBigEndianPartOrdering(N0.getValueType(), DAG.getDataLayout()) ==		TLI.hasBigEndianPartOrdering(N0.getValueType(), DAG.getDataLayout()) ==
TLI.hasBigEndianPartOrdering(VT, DAG.getDataLayout()) &&		TLI.hasBigEndianPartOrdering(VT, DAG.getDataLayout()) &&
// If the load is volatile, we only want to change the load type if the		// If the load is volatile, we only want to change the load type if the
// resulting load is legal. Otherwise we might increase the number of		// resulting load is legal. Otherwise we might increase the number of
// memory accesses. We don't care if the original type was legal or not		// memory accesses. We don't care if the original type was legal or not
// as we assume software couldn't rely on the number of accesses of an		// as we assume software couldn't rely on the number of accesses of an
// illegal type.		// illegal type.
((!LegalOperations && !cast<LoadSDNode>(N0)->isVolatile()) ||		((!LegalOperations && !cast<LoadSDNode>(N0)->isVolatile()) ||
TLI.isOperationLegal(ISD::LOAD, VT)) &&		TLI.isOperationLegal(ISD::LOAD, VT))) {
TLI.isLoadBitCastBeneficial(N0.getValueType(), VT)) {
LoadSDNode *LN0 = cast<LoadSDNode>(N0);		LoadSDNode *LN0 = cast<LoadSDNode>(N0);

		bool CheckAlignment = true;
bool Fast = false;		bool Fast = false;
if (TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), VT,		if (TLI.isLoadBitCastBeneficial(N0.getValueType(), VT, CheckAlignment) &&
*LN0->getMemOperand(), &Fast) &&		(!CheckAlignment ||
Fast) {		(TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), VT,
		*LN0->getMemOperand(), &Fast) && Fast))) {
SDValue Load =		SDValue Load =
DAG.getLoad(VT, SDLoc(N), LN0->getChain(), LN0->getBasePtr(),		DAG.getLoad(VT, SDLoc(N), LN0->getChain(), LN0->getBasePtr(),
LN0->getPointerInfo(), LN0->getAlignment(),		LN0->getPointerInfo(), LN0->getAlignment(),
LN0->getMemOperand()->getFlags(), LN0->getAAInfo());		LN0->getMemOperand()->getFlags(), LN0->getAAInfo());
DAG.ReplaceAllUsesOfValueWith(N0.getValue(1), Load.getValue(1));		DAG.ReplaceAllUsesOfValueWith(N0.getValue(1), Load.getValue(1));
return Load;		return Load;
}		}
}		}

if (Value.getOpcode() == ISD::BITCAST && !ST->isTruncatingStore() &&
ST->isUnindexed()) {		ST->isUnindexed()) {
EVT SVT = Value.getOperand(0).getValueType();		EVT SVT = Value.getOperand(0).getValueType();
// If the store is volatile, we only want to change the store type if the		// If the store is volatile, we only want to change the store type if the
// resulting store is legal. Otherwise we might increase the number of		// resulting store is legal. Otherwise we might increase the number of
// memory accesses. We don't care if the original type was legal or not		// memory accesses. We don't care if the original type was legal or not
// as we assume software couldn't rely on the number of accesses of an		// as we assume software couldn't rely on the number of accesses of an
// illegal type.		// illegal type.
if (((!LegalOperations && !ST->isVolatile()) ||		if (((!LegalOperations && !ST->isVolatile()) ||
TLI.isOperationLegal(ISD::STORE, SVT)) &&		TLI.isOperationLegal(ISD::STORE, SVT))) {
TLI.isStoreBitCastBeneficial(Value.getValueType(), SVT)) {		bool CheckAlignment = true;
bool Fast = false;		bool Fast = false;
if (TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), SVT,		if (TLI.isStoreBitCastBeneficial(Value.getValueType(), SVT,
*ST->getMemOperand(), &Fast) &&		CheckAlignment) &&
Fast) {		(!CheckAlignment ||
		(TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), SVT,
		*ST->getMemOperand(), &Fast) && Fast))) {
return DAG.getStore(Chain, SDLoc(N), Value.getOperand(0), Ptr,		return DAG.getStore(Chain, SDLoc(N), Value.getOperand(0), Ptr,
ST->getPointerInfo(), ST->getAlignment(),		ST->getPointerInfo(), ST->getAlignment(),
ST->getMemOperand()->getFlags(), ST->getAAInfo());		ST->getMemOperand()->getFlags(), ST->getAAInfo());
}		}
}		}
}		}

// Turn 'store undef, Ptr' -> nothing.		// Turn 'store undef, Ptr' -> nothing.
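
In Diff 208290 the load and store folds in the right-hand column share the same condition shape. As a minimal sketch, that shape can be written as a pure function of four booleans. `shouldFoldBitcast` is a hypothetical helper named only for this illustration. Unlike the real code, which short-circuits and never calls allowsMemoryAccess when CheckAlignment is false, this sketch takes the query results as plain values.

```cpp
#include <cassert>

// Hypothetical helper mirroring the combined condition in Diff 208290.
//   Beneficial     - return value of the is{Load,Store}BitCastBeneficial hook
//   CheckAlignment - out parameter set by the hook
//   Allowed, Fast  - results of the alignment query
inline bool shouldFoldBitcast(bool Beneficial, bool CheckAlignment,
                              bool Allowed, bool Fast) {
  return Beneficial && (!CheckAlignment || (Allowed && Fast));
}
```

When the hook clears CheckAlignment, the fold goes ahead even if the access would not be reported as fast. That is the escape hatch the X86 override relies on in this version of the patch.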

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h

public:

bool isFPImmLegal(const APFloat &Imm, EVT VT,		bool isFPImmLegal(const APFloat &Imm, EVT VT,
bool ForCodeSize) const override;		bool ForCodeSize) const override;
bool ShouldShrinkFPConstant(EVT VT) const override;		bool ShouldShrinkFPConstant(EVT VT) const override;
bool shouldReduceLoadWidth(SDNode *Load,		bool shouldReduceLoadWidth(SDNode *Load,
ISD::LoadExtType ExtType,		ISD::LoadExtType ExtType,
EVT ExtVT) const override;		EVT ExtVT) const override;

bool isLoadBitCastBeneficial(EVT, EVT) const final;		bool isLoadBitCastBeneficial(EVT, EVT, bool &CheckAlignment) const final;

bool storeOfVectorConstantIsCheap(EVT MemVT,		bool storeOfVectorConstantIsCheap(EVT MemVT,
unsigned NumElem,		unsigned NumElem,
unsigned AS) const override;		unsigned AS) const override;
bool aggressivelyPreferBuildVectorSources(EVT VecVT) const override;		bool aggressivelyPreferBuildVectorSources(EVT VecVT) const override;
bool isCheapToSpeculateCttz() const override;		bool isCheapToSpeculateCttz() const override;
bool isCheapToSpeculateCtlz() const override;		bool isCheapToSpeculateCtlz() const override;


llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

bool AMDGPUTargetLowering::shouldReduceLoadWidth(SDNode *N,
// still couldn't use a scalar load, using the wider load shouldn't really		// still couldn't use a scalar load, using the wider load shouldn't really
// hurt anything.		// hurt anything.

// If the old size already had to be an extload, there's no harm in continuing		// If the old size already had to be an extload, there's no harm in continuing
// to reduce the width.		// to reduce the width.
return (OldSize < 32);		return (OldSize < 32);
}		}

bool AMDGPUTargetLowering::isLoadBitCastBeneficial(EVT LoadTy,		bool AMDGPUTargetLowering::isLoadBitCastBeneficial(EVT LoadTy, EVT CastTy,
EVT CastTy) const {		bool &CheckAlignment) const {
		CheckAlignment = true;

assert(LoadTy.getSizeInBits() == CastTy.getSizeInBits());		assert(LoadTy.getSizeInBits() == CastTy.getSizeInBits());

if (LoadTy.getScalarType() == MVT::i32)		if (LoadTy.getScalarType() == MVT::i32)
return false;		return false;

unsigned LScalarSize = LoadTy.getScalarSizeInBits();		unsigned LScalarSize = LoadTy.getScalarSizeInBits();
unsigned CastScalarSize = CastTy.getScalarSizeInBits();		unsigned CastScalarSize = CastTy.getScalarSizeInBits();

llvm/lib/Target/X86/X86ISelLowering.h

public:

bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,		bool storeOfVectorConstantIsCheap(EVT MemVT, unsigned NumElem,
unsigned AddrSpace) const override {		unsigned AddrSpace) const override {
// If we can replace more than 2 scalar stores, there will be a reduction		// If we can replace more than 2 scalar stores, there will be a reduction
// in instructions even after we add a vector constant load.		// in instructions even after we add a vector constant load.
return NumElem > 2;		return NumElem > 2;
}		}

bool isLoadBitCastBeneficial(EVT LoadVT, EVT BitcastVT) const override;		bool isLoadBitCastBeneficial(EVT LoadVT, EVT BitcastVT,
		bool &CheckAlignment) const override;

/// Intel processors have a unified instruction and data cache		/// Intel processors have a unified instruction and data cache
const char * getClearCacheBuiltinName() const override {		const char * getClearCacheBuiltinName() const override {
return nullptr; // nothing to do, move along.		return nullptr; // nothing to do, move along.
}		}

unsigned getRegisterByName(const char* RegName, EVT VT,		unsigned getRegisterByName(const char* RegName, EVT VT,
SelectionDAG &DAG) const override;		SelectionDAG &DAG) const override;

llvm/lib/Target/X86/X86ISelLowering.cpp

bool X86TargetLowering::isCheapToSpeculateCttz() const {
return Subtarget.hasBMI();		return Subtarget.hasBMI();
}		}

bool X86TargetLowering::isCheapToSpeculateCtlz() const {		bool X86TargetLowering::isCheapToSpeculateCtlz() const {
// Speculate ctlz only if we can directly use LZCNT.		// Speculate ctlz only if we can directly use LZCNT.
return Subtarget.hasLZCNT();		return Subtarget.hasLZCNT();
}		}

bool X86TargetLowering::isLoadBitCastBeneficial(EVT LoadVT,		bool X86TargetLowering::isLoadBitCastBeneficial(EVT LoadVT, EVT BitcastVT,
EVT BitcastVT) const {		bool &CheckAlignment) const {
if (!Subtarget.hasAVX512() && !LoadVT.isVector() && BitcastVT.isVector() &&		if (!Subtarget.hasAVX512() && !LoadVT.isVector() && BitcastVT.isVector() &&
BitcastVT.getVectorElementType() == MVT::i1)		BitcastVT.getVectorElementType() == MVT::i1)
return false;		return false;

if (!Subtarget.hasDQI() && BitcastVT == MVT::v8i1 && LoadVT == MVT::i8)		if (!Subtarget.hasDQI() && BitcastVT == MVT::v8i1 && LoadVT == MVT::i8)
return false;		return false;

return TargetLowering::isLoadBitCastBeneficial(LoadVT, BitcastVT);		if (LoadVT.isVector() && BitcastVT.isVector() &&
		isTypeLegal(LoadVT) && isTypeLegal(BitcastVT)) {
		CheckAlignment = false;
		return true;
		}

		return TargetLowering::isLoadBitCastBeneficial(LoadVT, BitcastVT,
		CheckAlignment);
}		}

bool X86TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,		bool X86TargetLowering::canMergeStoresTo(unsigned AddressSpace, EVT MemVT,
const SelectionDAG &DAG) const {		const SelectionDAG &DAG) const {
// Do not merge to float value size (128 bytes) if no implicit		// Do not merge to float value size (128 bytes) if no implicit
// float attribute is set.		// float attribute is set.
bool NoFloat = DAG.getMachineFunction().getFunction().hasFnAttribute(		bool NoFloat = DAG.getMachineFunction().getFunction().hasFnAttribute(
Attribute::NoImplicitFloat);		Attribute::NoImplicitFloat);

llvm/test/CodeGen/X86/merge-consecutive-stores-nt.ll

	; X86-SSE2-LABEL: merge_2_v4f32_align1_ntstore:			; X86-SSE2-LABEL: merge_2_v4f32_align1_ntstore:
	; X86-SSE2: # %bb.0:			; X86-SSE2: # %bb.0:
	; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-SSE2-NEXT: movdqu (%ecx), %xmm0			; X86-SSE2-NEXT: movdqu (%ecx), %xmm0
	; X86-SSE2-NEXT: movdqu 16(%ecx), %xmm1			; X86-SSE2-NEXT: movdqu 16(%ecx), %xmm1
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, (%eax)			; X86-SSE2-NEXT: movntil %ecx, (%eax)
	; X86-SSE2-NEXT: movdqa %xmm0, %xmm2			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,1,2,3]
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1],xmm0[2,3]
	; X86-SSE2-NEXT: movd %xmm2, %ecx			; X86-SSE2-NEXT: movd %xmm2, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 12(%eax)			; X86-SSE2-NEXT: movntil %ecx, 12(%eax)
	; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
	; X86-SSE2-NEXT: movd %xmm2, %ecx			; X86-SSE2-NEXT: movd %xmm2, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 8(%eax)			; X86-SSE2-NEXT: movntil %ecx, 8(%eax)
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 4(%eax)			; X86-SSE2-NEXT: movntil %ecx, 4(%eax)
	; X86-SSE2-NEXT: movd %xmm1, %ecx			; X86-SSE2-NEXT: movd %xmm1, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 16(%eax)			; X86-SSE2-NEXT: movntil %ecx, 16(%eax)
	; X86-SSE2-NEXT: movdqa %xmm1, %xmm0			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[3,1,2,3]
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1],xmm1[2,3]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 28(%eax)			; X86-SSE2-NEXT: movntil %ecx, 28(%eax)
	; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 24(%eax)			; X86-SSE2-NEXT: movntil %ecx, 24(%eax)
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,2,3]
	; X86-SSE2-NEXT: movd %xmm1, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 20(%eax)			; X86-SSE2-NEXT: movntil %ecx, 20(%eax)
	; X86-SSE2-NEXT: retl			; X86-SSE2-NEXT: retl
	;			;
	; X86-SSE4A-LABEL: merge_2_v4f32_align1_ntstore:			; X86-SSE4A-LABEL: merge_2_v4f32_align1_ntstore:
	; X86-SSE4A: # %bb.0:			; X86-SSE4A: # %bb.0:
	; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-SSE4A-NEXT: movups (%ecx), %xmm0			; X86-SSE4A-NEXT: movups (%ecx), %xmm0
	; X86-SSE2-LABEL: merge_2_v4f32_align1:			; X86-SSE2-LABEL: merge_2_v4f32_align1:
	; X86-SSE2: # %bb.0:			; X86-SSE2: # %bb.0:
	; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-SSE2-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-SSE2-NEXT: movdqu (%ecx), %xmm0			; X86-SSE2-NEXT: movdqu (%ecx), %xmm0
	; X86-SSE2-NEXT: movdqu 16(%ecx), %xmm1			; X86-SSE2-NEXT: movdqu 16(%ecx), %xmm1
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, (%eax)			; X86-SSE2-NEXT: movntil %ecx, (%eax)
	; X86-SSE2-NEXT: movdqa %xmm0, %xmm2			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[3,1,2,3]
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1],xmm0[2,3]
	; X86-SSE2-NEXT: movd %xmm2, %ecx			; X86-SSE2-NEXT: movd %xmm2, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 12(%eax)			; X86-SSE2-NEXT: movntil %ecx, 12(%eax)
	; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[2,3,0,1]
	; X86-SSE2-NEXT: movd %xmm2, %ecx			; X86-SSE2-NEXT: movd %xmm2, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 8(%eax)			; X86-SSE2-NEXT: movntil %ecx, 8(%eax)
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,1,2,3]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,1,2,3]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 4(%eax)			; X86-SSE2-NEXT: movntil %ecx, 4(%eax)
	; X86-SSE2-NEXT: movd %xmm1, %ecx			; X86-SSE2-NEXT: movd %xmm1, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 16(%eax)			; X86-SSE2-NEXT: movntil %ecx, 16(%eax)
	; X86-SSE2-NEXT: movdqa %xmm1, %xmm0			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[3,1,2,3]
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[3,1],xmm1[2,3]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 28(%eax)			; X86-SSE2-NEXT: movntil %ecx, 28(%eax)
	; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[2,3,0,1]
	; X86-SSE2-NEXT: movd %xmm0, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 24(%eax)			; X86-SSE2-NEXT: movntil %ecx, 24(%eax)
	; X86-SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,1,2,3]			; X86-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm1[1,1,2,3]
	; X86-SSE2-NEXT: movd %xmm1, %ecx			; X86-SSE2-NEXT: movd %xmm0, %ecx
	; X86-SSE2-NEXT: movntil %ecx, 20(%eax)			; X86-SSE2-NEXT: movntil %ecx, 20(%eax)
	; X86-SSE2-NEXT: retl			; X86-SSE2-NEXT: retl
	;			;
	; X86-SSE4A-LABEL: merge_2_v4f32_align1:			; X86-SSE4A-LABEL: merge_2_v4f32_align1:
	; X86-SSE4A: # %bb.0:			; X86-SSE4A: # %bb.0:
	; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-SSE4A-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-SSE4A-NEXT: movups (%ecx), %xmm0			; X86-SSE4A-NEXT: movups (%ecx), %xmm0

llvm/test/CodeGen/X86/vector-shuffle-128-v4.ll

	; AVX-NEXT: retq			; AVX-NEXT: retq
	%shuffle = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer, <4 x i32> <i32 1, i32 4, i32 3, i32 4>			%shuffle = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer, <4 x i32> <i32 1, i32 4, i32 3, i32 4>
	ret <4 x i32> %shuffle			ret <4 x i32> %shuffle
	}			}

	define <4 x float> @shuffle_mem_v4f32_0145(<4 x float> %a, <4 x float>* %pb) {			define <4 x float> @shuffle_mem_v4f32_0145(<4 x float> %a, <4 x float>* %pb) {
	; SSE-LABEL: shuffle_mem_v4f32_0145:			; SSE-LABEL: shuffle_mem_v4f32_0145:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: movups (%rdi), %xmm1			; SSE-NEXT: movhps {{.*#+}} xmm0 = xmm0[0,1],mem[0,1]
	; SSE-NEXT: movlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: shuffle_mem_v4f32_0145:			; AVX-LABEL: shuffle_mem_v4f32_0145:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],mem[0]			; AVX-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],mem[0]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%b = load <4 x float>, <4 x float>* %pb, align 1			%b = load <4 x float>, <4 x float>* %pb, align 1
	%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 5>			%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
	ret <4 x float> %shuffle			ret <4 x float> %shuffle
	}			}

	define <4 x float> @shuffle_mem_v4f32_4523(<4 x float> %a, <4 x float>* %pb) {			define <4 x float> @shuffle_mem_v4f32_4523(<4 x float> %a, <4 x float>* %pb) {
	; SSE2-LABEL: shuffle_mem_v4f32_4523:			; SSE2-LABEL: shuffle_mem_v4f32_4523:
	; SSE2: # %bb.0:			; SSE2: # %bb.0:
	; SSE2-NEXT: movupd (%rdi), %xmm1			; SSE2-NEXT: movlps {{.*#+}} xmm0 = mem[0,1],xmm0[2,3]
	; SSE2-NEXT: movsd {{.*#+}} xmm0 = xmm1[0],xmm0[1]
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE3-LABEL: shuffle_mem_v4f32_4523:			; SSE3-LABEL: shuffle_mem_v4f32_4523:
	; SSE3: # %bb.0:			; SSE3: # %bb.0:
	; SSE3-NEXT: movupd (%rdi), %xmm1			; SSE3-NEXT: movlps {{.*#+}} xmm0 = mem[0,1],xmm0[2,3]
	; SSE3-NEXT: movsd {{.*#+}} xmm0 = xmm1[0],xmm0[1]
	; SSE3-NEXT: retq			; SSE3-NEXT: retq
	;			;
	; SSSE3-LABEL: shuffle_mem_v4f32_4523:			; SSSE3-LABEL: shuffle_mem_v4f32_4523:
	; SSSE3: # %bb.0:			; SSSE3: # %bb.0:
	; SSSE3-NEXT: movupd (%rdi), %xmm1			; SSSE3-NEXT: movlps {{.*#+}} xmm0 = mem[0,1],xmm0[2,3]
	; SSSE3-NEXT: movsd {{.*#+}} xmm0 = xmm1[0],xmm0[1]
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq
	;			;
	; SSE41-LABEL: shuffle_mem_v4f32_4523:			; SSE41-LABEL: shuffle_mem_v4f32_4523:
	; SSE41: # %bb.0:			; SSE41: # %bb.0:
	; SSE41-NEXT: movups (%rdi), %xmm1			; SSE41-NEXT: movups (%rdi), %xmm1
	; SSE41-NEXT: blendps {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]			; SSE41-NEXT: blendps {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: shuffle_mem_v4f32_4523:			; AVX-LABEL: shuffle_mem_v4f32_4523:
	; AVX: # %bb.0:			; AVX: # %bb.0:
	; AVX-NEXT: vblendps {{.*#+}} xmm0 = mem[0,1],xmm0[2,3]			; AVX-NEXT: vblendps {{.*#+}} xmm0 = mem[0,1],xmm0[2,3]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%b = load <4 x float>, <4 x float>* %pb, align 1			%b = load <4 x float>, <4 x float>* %pb, align 1
	%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 4, i32 5, i32 2, i32 3>			%shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
	ret <4 x float> %shuffle			ret <4 x float> %shuffle
	}			}
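
To make the shuffle masks in these tests easier to read, here is a small self-contained model of what a two-input, four-element shufflevector computes. `V4` and `shuffle` are hypothetical names for this sketch only, and it models the element selection, not the instructions chosen by the backend.

```cpp
#include <array>
#include <cassert>

using V4 = std::array<float, 4>;

// Models a 4-element shufflevector of two inputs: mask indices 0-3 select
// from A, mask indices 4-7 select from B.
inline V4 shuffle(const V4 &A, const V4 &B, const std::array<int, 4> &Mask) {
  V4 R{};
  for (int I = 0; I < 4; ++I)
    R[I] = Mask[I] < 4 ? A[Mask[I]] : B[Mask[I] - 4];
  return R;
}
```

Mask <0, 1, 4, 5> keeps the low half of %a and places the low half of %b in the high half, which matches the `xmm0[0,1],mem[0,1]` operand shown for movhps. Mask <4, 5, 2, 3> does the opposite and matches `mem[0,1],xmm0[2,3]` for movlps. In both cases only the low two elements of %b are needed, so the new output can take them straight from memory instead of first loading the whole vector with movups/movupd.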
