This is an archive of the discontinued LLVM Phabricator instance.

[x86] Add a feature flag for slow 32-byte unaligned memory accesses.
ClosedPublic

Authored by spatel on Nov 21 2014, 7:53 AM.

Details

Reviewers

qcolombet
nadav
andreadb
hfinkel

Commits

rG501890e909c2: Add a feature flag for slow 32-byte unaligned memory accesses [x86].
rL222544: Add a feature flag for slow 32-byte unaligned memory accesses [x86].

Summary

This patch adds a feature flag to avoid unaligned 32-byte load/store AVX codegen for Sandy Bridge and Ivy Bridge. There is no functionality change intended for those chips. Previously, the absence of AVX2 was being used as a proxy to detect this feature. But that hindered codegen for AVX-enabled AMD chips such as btver2 that do not have the 32-byte unaligned access slowdown.

More detailed performance measurements are included in PR21541 ( http://llvm.org/bugs/show_bug.cgi?id=21541 ).
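
The difference between the old proxy and the new flag can be sketched with a toy model. This is only an illustration under stated assumptions: `ToySubtarget`, `splitsUnaligned32Old`, and `splitsUnaligned32New` are hypothetical names that do not exist in LLVM, and the per-CPU values simply restate what the summary says about each chip.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for X86Subtarget. Only the two
// properties discussed in the summary are modeled.
struct ToySubtarget {
  bool HasAVX2;       // stands in for Subtarget->hasInt256()
  bool IsUAMem32Slow; // stands in for the new FeatureSlowUAMem32 flag
};

// Old behavior: the absence of AVX2 was used as a proxy for the slowdown.
inline bool splitsUnaligned32Old(const ToySubtarget &ST) {
  return !ST.HasAVX2;
}

// New behavior: consult the dedicated feature flag.
inline bool splitsUnaligned32New(const ToySubtarget &ST) {
  return ST.IsUAMem32Slow;
}

// Toy CPU descriptions; the flag values follow the patch summary.
const ToySubtarget SandyBridge{false, true}; // corei7-avx
const ToySubtarget IvyBridge{false, true};   // core-avx-i
const ToySubtarget Btver2{false, false};     // AVX without AVX2, no slowdown
const ToySubtarget Haswell{true, false};     // core-avx2
```

In this model, Sandy Bridge and Ivy Bridge get the same answer from both predicates, which matches the "no functionality change" claim, while btver2 is split by the old proxy but not by the new flag.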

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 16492. Nov 21 2014, 7:53 AM

spatel retitled this revision to [x86] Add a feature flag for slow 32-byte unaligned memory accesses.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: qcolombet, hfinkel, andreadb, nadav.

spatel added a subscriber: Unknown Object (MLST).

Hi Sanjay,

The patch looks good to me (I left a minor comment on the test).

Thanks!
Andrea

test/CodeGen/X86/unaligned-32-byte-memops.ll
Lines 1–3 (On Diff #16492): Can you also add a RUN line for testing Ivy Bridge (core-avx-i)? Feature flag 'FeatureSlowUAMem32' is also added to Ivy Bridge, so I think you should test it (I guess you could reuse the same SANDYB checks).

This revision is now accepted and ready to land. Nov 21 2014, 8:39 AM

Closed by commit rL222544 (authored by @spatel).

Thanks, Andrea! Yes, I agree that we should explicitly check Ivy Bridge too. Added that run line and committed with r222544.

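Revision Contents

Path                                                       Size
llvm/trunk/lib/Target/X86/X86.td                           11 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp              13 lines
llvm/trunk/lib/Target/X86/X86Subtarget.h                   4 lines
llvm/trunk/lib/Target/X86/X86Subtarget.cpp                 1 line
llvm/trunk/test/CodeGen/X86/2012-05-19-avx2-store.ll       13 lines
llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll    46 lines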

Diff 16497

llvm/trunk/lib/Target/X86/X86.td

@@ 76 lines not shown @@ def FeatureCMPXCHG16B : SubtargetFeature<"cx16", "HasCmpxchg16b", "true",
                                          [Feature64Bit]>;
 def FeatureSlowBTMem : SubtargetFeature<"slow-bt-mem", "IsBTMemSlow", "true",
                                         "Bit testing of memory is slow">;
 def FeatureSlowSHLD : SubtargetFeature<"slow-shld", "IsSHLDSlow", "true",
                                        "SHLD instruction is slow">;
 def FeatureFastUAMem : SubtargetFeature<"fast-unaligned-mem",
                                         "IsUAMemFast", "true",
                                         "Fast unaligned memory access">;
+def FeatureSlowUAMem32 : SubtargetFeature<"slow-unaligned-mem-32",
+                                          "IsUAMem32Slow", "true",
+                                          "Slow unaligned 32-byte memory access">;
 def FeatureSSE4A : SubtargetFeature<"sse4a", "HasSSE4A", "true",
                                     "Support SSE 4a instructions",
                                     [FeatureSSE3]>;

 def FeatureAVX : SubtargetFeature<"avx", "X86SSELevel", "AVX",
                                   "Enable AVX instructions",
                                   [FeatureSSE42]>;
 def FeatureAVX2 : SubtargetFeature<"avx2", "X86SSELevel", "AVX2",
@@ 173 lines not shown @@ def : ProcessorModel<"westmere", SandyBridgeModel,
                      [FeatureSSE42, FeatureCMPXCHG16B, FeatureSlowBTMem,
                       FeatureFastUAMem, FeaturePOPCNT, FeatureAES,
                       FeaturePCLMUL]>;
 // Sandy Bridge
 // SSE is not listed here since llvm treats AVX as a reimplementation of SSE,
 // rather than a superset.
 def : ProcessorModel<"corei7-avx", SandyBridgeModel,
                      [FeatureAVX, FeatureCMPXCHG16B, FeatureFastUAMem,
-                      FeaturePOPCNT, FeatureAES, FeaturePCLMUL]>;
+                      FeatureSlowUAMem32, FeaturePOPCNT, FeatureAES,
+                      FeaturePCLMUL]>;
 // Ivy Bridge
 def : ProcessorModel<"core-avx-i", SandyBridgeModel,
                      [FeatureAVX, FeatureCMPXCHG16B, FeatureFastUAMem,
-                      FeaturePOPCNT, FeatureAES, FeaturePCLMUL, FeatureRDRAND,
-                      FeatureF16C, FeatureFSGSBase]>;
+                      FeatureSlowUAMem32, FeaturePOPCNT, FeatureAES,
+                      FeaturePCLMUL, FeatureRDRAND, FeatureF16C,
+                      FeatureFSGSBase]>;

 // Haswell
 def : ProcessorModel<"core-avx2", HaswellModel,
                      [FeatureAVX2, FeatureCMPXCHG16B, FeatureFastUAMem,
                       FeaturePOPCNT, FeatureAES, FeaturePCLMUL, FeatureRDRAND,
                       FeatureF16C, FeatureFSGSBase, FeatureMOVBE, FeatureLZCNT,
                       FeatureBMI, FeatureBMI2, FeatureFMA, FeatureRTM,
                       FeatureHLE, FeatureSlowIncDec]>;
@@ 200 lines not shown @@

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ 24,370 lines not shown @@ static SDValue PerformLOADCombine(SDNode *N, SelectionDAG &DAG,
                                   TargetLowering::DAGCombinerInfo &DCI,
                                   const X86Subtarget *Subtarget) {
   LoadSDNode *Ld = cast<LoadSDNode>(N);
   EVT RegVT = Ld->getValueType(0);
   EVT MemVT = Ld->getMemoryVT();
   SDLoc dl(Ld);
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();

-  // On Sandybridge unaligned 256bit loads are inefficient.
+  // For chips with slow 32-byte unaligned loads, break the 32-byte operation
+  // into two 16-byte operations.
   ISD::LoadExtType Ext = Ld->getExtensionType();
   unsigned Alignment = Ld->getAlignment();
   bool IsAligned = Alignment == 0 || Alignment >= MemVT.getSizeInBits()/8;
-  if (RegVT.is256BitVector() && !Subtarget->hasInt256() &&
+  if (RegVT.is256BitVector() && Subtarget->isUnalignedMem32Slow() &&
       !DCI.isBeforeLegalizeOps() && !IsAligned && Ext == ISD::NON_EXTLOAD) {
     unsigned NumElems = RegVT.getVectorNumElements();
     if (NumElems < 2)
       return SDValue();

     SDValue Ptr = Ld->getBasePtr();
     SDValue Increment = DAG.getConstant(16, TLI.getPointerTy());

@@ 26 lines not shown @@ static SDValue PerformSTORECombine(SDNode *N, SelectionDAG &DAG,
                                    const X86Subtarget *Subtarget) {
   StoreSDNode *St = cast<StoreSDNode>(N);
   EVT VT = St->getValue().getValueType();
   EVT StVT = St->getMemoryVT();
   SDLoc dl(St);
   SDValue StoredVal = St->getOperand(1);
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();

-  // If we are saving a concatenation of two XMM registers, perform two stores.
-  // On Sandy Bridge, 256-bit memory operations are executed by two
-  // 128-bit ports. However, on Haswell it is better to issue a single 256-bit
-  // memory operation.
+  // If we are saving a concatenation of two XMM registers and 32-byte stores
+  // are slow, such as on Sandy Bridge, perform two 16-byte stores.
   unsigned Alignment = St->getAlignment();
   bool IsAligned = Alignment == 0 || Alignment >= VT.getSizeInBits()/8;
-  if (VT.is256BitVector() && !Subtarget->hasInt256() &&
+  if (VT.is256BitVector() && Subtarget->isUnalignedMem32Slow() &&
       StVT == VT && !IsAligned) {
     unsigned NumElems = VT.getVectorNumElements();
     if (NumElems < 2)
       return SDValue();

     SDValue Value0 = Extract128BitVector(StoredVal, 0, DAG, dl);
     SDValue Value1 = Extract128BitVector(StoredVal, NumElems/2, DAG, dl);

@@ 1,909 lines not shown @@
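
The guard shared by both combines can be modeled in isolation. The following is a minimal sketch, not LLVM code: `isAccessAligned` and `shouldSplit32ByteAccess` are hypothetical helpers, and the extra conditions in the real combines (legalization phase, extension type, `StVT == VT`, element count) are deliberately left out.

```cpp
#include <cassert>

// Mirrors the IsAligned computation in the diff: an alignment of 0 counts as
// aligned; otherwise the alignment must be at least the access size in bytes.
inline bool isAccessAligned(unsigned Alignment, unsigned SizeInBits) {
  return Alignment == 0 || Alignment >= SizeInBits / 8;
}

// Hypothetical helper: true when a 256-bit access would be broken into two
// 16-byte operations under the new flag.
inline bool shouldSplit32ByteAccess(unsigned SizeInBits, unsigned Alignment,
                                    bool IsUAMem32Slow) {
  return SizeInBits == 256 && IsUAMem32Slow &&
         !isAccessAligned(Alignment, SizeInBits);
}
```

For the `align 16` accesses used in the new test, this model splits the operation only when the slow flag is set; a 32-byte-aligned access is never split.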

llvm/trunk/lib/Target/X86/X86Subtarget.h

@@ 153 lines not shown @@ protected:
   bool IsBTMemSlow;

   /// IsSHLDSlow - True if SHLD instructions are slow.
   bool IsSHLDSlow;

   /// IsUAMemFast - True if unaligned memory access is fast.
   bool IsUAMemFast;

+  /// True if unaligned 32-byte memory accesses are slow.
+  bool IsUAMem32Slow;
+
   /// HasVectorUAMem - True if SIMD operations can have unaligned memory
   /// operands. This may require setting a feature bit in the processor.
   bool HasVectorUAMem;

   /// HasCmpxchg16b - True if this processor has the CMPXCHG16B instruction;
   /// this is true for most x86-64 chips, but not the first AMD chips.
   bool HasCmpxchg16b;

@@ 199 lines not shown @@ public:
   bool hasSHA() const { return HasSHA; }
   bool hasSGX() const { return HasSGX; }
   bool hasPRFCHW() const { return HasPRFCHW; }
   bool hasRDSEED() const { return HasRDSEED; }
   bool hasSMAP() const { return HasSMAP; }
   bool isBTMemSlow() const { return IsBTMemSlow; }
   bool isSHLDSlow() const { return IsSHLDSlow; }
   bool isUnalignedMemAccessFast() const { return IsUAMemFast; }
+  bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }
   bool hasVectorUAMem() const { return HasVectorUAMem; }
   bool hasCmpxchg16b() const { return HasCmpxchg16b; }
   bool useLeaForSP() const { return UseLeaForSP; }
   bool hasSlowDivide32() const { return HasSlowDivide32; }
   bool hasSlowDivide64() const { return HasSlowDivide64; }
   bool padShortFunctions() const { return PadShortFunctions; }
   bool callRegIndirect() const { return CallRegIndirect; }
   bool LEAusesAG() const { return LEAUsesAG; }
@@ 131 lines not shown @@

llvm/trunk/lib/Target/X86/X86Subtarget.cpp

@@ 258 lines not shown @@ void X86Subtarget::initializeEnvironment() {
   HasSHA = false;
   HasSGX = false;
   HasPRFCHW = false;
   HasRDSEED = false;
   HasSMAP = false;
   IsBTMemSlow = false;
   IsSHLDSlow = false;
   IsUAMemFast = false;
+  IsUAMem32Slow = false;
   HasVectorUAMem = false;
   HasCmpxchg16b = false;
   UseLeaForSP = false;
   HasSlowDivide32 = false;
   HasSlowDivide64 = false;
   PadShortFunctions = false;
   CallRegIndirect = false;
   LEAUsesAG = false;
@@ 96 lines not shown @@
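
How the pieces fit together (flag defaulted to false, then switched on by a processor's feature list) can be illustrated with a small stand-alone model. This is a sketch under assumptions: `ToyX86Subtarget`, `applyFeatures`, and `makeToySubtarget` are hypothetical, and the hand-written string matching only stands in for the feature handling that LLVM derives from the `SubtargetFeature` definitions in X86.td.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical miniature of the subtarget state; this is not LLVM's class.
struct ToyX86Subtarget {
  bool IsUAMemFast;
  bool IsUAMem32Slow;

  // Like initializeEnvironment() in the diff: every flag starts out false.
  void initializeEnvironment() {
    IsUAMemFast = false;
    IsUAMem32Slow = false;
  }

  // Illustrative stand-in for feature parsing: each recognized feature
  // string sets its associated field to true; unknown strings are ignored.
  void applyFeatures(const std::vector<std::string> &Features) {
    for (const std::string &F : Features) {
      if (F == "fast-unaligned-mem")
        IsUAMemFast = true;
      else if (F == "slow-unaligned-mem-32")
        IsUAMem32Slow = true;
    }
  }
};

// Builds a toy subtarget from a feature list, initializing defaults first.
inline ToyX86Subtarget
makeToySubtarget(const std::vector<std::string> &Features) {
  ToyX86Subtarget ST;
  ST.initializeEnvironment();
  ST.applyFeatures(Features);
  return ST;
}
```

A CPU whose feature list omits "slow-unaligned-mem-32" keeps the default of false, which is why chips such as btver2 and Haswell are unaffected by the new flag.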

llvm/trunk/test/CodeGen/X86/2012-05-19-avx2-store.ll

; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7-avx -mattr=+avx2 | FileCheck %s

define void @double_save(<4 x i32>* %Ap, <4 x i32>* %Bp, <8 x i32>* %P) nounwind ssp {
entry:
  ; CHECK: vmovaps
  ; CHECK: vinsertf128 $1, ([[A0:%rdi|%rsi]]),
  ; CHECK: vmovups
  %A = load <4 x i32>* %Ap
  %B = load <4 x i32>* %Bp
  %Z = shufflevector <4 x i32>%A, <4 x i32>%B, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  store <8 x i32> %Z, <8 x i32>* %P, align 16
  ret void
}

llvm/trunk/test/CodeGen/X86/unaligned-32-byte-memops.ll

; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7-avx | FileCheck %s --check-prefix=SANDYB
; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=core-avx-i | FileCheck %s --check-prefix=SANDYB
; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=btver2 | FileCheck %s --check-prefix=BTVER2
; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=core-avx2 | FileCheck %s --check-prefix=HASWELL

; On Sandy Bridge or Ivy Bridge, we should not generate an unaligned 32-byte load
; because that is slower than two 16-byte loads.
; Other AVX-capable chips don't have that problem.

define <8 x float> @load32bytes(<8 x float>* %Ap) {
  ; CHECK-LABEL: load32bytes

  ; SANDYB: vmovaps
  ; SANDYB: vinsertf128
  ; SANDYB: retq

  ; BTVER2: vmovups
  ; BTVER2: retq

  ; HASWELL: vmovups
  ; HASWELL: retq

  %A = load <8 x float>* %Ap, align 16
  ret <8 x float> %A
}

; On Sandy Bridge or Ivy Bridge, we should not generate an unaligned 32-byte store
; because that is slower than two 16-byte stores.
; Other AVX-capable chips don't have that problem.

define void @store32bytes(<8 x float> %A, <8 x float>* %P) {
  ; CHECK-LABEL: store32bytes

  ; SANDYB: vextractf128
  ; SANDYB: vmovaps
  ; SANDYB: retq

  ; BTVER2: vmovups
  ; BTVER2: retq

  ; HASWELL: vmovups
  ; HASWELL: retq

  store <8 x float> %A, <8 x float>* %P, align 16
  ret void
}