This is a 'no functional change intended' patch. It removes one FIXME, but it serves as a delivery mechanism for several more. :)
Motivation: we have a FeatureFastUAMem attribute that may be too general. It is used to determine whether a misaligned memory access of any size under 32 bytes is 'fast'. At some point around Nehalem for Intel and Bobcat for AMD, all scalar and SSE unaligned accesses apparently became fast enough that we can happily use them whenever we want. From the added FIXME comments, however, you can see that we're not consistent about this. Changing the name of the attribute makes the logic holes easier to see, IMO.
Further motivation: this is a preliminary step for PR24449 ( https://llvm.org/bugs/show_bug.cgi?id=24449 ). I'm hoping to answer a few questions about this seemingly simple test case:
#include <string.h>
void foo(char *x) { memset(x, 0, 32); }
Both of these:
$ clang -O2 memset.c -S -o -
$ clang -O2 -mavx memset.c -S -o -
Produce:
movq $0, 24(%rdi)
movq $0, 16(%rdi)
movq $0, 8(%rdi)
movq $0, (%rdi)
- Is it ok to generate misaligned 8-byte stores by default?
- Is it better to generate misaligned 16-byte SSE stores for the default case? (The default CPU is Core2/Merom.)
- Is it better to generate a misaligned 32-byte AVX store for the AVX case?
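For reference, here is roughly what the second and third alternatives could look like. These are hand-written AT&T-syntax sketches to illustrate the questions, not actual clang output; register choices and scheduling are assumptions:

Misaligned 16-byte SSE stores:

xorps   %xmm0, %xmm0          # zero a 16-byte register
movups  %xmm0, 16(%rdi)       # unaligned 16-byte store to x+16
movups  %xmm0, (%rdi)         # unaligned 16-byte store to x

Misaligned 32-byte AVX store:

vxorps  %xmm0, %xmm0, %xmm0   # zero xmm0 (upper bits of ymm0 are zeroed too)
vmovups %ymm0, (%rdi)         # single unaligned 32-byte store to x
vzeroupper                    # avoid AVX/SSE transition penalties on return

Fewer instructions and less code size in both cases, but whether that's a win depends on how the target handles the misaligned (and possibly cache-line-crossing) accesses - which is exactly what this attribute is supposed to tell us.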
You can drop FeatureSlowUAMem for BD targets - the AMD 15h SOG confirms that unaligned accesses should perform the same as aligned accesses when the address happens to be aligned, and cost only +1cy when it is actually misaligned. Cache-line-crossing accesses might be more complex, but most targets will suffer there, not just BD.