This is an archive of the discontinued LLVM Phabricator instance.

[x86] invert logic for attribute 'FeatureFastUAMem'
Closed, Public

Authored by spatel on Aug 19 2015, 9:14 AM.

Details

Summary

This is a 'no functional change intended' patch. It removes one FIXME, but it serves as a delivery mechanism for several more. :)

Motivation: we have a FeatureFastUAMem attribute that may be too general. It is used to determine whether a misaligned memory access of any size under 32 bytes is 'fast'. At some point around Nehalem for Intel and Bobcat for AMD, all scalar and SSE unaligned accesses apparently became fast enough that we can happily use them whenever we want. From the added FIXME comments, however, you can see that we're not consistent about this. Changing the name of the attribute makes the logic holes easier to see, IMO.

Further motivation: this is a preliminary step for PR24449 ( https://llvm.org/bugs/show_bug.cgi?id=24449 ). I'm hoping to answer a few questions about this seemingly simple test case:

#include <string.h>

void foo(char *x) {
  memset(x, 0, 32);
}

Both of these:
$ clang -O2 memset.c -S -o -
$ clang -O2 -mavx memset.c -S -o -

Produce:

movq	$0, 24(%rdi)
movq	$0, 16(%rdi)
movq	$0, 8(%rdi)
movq	$0, (%rdi)
  1. Is it ok to generate misaligned 8-byte stores by default?
  2. Is it better to generate misaligned 16-byte SSE stores for the default case? (The default CPU is Core2/Merom.)
  3. Is it better to generate a misaligned 32-byte AVX store for the AVX case?
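For reference, the 16-byte SSE and 32-byte AVX stores that questions 2 and 3 ask about would look roughly like this at the source level, written with standard Intel intrinsics (the zero32_sse/zero32_avx wrappers are only for illustration and are not part of the patch):

#include <immintrin.h>

/* Question 2: zero 32 bytes with two unaligned 16-byte SSE stores (movups). */
void zero32_sse(char *x) {
  __m128 z = _mm_setzero_ps();
  _mm_storeu_ps((float *)x, z);
  _mm_storeu_ps((float *)(x + 16), z);
}

/* Question 3: zero 32 bytes with a single unaligned 32-byte AVX store
   (vmovups); requires -mavx. */
void zero32_avx(char *x) {
  _mm256_storeu_ps((float *)x, _mm256_setzero_ps());
}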

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 32556. Aug 19 2015, 9:14 AM
spatel retitled this revision to [x86] invert logic for attribute 'FeatureFastUAMem'.
spatel updated this object.
spatel added a subscriber: llvm-commits.
RKSimon added inline comments. Aug 19 2015, 10:05 AM
lib/Target/X86/X86.td
489 (On Diff #32556)

You can drop FeatureSlowUAMem for BD targets - the AMD 15h SOG confirms that unaligned accesses should perform the same as aligned ones when the address happens to be aligned, and cost only +1cy when it is unaligned. It might be more complex for cache-line crossing, but most targets will suffer there, not just BD.

spatel added inline comments. Aug 19 2015, 10:13 AM
lib/Target/X86/X86.td
489 (On Diff #32556)

Thanks, Simon. Can we make the same argument for AMD 16H? I was planning to fix these up in the next patch and add test cases since that would be a functional change (FIXME at line 445).

RKSimon added inline comments. Aug 19 2015, 10:21 AM
lib/Target/X86/X86.td
489 (On Diff #32556)

Yes, I'm happy for any changes to be made in a follow-up patch.

Jaguar (16h) is definitely just as fast for unaligned loads/stores with aligned addresses, and +1cy for unaligned ones.

IIRC, on Bobcat you could do fast unaligned loads (as long as the SSE unaligned flag was set). I think there was something about stores that you had to be careful with, though. This is probably the same for all AMD 10h/12h families.

silvas added a subscriber: silvas. Aug 19 2015, 12:24 PM
silvas added inline comments.
lib/Target/X86/X86.td
489 (On Diff #32556)

To expand a bit on what Simon said, for Jaguar, the LD/ST unit performs a (naturally aligned) 16-byte access to L1D each cycle, so the actual rule is that you can be as unaligned as you want as long as you remain within a 16-byte chunk, with no penalty.

Basically, if your store crosses a 16-byte chunk, you are forcing the LD/ST unit to do an extra round trip to L1D, which is where the +1cy Simon was talking about comes from. Just considering the mechanism at play here, the compiler can't hope to do anything differently/better in general for the unaligned case, so just let the hardware take care of it if it happens :)
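As a rough, purely illustrative sketch of that chunk rule in C (the helper name and the fixed 16-byte chunk size are assumptions taken from the description above, not anything in LLVM):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an access of 'size' bytes at 'addr' spans more than one naturally
   aligned 16-byte chunk, i.e. would cost the extra round to L1D on Jaguar. */
static bool crosses_16byte_chunk(uintptr_t addr, size_t size) {
  return (addr & 15u) + size > 16u;
}

For example, a 16-byte store at an address that is 8 mod 16 crosses a chunk, while an 8-byte store at 4 mod 16 does not.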

zansari edited edge metadata. Aug 19 2015, 12:30 PM

Hi Sanjay,

Functionality-wise, your changes LGTM (that is, they do what you're intending them to do).

I do think, however, that the code needs a little tweaking down the road. Some random comments in no particular order (all relating to Intel processors; I'm not too familiar with others) that will hopefully answer some of your questions:

  • There are actually a couple of different attributes that we care about:
    • Fast unaligned "instructions". The movups instruction used to be very slow and was always to be avoided before NHM (SLM on the small-core side). After NHM/SLM, unaligned instructions became just as fast as aligned instructions, provided that the access doesn't split a cache line.
    • Faster unaligned memory accesses that do split a cache line. At around the same time, the penalty associated with memory accesses that split cache lines was reduced.

Since these two attributes are set/unset on the same H/W, we might be able to get away with just one attribute.

This means, however, that the attribute is slightly misnamed (not a big deal) and, more importantly, the statement you made ("...became fast enough that we can happily use them whenever we want") isn't entirely true. It is true that we can use the unaligned instructions for cases that don't split a cache line, but there is still a penalty for cases where we do split one. This applies to all 8B and 16B accesses, and to 32B accesses that split either 16B half on anything below HSW (since 32B accesses are double-pumped), or that split anywhere on HSW and above (where we do full 32B accesses on L0).

We can be more "clumsy" in unaligned memory accesses with modern H/W, but we can't completely ignore splits. For example:

loop:
    vmovups %ymm0, array+64(%rcx)   # 0 mod 32
    vmovups %ymm0, array+128(%rcx)  # 0 mod 32

Compared with

loop:
    vmovups %ymm0, array+48(%rcx)   # 16 mod 32
    vmovups %ymm0, array+112(%rcx)  # 16 mod 32

the misaligned (16 mod 32) loop is around 3x slower on HSW. On anything below HSW, the performance of the two is equal due to double-pumping the accesses, but they suffer similar slowdowns if we split the 16B parts (similar to other pure 16B references).

Doing this:

loop:
    vmovups %xmm0, array+48(%rcx)
    vmovups %xmm0, array+64(%rcx)
    vmovups %xmm0, array+112(%rcx)
    vmovups %xmm0, array+128(%rcx)

...changes performance to be right in between the two loops above.

My initial thoughts on heuristics:

  • In general, I think that if we know what the alignment is and we know we will split a cache line, we should use 2 instructions to avoid any penalties.
  • If we don't know the alignment and want to minimize ld/st counts by using larger instructions, it can be worth the gamble on NHM/SLM+ architectures.
  • On H/W before NHM/SLM, we should avoid unaligned instructions.
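To make the interplay of those three rules concrete, here is a purely illustrative C sketch; the enum, function, and parameter names are hypothetical and do not correspond to LLVM's lowering interface:

#include <stdbool.h>

/* Hypothetical store-width choices for lowering an unaligned memory operation. */
enum StoreChoice { TwoSmallerStores, OneWideUnalignedStore, AlignedOnlyCodegen };

/* Sketch of the heuristics above: avoid unaligned instructions entirely on
   pre-NHM/SLM hardware, split into two stores when a cache-line split is
   known, and otherwise gamble on one wide unaligned access. */
static enum StoreChoice chooseStore(bool alignmentKnown, bool wouldSplitCacheLine,
                                    bool fastUnalignedMem /* NHM/SLM or later */) {
  if (!fastUnalignedMem)
    return AlignedOnlyCodegen;
  if (alignmentKnown && wouldSplitCacheLine)
    return TwoSmallerStores;
  return OneWideUnalignedStore;
}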

Hope this helps, and sorry for the long response.

Thanks,
Zia.

Thanks, Sean and Zia. Certainly appreciate the detailed analysis.

So it sounds like we're not completely broken. The attributes themselves may be sufficient to generate the code we want, and they are even mostly applied correctly for each CPU. I'll fix up the AMD chips as much as possible in a follow-on patch.

The bugs are as I noted in the FIXME/TODO comments in the lowering itself. We should probably also have the AVX or SSE4.2 attribute imply that IsUAMemUnder32Slow is false. That way if someone has only specified -mavx without a particular -mcpu, they would get the benefit of larger unaligned memory accesses. Alternatively, -mavx might bump the default CPU model from Merom to Sandybridge.

Regarding crossing cache lines: the current behavior is that we don't take that into account, and that's probably how it will remain... unless, for example, a front-end starts specifying 'align 64' in the IR.

Added a test case to partly confirm that this patch is NFC and to serve as the test for the upcoming AMD changes:
http://reviews.llvm.org/rL245704
http://reviews.llvm.org/rL245709

This revision was automatically updated to reflect the committed changes.