This is an archive of the discontinued LLVM Phabricator instance.

Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.
Needs Review, Public

Authored by kromanova on Nov 13 2013, 11:28 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

SHLD/SHRD are VectorPath (microcoded) instructions known to have poor latency on certain architectures.
While generating SHLD/SHRD is acceptable when optimizing for size, code optimized for speed on these platforms should instead use alternative sequences composed of add, adc, shr, and lea, which are DirectPath instructions. These alternatives not only have lower latency but also increase decode bandwidth by allowing a third DirectPath instruction to be decoded simultaneously.

Given:

return x >> 7 | y << 57;

The generated instruction sequence is:

shrd $7, %rax, %rdx

We should instead prefer:

shl $57, %rax
shr $7, %rdx
or %rax, %rdx

which are all DirectPath instructions.
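
As a minimal sketch of why the two forms are interchangeable, the C functions below compute the expression once as written and once as three separate steps that mirror the shl/shr/or sequence. The function names are hypothetical and chosen only for illustration; which instructions a compiler actually emits depends on the target and optimization settings.

```c
#include <stdint.h>

/* The expression from the summary, written as a single statement.
   A compiler may fold this into one double-shift instruction. */
uint64_t funnel_shift_right_7(uint64_t x, uint64_t y)
{
    return (x >> 7) | (y << 57);
}

/* The same computation split into three independent steps
   (hypothetical helper, mirrors the shl / shr / or sequence). */
uint64_t funnel_shift_right_7_split(uint64_t x, uint64_t y)
{
    uint64_t high = y << 57;  /* shl $57 */
    uint64_t low  = x >> 7;   /* shr $7  */
    return high | low;        /* or      */
}
```

Both functions return the same value for every input, so replacing one form with the other changes only latency and code size, not results.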

AMD processor families K7, K8, K10, K12, K15, and K16 are known to have SHLD/SHRD instructions with very poor latency. The optimization guides for these processors recommend using an alternative sequence of instructions.

I couldn't find an optimization guide for AMD's K14 processor family on the Web, but actual performance measurements showed a 30% speedup on Bobcat (family K14). I'd like confirmation from the community's AMD experts that family K14 processors have poor-latency SHLD/SHRD instructions.

Experiments on Ivy Bridge showed a 15% improvement when the alternative sequence of instructions was generated (thanks to Dmitry Babokin from Intel for running the performance measurements for me). I would also like to hear from Intel experts: if you know which Intel processors should have the "poor latency for SHLD/SHRD instructions" flag, please let me know.

Here are the references to AMD's processor optimization guides:

K7 families: http://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf
Athlon, Athlon-tbird, Athlon-4, Athlon-xp, Athlon-mp

K8 families: http://developer.amd.com/wordpress/media/2012/10/25112.pdf
Athlon64, Opteron, AMD 64 FX, AMD k8-sse, AMD Athlon64-sse3, AMD Opteron-sse3

K10 and K12:
http://amddevcentral.com/Resources/documentation/guides/Pages/default.aspx
-> Software Optimization Guide for AMD Family 10h and 12h Processors
amdfam10

K14:
AMD btver1 (Bobcat)
-> I couldn't find an optimization guide for AMD Family 14, but I think the SHLD documentation for the other families applies to Bobcat as well.

K15:
http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
-> search for "Software Optimization Guide for AMD Family 15h Processors"
bdver1 (Bulldozer), bdver2 (Piledriver)

K16:
btver2 (Jaguar)
-> http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/

Description of the changes:

lib/Target/X86/X86.td:
Introduced a new feature, FeatureSlowSHLD, that should be set for architectures
known to have SHLD/SHRD instructions with very poor latency.
Enabled this feature for all of AMD's family K8-K16 architectures.

lib/Target/X86/X86ISelLowering.cpp:
Don't fold (or (x << c) | (y >> (64 - c))) if SHLD/SHRD instructions
have high latencies and we are not optimizing for size.
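
The condition described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the function and parameter names are hypothetical and are not LLVM's actual API.

```c
#include <stdbool.h>

/* Hypothetical helper, not LLVM's API: decide whether the
   or-of-two-shifts pattern may be folded into SHLD/SHRD. */
bool may_fold_to_double_shift(bool is_shld_slow, bool opt_for_size)
{
    /* Skip the fold only when SHLD/SHRD are slow on the target
       AND we are optimizing for speed rather than size. */
    return !is_shld_slow || opt_for_size;
}
```

With this shape, targets without the slow-SHLD flag keep the existing behavior, and size-optimized builds still get the shorter single-instruction encoding.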

lib/Target/X86/X86Subtarget.cpp:
Set IsSHLDSlow to false by default.
When autodetecting subtarget features, set IsSHLDSlow to true for AMD processors.
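
A rough sketch of the default-plus-autodetect behavior described for the subtarget follows. The struct, the function names, and the use of the CPUID vendor string as the detection key are assumptions made for illustration; the patch itself may detect AMD processors differently.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the subtarget state; only the one flag
   discussed in this review is modeled. */
struct subtarget {
    bool is_shld_slow;
};

/* Default: SHLD/SHRD are assumed to be fast. */
void init_subtarget(struct subtarget *st)
{
    st->is_shld_slow = false;
}

/* Assumed detection key: the CPUID vendor string. "AuthenticAMD" is
   the string AMD processors report. */
void autodetect_features(struct subtarget *st, const char *cpu_vendor)
{
    if (strcmp(cpu_vendor, "AuthenticAMD") == 0)
        st->is_shld_slow = true;
}
```

Keeping the default at false means every target that is not explicitly flagged continues to receive SHLD/SHRD.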

Diff Detail

Event Timeline

Some trivial comments on the patch, otherwise it looks pretty good. I am curious about the Ivy Bridge and later results that you mentioned: how did you test?

lib/Target/X86/X86Subtarget.cpp
275

Extra whitespace.

test/CodeGen/X86/x86-64-double-precision-shift-left.ll
3

Missed some of your comment here?

test/CodeGen/X86/x86-64-double-precision-shift-right.ll
3

Same here.

test/CodeGen/X86/x86-64-double-shifts-var.ll
2

Should use FileCheck instead of grep.

Hi Eric,

I didn't have access to a machine with one of Intel's latest processors. So I asked one of my friends, Dmitry Babokin, who works on the ISPC compiler in Moscow, to do this performance testing on one of Intel's latest architectures. I generated two assembly files with the LLVM compiler (with and without SHLD) for the following test:

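#include <stdint.h>

/* shift must be in the range 1..63; 0 or 64 would make one of the shifts undefined. */
int64_t s128(uint64_t a, uint64_t b, int shift)
{
    return (a << shift) | (b >> (64 - shift));
}

uint64_t s128i(uint64_t a, uint64_t b)
{
    return s128(a, b, 7);
}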

Dmitry called the s128i function 100 million times. The test with the shld instruction took 2.18 sec to finish. The test using the alternative sequence of instructions took 1.89 sec, which is 13.3% faster. All the experiments were done on the Ivy Bridge architecture.
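
For reference, the two timings can be turned into a percentage in two common ways, sketched below. The helper names are hypothetical. Measured against the slower run, the time saved is about 13.3%; measured against the faster run, the speedup is a little over 15%, which is one possible reading of the 15% figure quoted in the summary.

```c
/* Percentage of run time saved, relative to the slower (baseline) run. */
double time_reduction_percent(double baseline_sec, double improved_sec)
{
    return (baseline_sec - improved_sec) / baseline_sec * 100.0;
}

/* Percentage throughput gain, relative to the faster (improved) run. */
double speedup_percent(double baseline_sec, double improved_sec)
{
    return (baseline_sec / improved_sec - 1.0) * 100.0;
}
```

Calling time_reduction_percent(2.18, 1.89) gives roughly 13.3, and speedup_percent(2.18, 1.89) gives roughly 15.3.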

Dmitry also confirmed that on Ivy Bridge, Intel's compiler 13.0 generates code *without* shld instructions.

It would be nice to get a full list of Intel architectures where the shld instruction has very high latency.

Awesome. This should probably be turned on at least for Ivy Bridge
where we have numbers.

Nadav: ?

-eric

kromanova updated this revision to Unknown Object (????). Nov 17 2013, 11:37 PM

Made the corrections based on Eric's comments.

nadav added a comment. Nov 18 2013, 8:37 AM

Katya,

From your earlier description it sounds like neither Intel nor AMD processors benefit from this transformation. Why don’t we enable it only for Oz? What is the point of adding FeatureSlowSHLD?

Thanks,
Nadav

Hi Nadav,

Thanks for looking into this!

There were several reasons for adding FeatureSlowSHLD:

(1) I don't really know which Intel architectures have very poor latency for shld/shrd. Based on my friend's performance measurements, the Ivy Bridge microarchitecture seems to be a good candidate, but that still needs to be confirmed (which is why I haven't changed the code for Ivy Bridge yet). I have a feeling that all other modern Intel processors will fall into this category as well. However, I don't want to change the code purely based on my "feelings". So far, I haven't heard a recommendation from a person who is intimately familiar with Intel's architecture. I'd rather make the change for Intel when I'm 100% sure, or let someone else who cares about the performance of shld/shrd on any of Intel's processors (and who knows what he is doing :)) make this change. After this patch, changing the code to disable this folding for any particular processor will be very easy (just a couple of lines of code). I've put a FIXME comment in the code mentioning that it might make sense to disable this folding for Intel, so there is a clue in the code.

(2) Consistency. There are similar features (e.g. FeatureSlowBTMem) that are enabled for all modern Intel and AMD processors, but these features still exist (I suspect for a reason).

(3) Having FeatureSlowSHLD is a more flexible approach. Even assuming that shld/shrd instructions indeed have very high latency on all modern Intel processors, we should still respect "older" processors and make support for new ones easier (what if AMD fixes the shld issue in its next-generation processors?).

(4) Someone wrote this folding in the past. I suspect that before writing this code, that person made sure the folding was beneficial. Of course, that might have been a while ago, and it may have applied only to the "older" processors.

Katya.

nadav added a comment. Nov 19 2013, 8:37 AM

I checked Agner’s instruction table and it looks like on Sandybridge SHLD is *very* efficient. So, let’s commit the patch as is.

Thanks,
Nadav

So, is it OK to commit with the new changes? I will need to get commit access or ask someone else to commit on my behalf.
Katya.

chfast added a subscriber: chfast. Feb 18 2015, 3:19 AM