SHLD/SHRD are VectorPath (microcode) instructions known to have poor latency on certain architectures.
While generating shld/shrd instructions is acceptable when optimizing for size, code optimized for speed on these platforms should instead use alternative sequences of instructions composed of add, adc, shr, and lea, which are DirectPath instructions. These alternative sequences not only have lower latency but also increase decode bandwidth by allowing simultaneous decoding of a third DirectPath instruction.
Given:
return x >> 7 | y << 57;
The generated instruction sequence is:
shrd $7, %rax, %rdx
we should actually prefer:
shl $57, %rax
shr $7, %rdx
or %rax, %rdx
which are all DirectPath instructions.
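To make the equivalence concrete, here is a minimal C++ sketch of the expression above next to the split form. The function names are hypothetical and chosen only for illustration, and the shift count is assumed to be in the range 1..63 (a count of 0 or 64 would be undefined behavior in C++). The register names in the comments follow the example, with x in %rdx and y in %rax.

```cpp
#include <cstdint>

// Reference form: the expression from the example, generalized to a
// shift count c. Assumes 1 <= c <= 63.
inline std::uint64_t double_shift_right(std::uint64_t x, std::uint64_t y,
                                        unsigned c) {
    return (x >> c) | (y << (64 - c));
}

// Split form: one statement per instruction of the preferred sequence.
inline std::uint64_t double_shift_right_split(std::uint64_t x, std::uint64_t y,
                                              unsigned c) {
    std::uint64_t high = y << (64 - c);  // shl $(64-c), %rax
    std::uint64_t low = x >> c;          // shr $c, %rdx
    return high | low;                   // or %rax, %rdx
}
```

Both functions compute the same value; the second only spells out the three independent operations that replace the single double-shift instruction.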
AMD's K7, K8, K10, K12, K15 and K16 processor families are known to have SHLD/SHRD instructions with very poor latency. The optimization guides for these processors recommend using an alternative sequence of instructions.
I couldn't find optimization guides for AMD's family K14 processors on the Web, but actual performance measurements showed a 30% speedup for Bobcat (family K14). I'd like confirmation from the community's AMD experts that family K14 processors have poor-latency SHLD/SHRD instructions.
Experiments on Ivy Bridge showed a 15% improvement when the alternative sequence of instructions was generated (thanks to Dmitry Babokin from Intel for running the performance measurements for me). I would also like to hear from Intel experts: if you know which Intel processors should have a "poor latency for SHLD/SHRD instructions" flag, please let me know.
Here are the references to the optimization guides for AMD's processors:
K7 families: http://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf
Athlon, Athlon-tbird, Athlon-4, Athlon-xp, Athlon-mp
K8 families: http://developer.amd.com/wordpress/media/2012/10/25112.pdf
Athlon64, Opteron, AMD 64 FX, AMD k8-sse, AMD Athlon64-sse3, AMD Opteron-sse3
K10 and K12:
http://amddevcentral.com/Resources/documentation/guides/Pages/default.aspx
-> Software Optimization Guide for AMD Family 10h and 12h Processors
amdfam10
K14:
AMD btver1 (Bobcat)
-> Couldn't find an optimization guide for AMD Family 14h, but I think the SHLD documentation applies to Bobcat as well.
K15:
http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
-> search for "Software Optimization Guide for AMD Family 15h Processors"
bdver1 (Bulldozer), bdver2 (Piledriver)
K16:
btver2 (Jaguar)
-> http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
Description of the changes:
lib/Target/X86/X86.td:
Introduced a new feature, FeatureSlowSHLD, that should be set for architectures that are
known to have SHLD/SHRD instructions with very poor latency.
Enabled this feature for all of AMD's family K8-K16 architectures.
lib/Target/X86/X86ISelLowering.cpp:
Don't fold (x << c) | (y >> (64 - c)) into SHLD/SHRD if those instructions
have high latency and we are not optimizing for size.
lib/Target/X86/X86Subtarget.cpp
Set IsSHLDSlow to false by default.
When autodetecting subtarget features, set IsSHLDSlow to true for AMD processors.
Extra whitespace.
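The interaction between the subtarget flag and the lowering decision described above can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: SubtargetSketch, Vendor, detectSubtarget and shouldFoldToDoubleShift are hypothetical names, and only IsSHLDSlow comes from the description.

```cpp
// Hypothetical stand-in for the subtarget state; not LLVM's real class.
struct SubtargetSketch {
    bool IsSHLDSlow = false;  // false by default, as in the description
};

// Hypothetical vendor tag used only for this sketch.
enum class Vendor { AMD, Intel, Other };

// Autodetection: mark SHLD/SHRD as slow for AMD processors only.
inline SubtargetSketch detectSubtarget(Vendor vendor) {
    SubtargetSketch st;
    if (vendor == Vendor::AMD)
        st.IsSHLDSlow = true;
    return st;
}

// Lowering decision: fold the or-of-shifts pattern into SHLD/SHRD unless
// the instructions are slow and we are optimizing for speed.
inline bool shouldFoldToDoubleShift(const SubtargetSketch &st,
                                    bool optForSize) {
    return optForSize || !st.IsSHLDSlow;
}
```

With this shape, size-optimized code always keeps the single double-shift instruction, and only speed-optimized code on a slow-SHLD target gets the longer sequence.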