This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improve i8 + 'slow' i16 funnel shift codegen
ClosedPublic

Authored by RKSimon on May 23 2020, 3:09 AM.

Details

Summary

This is a preliminary patch before I deal with the xor+and issue raised in D77301.

We get much better code for i8/i16 funnel shifts by concatenating the operands and performing the shift as a double-width type; this avoids repeated use of the shift amount and avoids partial-register accesses.

fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
fshr(x,y,z) -> (((zext(x) << bw) | zext(y)) >> (z & (bw-1))) >> bw.
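
A minimal scalar sketch of the expansion (illustrative C++, not the patch itself; note that, as foad points out later in the review, the fshr result is simply the low half of the wide shift, so no final >> bw is needed there):

```cpp
#include <cstdint>

// i8 fshl: concatenate x:y into a wider value, shift once by the masked
// amount, and take the high 8 bits.
uint8_t fshl8(uint8_t x, uint8_t y, uint8_t z) {
  const unsigned bw = 8;
  unsigned wide = ((unsigned)x << bw) | (unsigned)y; // concat x:y
  unsigned amt = z & (bw - 1);                       // amount modulo bit width
  return (uint8_t)((wide << amt) >> bw);             // high half
}

// i8 fshr: same concatenation, shift right, and the low 8 bits are already
// the result (no final shift by bw).
uint8_t fshr8(uint8_t x, uint8_t y, uint8_t z) {
  const unsigned bw = 8;
  unsigned wide = ((unsigned)x << bw) | (unsigned)y;
  unsigned amt = z & (bw - 1);
  return (uint8_t)(wide >> amt);                     // low half
}
```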

Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn

This doesn't do as well for i32 cases on x86_64 (the xor+and follow-up patch is much better), so I haven't bothered with that.

Cases with constant amounts are more dubious as well, so I haven't currently bothered with those - it's these kinds of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift.

Diff Detail

Event Timeline

RKSimon created this revision. May 23 2020, 3:09 AM
lebedev.ri added inline comments. May 23 2020, 4:10 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19094

This can be anyext

RKSimon updated this revision to Diff 265858. May 23 2020, 7:04 AM

Use any-extend and always extend to i32 straight away (as I said, i32 funnel shifts as i64 didn't make much sense, so I've dropped that generalization).
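
A hedged sketch of how that shape of lowering might be written against the SelectionDAG API (the helper name and structure here are illustrative, not the committed change): x only needs an any-extend because its stray high bits are thrown away by the final truncate, while y is zero-extended so the bits the result is taken from stay clean.

```cpp
// Illustrative only - not the actual code from this diff.
SDValue lowerFunnelShiftAsWideShift(SDValue X, SDValue Y, SDValue Amt,
                                    bool IsFSHL, EVT VT, const SDLoc &DL,
                                    SelectionDAG &DAG) {
  unsigned BW = VT.getScalarSizeInBits(); // 8 (or 16 on 'slow shld' targets)
  EVT WideVT = MVT::i32;                  // extend straight to i32

  // concat = (anyext(x) << bw) | zext(y)
  SDValue Hi = DAG.getNode(ISD::ANY_EXTEND, DL, WideVT, X);
  SDValue Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, WideVT, Y);
  Hi = DAG.getNode(ISD::SHL, DL, WideVT, Hi,
                   DAG.getShiftAmountConstant(BW, WideVT, DL));
  SDValue Concat = DAG.getNode(ISD::OR, DL, WideVT, Hi, Lo);

  // z & (bw - 1), kept in the amount's original type.
  EVT AmtVT = Amt.getValueType();
  SDValue MaskedAmt = DAG.getNode(ISD::AND, DL, AmtVT, Amt,
                                  DAG.getConstant(BW - 1, DL, AmtVT));

  SDValue Res;
  if (IsFSHL) {
    // fshl: shift the pair left; the result is the high half.
    Res = DAG.getNode(ISD::SHL, DL, WideVT, Concat, MaskedAmt);
    Res = DAG.getNode(ISD::SRL, DL, WideVT, Res,
                      DAG.getShiftAmountConstant(BW, WideVT, DL));
  } else {
    // fshr: shift the pair right; the low half is already the result.
    Res = DAG.getNode(ISD::SRL, DL, WideVT, Concat, MaskedAmt);
  }
  return DAG.getNode(ISD::TRUNCATE, DL, VT, Res);
}
```

Extending straight to i32 keeps the whole computation in full 32-bit registers, which is the point of the change: the shift amount is consumed once and no partial-register writes are needed.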

LGTM

For the fshl case, we could introduce some more ILP: http://volta.cs.utah.edu:8080/z/UJ6viM
https://godbolt.org/z/xsJgPb https://godbolt.org/z/5W26NV
Not sure it would be an improvement?
As a side note, we clearly don't fold to either variant in DAGCombiner.

Looking at these cases in llvm-mca with 'slow shld' targets (btver2/bdver2/znver*), the naive cases all seem to give better throughput.

lebedev.ri accepted this revision. May 23 2020, 11:50 AM
This revision is now accepted and ready to land. May 23 2020, 11:50 AM
This revision was automatically updated to reflect the committed changes.
foad added inline comments. May 26 2020, 1:36 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
19087

The final >> bw is wrong for fshr.

llvm/test/CodeGen/X86/fshl.ll
22–23

Would it be worth trying to generate just movb %al, %dh instead of zext+shll+orl?

RKSimon marked an inline comment as done. May 26 2020, 3:08 AM
RKSimon added inline comments.
llvm/test/CodeGen/X86/fshl.ll
22–23

Yes, that might be useful, but it should probably be done generally. I don't know much about the hi-byte move logic; @craig.topper might be able to advise?

craig.topper added inline comments. May 26 2020, 12:51 PM
llvm/test/CodeGen/X86/fshl.ll
22–23

I think you'd have to jump through some hoops to get the register allocator to do it. You'd need an INSERT_SUBREG to force the join. Possibly even a pseudo instruction on 64-bit to force NOREX on the other register to avoid an encoding issue.

I'm not sure it makes sense to write an h register on modern Intel CPUs. It guarantees a merge uop needs to be generated when bits 15:8 and 7:0 are both read by the consuming instruction.

efriedma added inline comments.
llvm/test/CodeGen/X86/fshl.ll
22–23

On processors that don't have special rename machinery for 8-bit registers, it should simply save an instruction, if it's legal. On big Intel cores, even if it doesn't save a uop, it should still be smaller.

That said, even if it's profitable in this exact case, the register allocation constraints to make it work are really tight; it's probably only worthwhile if the values are already in ABCD registers.

26

We should probably prefer shrl over movb.