This is an archive of the discontinued LLVM Phabricator instance.

[X86] Promote i8/i16 CTTZ (BSF) instructions and remove speculation branch
ClosedPublic

Authored by RKSimon on Aug 23 2022, 6:31 PM.

Details

Summary

This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.

For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions, by setting a bit above the source type's MSB so the promoted operand is never zero (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the source operand is known to be non-zero (e.g. due to the bit set during promotion) we can remove the CMOV that handles a zero source.
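
Roughly, in C terms (an illustrative sketch of the trick rather than the lowering itself; cttz16_promoted is just a made-up name):

#include <cstdint>

// Widen the i16 value to i32 and set bit 16 so the CTTZ/BSF input is never zero;
// a zero i16 input then naturally produces 16.
unsigned cttz16_promoted(uint16_t x) {
  return __builtin_ctz(static_cast<uint32_t>(x) | 0x10000u);
}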

Although BSF isn't very fast, most CPUs from the last 20 years handle it reasonably well, apart from some annoying pass-through EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards the assumption that it will most likely execute as a TZCNT instruction on any semi-modern CPU.

Diff Detail

Event Timeline

RKSimon created this revision. Aug 23 2022, 6:31 PM
Herald added a project: Restricted Project. Aug 23 2022, 6:31 PM
RKSimon requested review of this revision. Aug 23 2022, 6:31 PM
RKSimon added inline comments. Aug 23 2022, 6:32 PM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
696

@foad @arsenm Is the Type arg enough for the target to make a better decision on when to speculate cttz/ctlz?

pengfei added inline comments. Aug 23 2022, 9:53 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
5824

promote

RKSimon updated this revision to Diff 455122. Aug 24 2022, 3:02 AM

Fix capitalization typo

pengfei accepted this revision. Aug 24 2022, 4:27 AM

LGTM.

This revision is now accepted and ready to land. Aug 24 2022, 4:27 AM
foad added inline comments. Aug 24 2022, 4:27 AM
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
696

I don't think we need to do any more work here. The hook should always return true for AMDGPU and you can delete the FIXME comment.

This revision was landed with ongoing or failed builds. Aug 24 2022, 9:28 AM
This revision was automatically updated to reflect the committed changes.

Looks good to me. Some ideas for further minor improvements below. (I guess I should be filing these as missed-optimization bug reports now that this commit is in?)

If we only need a 16-bit output, we can potentially tzcnt %di, %ax, at the cost of a false dependency merging into AX. Again probably only sensible at -Os or -Oz. And of course that requires that we know tzcnt will run as tzcnt, not bsf. tzcnt itself produces the operand size (16 here) for a zero input, which is the behaviour we're emulating without it.

But tzcnt already has a false output dependency on Intel CPUs from Sandybridge through Broadwell (fixed in Skylake for tzcnt/lzcnt, but not popcnt until later).

One way to avoid that (and still emulate tzcnt by setting a high bit):

leal    65536(%rdi), %eax
rep bsf  %eax, %eax

If EDI holds a 16-bit function arg that we require to be zero-extended to 32 bits, that's equivalent to OR. If it was sign-extended to 32 bits, the top half might be all-ones, in which case this would clear it to zero. But the high half being non-zero would imply that the sign bit was set in the low half, so that's actually fine. We only need bit 16 set when the low half is zero, and this does achieve that goal.

(Assuming our caller respects clang's unofficial extension to the x86-64 SysV calling convention where narrow args are extended to 32 bits, which GCC follows but ICC doesn't. If the high half is arbitrary garbage when the low half is zero, no constant we add can guarantee a non-zero value, let alone a cttz result of exactly 16. So LEA can't work in that case.)
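
A throwaway self-check of that add-vs-or argument (not part of the patch, just brute-forcing every 16-bit value under both extension assumptions):

#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t v = 0; v <= 0xFFFF; ++v) {
    uint32_t zext = v;                                               // caller zero-extended the arg
    uint32_t sext = static_cast<uint32_t>(static_cast<int16_t>(v));  // caller sign-extended the arg
    int want = __builtin_ctz(zext | 0x10000u);                       // the OR-based emulation
    assert(__builtin_ctz(zext + 0x10000u) == want);                  // no carry into bit 16, so add == or
    assert(__builtin_ctz(sext + 0x10000u) == want);                  // high-half wrap only happens when bit 15 is already set
  }
}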

llvm/test/CodeGen/X86/clz.ll
553

This would be even better as

movzwl  4(%esp), %eax
orl     $65536, %eax

Only 2 back-end uops (instead of 1 mov-immediate + 1 micro-fused load+or). Pretty much equivalent for the front-end and ROB, but takes one fewer entry in the RS until these execute. The mov-immediate is independent so it can exec early, but still takes a cycle on an execution unit. And then we're left with load and OR uops, just like with my asm.
The one advantage is that the mov-immediate could possibly retire from the ROB while still stalled waiting for the load, but that seems pretty unlikely. (Cache miss on stack memory in a function with stack args, or at the end of a long dependency chain involving storing the arg).

That movzx + or-immediate asm is what we get from C++ source that manually sets a bit, __builtin_ctz(x|0x10000)
https://godbolt.org/z/d63WM5s4c (clang nightly after this revision landed).

We also get it from std::countr_zero(*p); or from a reference. So it seems this mov-immediate thing only happens with by-value stack args, and isn't a concern for the general case of other addressing modes; it's always a simple ESP- or EBP-relative addressing mode, where we won't have un-lamination on Sandy/Ivy Bridge. The movzx load also of course avoids reading outside the pointed-to 16-bit object, which could cross into an unmapped page (or suffer a slowdown from a cache-line split).

Anyway, what we do when dereferencing a pointer, as for std::countr_zero(*p); is slightly more efficient than what we do for a by-value stack arg.
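
For reference, source along these lines reproduces both cases (my reconstruction of the comparison, not the exact godbolt input; function names are made up):

#include <bit>
#include <cstdint>

// By-value i16 stack arg (32-bit target): per the above, this currently gets the
// mov-immediate + or-from-memory form.
int ctz_arg(uint16_t x) { return std::countr_zero(x); }

// Dereferenced pointer: gets the nicer movzx-load + or-immediate + rep bsf form.
int ctz_mem(const uint16_t *p) { return std::countr_zero(*p); }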

But it seems this is not specific to CTTZ lowering; we get mov-immediate / or mem,reg from return x|0x1000;

(vs. 64-bit mode correctly choosing mov %edi, %eax / or $65536, %eax, which puts the mov on the critical path on CPUs without mov-elimination, like Ice Lake with updated microcode :(. But it saves code size. In the 32-bit case there's no latency advantage: a load has to happen as part of the critical path either way, and OR register has the same latency as OR-immediate.)

BTW, with -Oz at least, we should be using 4-byte bts $16, %eax instead of 5-byte or $65536, %eax (or 6-byte for other registers, plus a REX if needed). On Intel CPUs it's still only 1 uop, although it can run on fewer execution ports (p06 in SKL/ICL, only p1 in Alder Lake P-cores). Appropriate at least for -Os -mtune=intel (or any specific Sandybridge-family) if we want to be that fine-grained about different instruction selection, probably even -O2 -mtune=intel. Although maybe not, since Alder Lake P-cores dropped the throughput to 1, competing with imul and tzcnt/lzcnt/popcnt for that port. So maybe only for a -march before that? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small savings.

On AMD CPUs, bts $imm, %reg is 2 uops, so only appropriate for -Oz and maybe -Os.

560

BTW, with -Oz at least, we should be using 4-byte bts $16, %edi instead of 6-byte or $65536, %edi (or 5-byte for EAX).

On Intel CPUs bts $i8, %reg is still only 1 uop, although it can run on fewer execution ports than or (p06, the shift ports, in SKL/ICL; only p1 in Alder Lake P-cores). Appropriate at least for -Os -mtune=intel (or any specific Sandybridge-family) if we want to be that fine-grained about different instruction selection.
Perhaps even -O2 -mtune=intel. Although maybe not, since Alder Lake P-cores dropped the throughput to 1, competing with imul and tzcnt/lzcnt/popcnt for that port. (BTS still has 1 cycle latency, unlike most integer uops that can only run on port 1. Alder Lake E-cores run as 1 uop with 1 cycle latency to the integer output, 2 cycle latency to the CF output.)

So maybe only for -O2 with a -march before Alder Lake? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small savings, just 2 bytes. OTOH the downside on Alder Lake might also be pretty small, unless it's used in a loop with a port 1 bottleneck.

On AMD CPUs, bts $imm, %reg is 2 uops, so only appropriate for -Oz and maybe -Os.