This is an archive of the discontinued LLVM Phabricator instance.

[IR] document and update ctlz/cttz intrinsics to optionally return poison rather than undef
ClosedPublic

Authored by spatel on Jan 21 2022, 11:20 AM.

Details

Summary

The behavior in Analysis (known bits) already implements poison semantics, and we expect the transforms derived from those semantics (for example, in InstCombine), so this patch changes the LangRef and the remaining code to be consistent. This is one more step in removing "undef" from LLVM.

Without this, I think https://github.com/llvm/llvm-project/issues/53330 has a legitimate complaint: that report wants subsequent code to be able to mask off bits of the result, and that is allowed with undef values but not with poison. The clang builtins are not actually documented anywhere AFAICT, but we might want to add documentation to remove more uncertainty.
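To make the semantic difference concrete, here is a hedged C++ sketch of the pattern from that report (the function name is illustrative and not from this patch). Clang lowers __builtin_ctzll to llvm.cttz.i64 with is_zero_poison=true, so under poison semantics the mask no longer bounds the result for a zero input, whereas it would have under the old undef wording:

#include <cstdint>

// Illustrative only: the '& 0x7f' bound rescues a zero input if cttz(0) is
// undef, but not if it is poison, since poison propagates through the 'and'.
unsigned lookup_index(std::uint64_t x) {
  return __builtin_ctzll(x) & 0x7f;
}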

Hopefully, we are approaching a point where we can stop caring about the legacy behavior (~10 years since better x86 instructions were introduced)...but I'm usually too optimistic about those timeframes. :)

Diff Detail

Event Timeline

spatel created this revision. Jan 21 2022, 11:20 AM
spatel requested review of this revision. Jan 21 2022, 11:20 AM
Herald added a project: Restricted Project. Jan 21 2022, 11:20 AM
nikic accepted this revision. Jan 21 2022, 11:30 AM

LGTM

This revision is now accepted and ready to land. Jan 21 2022, 11:30 AM
lebedev.ri accepted this revision. Jan 21 2022, 11:31 AM

lg
Thanks.

Seems fine. This is technically not bitcode backwards-compatible, but that seems unlikely to matter.

Hopefully, we are approaching a point where we can stop caring about the legacy behavior

You mean, we might reach a point where we could make clang emit is_zero_poison=false without anyone caring? Don't think we're quite there yet. x86-64-v2 doesn't include lzcnt. And not sure about the current state of the compiler-rt/libgcc intrinsics.

nlopes accepted this revision. Jan 21 2022, 12:18 PM

Great, thank you!

You mean, we might reach a point where we could make clang emit is_zero_poison=false without anyone caring? Don't think we're quite there yet. x86-64-v2 doesn't include lzcnt. And not sure about the current state of the compiler-rt/libgcc intrinsics.

Yes, I was wondering what level of default x86 tuning was needed to flip the setting...I suppose that's x86-64-v3 ("Close to Haswell").

Is updating isCheapToSpeculateCTTZ/CTLZ to check CMOV instead of BMI/LZCNT enough?

That would let this:

#include <cstdint>
int clipped2(std::uint64_t a) { return (a == 0) ? 64 : __builtin_ctzll(a); }

Become:

bsfq	%rdi, %rcx
movl	$64, %eax
cmovneq	%rcx, %rax

Instead of a 'test + je' around the bsf, so that's probably good. But I can't remember all of the corner-case patterns that we're trying to account for with these builtins.
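For reference, a minimal sketch of the kind of change being suggested, assuming the hook names and Subtarget accessors as they exist in X86ISelLowering.cpp around this time (treat this as an illustration, not the actual patch):

// Sketch: speculate cttz/ctlz whenever a conditional move is available,
// not only when TZCNT/LZCNT are (assumed accessors: hasCMov/hasBMI/hasLZCNT).
bool X86TargetLowering::isCheapToSpeculateCttz() const {
  return Subtarget.hasCMov() || Subtarget.hasBMI();
}

bool X86TargetLowering::isCheapToSpeculateCtlz() const {
  return Subtarget.hasCMov() || Subtarget.hasLZCNT();
}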

zielaj added a subscriber: zielaj. Edited Jan 21 2022, 3:07 PM

I'd like to provide more context for https://github.com/llvm/llvm-project/issues/53330. This issue arose while working on performance-critical code in primesieve, which uses the output of __builtin_ctzll to do an array lookup. It turns out that, because of poorly predictable branches, it's faster for us to compute __builtin_ctzll(x) unconditionally (even when x == 0), do the lookup, and then ignore the result later, than to call __builtin_ctzll(x) only when x != 0. Note that we don't need __builtin_ctzll(0) to be equal to 64 or to any specific value; we just need to be able to bound it somehow so that the array lookup doesn't segfault. This is where the idea of & 0x7f came from. The hope was that, when compiled for processors that support TZCNT, the mask would be optimized away entirely, yet the code would still work on processors that don't support it, at the cost of one additional instruction (AND).

The fastest workaround we have is computing __builtin_ctzll(x | (1ull << 63)) instead. This adds one more 1-cycle instruction (OR), is always bounded, and is equivalent to __builtin_ctzll(x) for non-zero x. Still, since the calls are in a hot loop, this additional instruction results in a measurable slowdown, which is why primesieve currently uses inline assembly to force TZCNT when possible; we couldn't find a C++-only way to accomplish this.

This comment is not a disagreement with this proposal, it just provides more context. I accept that our need may be too obscure to be addressed by clang, and wanted to say thank you for responding to the issue I raised so quickly and for clarifying how undefined __builtin_ctzll(0) is.
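For concreteness, a small sketch of the workaround described above (the function name is illustrative):

#include <cstdint>

// Setting the top bit guarantees a nonzero operand, so the result is always
// in [0, 63]; for nonzero x it equals __builtin_ctzll(x), and for x == 0 it
// yields 63, which is bounded but otherwise meaningless.
inline int ctz_bounded(std::uint64_t x) {
  return __builtin_ctzll(x | (1ull << 63));
}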

Does x == 0 ? 64 : __builtin_ctzll(x) not produce tzcnt when it's available? I would hope you wouldn't need to resort to inline assembly.

@craig.topper Yes, it does, when compiled for example with -march=skylake. But when compiled with -march=x86-64, which I guess is the default for many Linux distributions, it compiles to conditional jumps rather than conditional moves, even with __builtin_unpredictable, which is slow in our case.

On top of this, a widely used version of GCC (11.2) inserts conditional jumps even with -march=skylake (the trunk version of GCC has this fixed).

So, in this case, it's difficult to guarantee optimal generated code across compilers and architectures.

__builtin_unpredictable

Sadly, the backend (SDAG) ignores this hint; it would be really cool to have it.

Does the hint even survive through the middle end? I think the pattern will be turned into llvm.cttz.i64(i64 %x, i1 false). Then, CodeGenPrepare turns it back into the control flow version based on isCheapToSpeculateCTTZ.

Ah, indeed, yes.

unpredictable metadata may be attached to any branch or switch instruction.

Maybe allow it for intrinsics too?

Thanks, @zielaj, for explaining the larger example. It's not clear to me exactly what clang/LLVM can do to improve that case, but if you have a suggestion, please do file another issue.