This is an archive of the discontinued LLVM Phabricator instance.

[MS] Fix _bittest* intrinsics for values bigger than 31
AcceptedPublic

Authored by rnk on May 26 2017, 4:56 PM.

Details

Reviewers
hans
majnemer
Summary

These intrinsics are supposed to select to BT, BTS, etc instructions.
Those instructions actually perform a bitwise array indexing memory
operation that LLVM doesn't currently expose. This change implements
the shifting and array indexing in plain C.

Fixes PR33188

If we ever fix PR19164, then the array indexing should be pattern
matched to BT and BTS memory instructions as appropriate.

Event Timeline

rnk created this revision.May 26 2017, 4:56 PM
hans accepted this revision.May 30 2017, 9:34 AM

lgtm

This revision is now accepted and ready to land.May 30 2017, 9:34 AM

Do you really want to be doing signed division here? Because it is signed, it won't turn into the simple bitshifts and masks that one might expect.

hans added a comment.May 30 2017, 3:59 PM

From looking in the Intel manual (Table 3-2, in 3.1.1.9 about Bit(BitBase,BitOffset)) it does sound like the bit offset can be negative *shudder*, so I suppose this is necessary and explains why the type is signed in the first place? Hopefully most of these will be inlined into a context where BitPos is constant or unsigned.

In D33616#768287, @hans wrote:

From looking in the Intel manual (Table 3-2, in 3.1.1.9 about Bit(BitBase,BitOffset)) it does sound like the bit offset can be negative *shudder*, so I suppose this is necessary and explains why the type is signed in the first place? Hopefully most of these will be inlined into a context where BitPos is constant or unsigned.

Indeed :/

rsmith added a subscriber: rsmith.May 30 2017, 5:27 PM
In D33616#768287, @hans wrote:

From looking in the Intel manual (Table 3-2, in 3.1.1.9 about Bit(BitBase,BitOffset)) it does sound like the bit offset can be negative *shudder*, so I suppose this is necessary and explains why the type is signed in the first place? Hopefully most of these will be inlined into a context where BitPos is constant or unsigned.

OK, but this code will produce the wrong answer in cases where _BitPos is negative, because signed division rounds towards zero rather than rounding down.

majnemer added inline comments.May 30 2017, 7:57 PM
clang/lib/Headers/intrin.h
345–347

_bittest seems to expand to (((unsigned char const *)_BitBase)[_BitPos >> 3] >> (_BitPos & 7)) & 1 on CL ARM: https://godbolt.org/g/Yc8rMH

Perhaps these are the intended semantics?

rnk added a comment.May 31 2017, 1:56 PM

Darn, I think Richard is right, the signed div/mod doesn't do the right thing for negative values. Honestly I need to rig up a test case for this, and then I'll come back to it.

What do you folks think is best:

  • Add an LLVM intrinsic for this and use it
  • Use inline assembly in intrin.h
  • Emit an InlineAsm call directly in CGBuiltin.cpp
  • Keep expanding this to plain C and wait for us to eventually pattern match this in LLVM CodeGen

The LLVM pattern match optimization feels increasingly difficult because of all the edge cases around negative indices, and that's pushing me back towards doing this as an instruction.

clang/lib/Headers/intrin.h
345–347

Personally I think we need to match the x86 behavior on x86.

My feeling is that there's not much value in adding an LLVM intrinsic for something that can be expressed directly in a handful of IR instructions; those instructions seem like the way to express this that would allow the most optimization potential, and the backend should be able to pattern match this to a BT instruction. Based on that, I'm reluctant to add clang builtins for this, partly because it looks like we'd need quite a lot of them and partly because (in the absence of an IR intrinsic) they'd just get lowered into the same instructions as the plain C code anyway.

clang/lib/Headers/intrin.h
398

I'm confused: why are there distinct ...32 and ...64 versions of these, if the bit number can leave the (d|q)word anyway? Are we supposed to number bits right-to-left within words on big endian targets? Or is this guaranteeing something about alignment or about being able to load a full 32/64 bits at the requested word offset?