Page MenuHomePhabricator

[X86] Use bts/btr/btc for single bit set/clear/complement of a variable bit position
ClosedPublic

Authored by craig.topper on Jun 26 2018, 1:50 PM.

Details

Summary

If we are just modifying a single bit at a variable bit position we can use the BT* instructions to make the change instead of shifting a 1(or rotating a -1) and doing a binop. These instruction also ignore the upper bits of their index input so we can also remove an and if one is present on the index.

I'll see if I can spread some multiclass goodness on the td file to reduce the repetition.

Fixes PR37938

Diff Detail

Event Timeline

craig.topper created this revision.Jun 26 2018, 1:50 PM
craig.topper edited the summary of this revision. (Show Details)
  1. No tests for 16-bit case.
  2. Pattern with memory operands unhandled
  3. Pattern with immediates not handled.

(i can follow-up with that, if wanted, mostly have the tests anyway)

I'll add the 16-bit tests.

Memory patterns shouldn't be done. BTR/S/C from memory have strange semantics(they don't ignore unused bits of the index) and are really slow because they recompute the memory address using all bits of the index register.

What do you mean patterns with immediates?

I'll add the 16-bit tests.

Memory patterns shouldn't be done. BTR/S/C from memory have strange semantics(they don't ignore unused bits of the index) and are really slow because they recompute the memory address using all bits of the index register.

You still probably want some test to document that?

What do you mean patterns with immediates?

https://godbolt.org/g/GH5p7Z
Unless of course that is also not profitable.

Added the 16-bit test. The 16-bit btr pattern on x86-64 fails to match because the 'and' got promoted to i32 but the rotate didn't. The btc/bts test promoted everything to i32 so we just use the i32 pattern.

The i686 tests fail to select the new pattern because the parameters came off the stack and we still prioritize load folding over these new patterns. In larger sections of real world code where the data didn't directly come from a load we should manage to fold. I can try to add more tests that pad have a preamble of other operations to separate from the load if we want.

For the immediate case we prefer AND/OR/XOR because they have reciprocal throughput of 0.25 on Haswell and Skylake. BTR/BTS/BTC have a reciprocal throughput of 0.5. We had a discussion about what to do for 64-bit and/or/xor when the immediate larger than a 32-bit sign extended value and thus requries a movabsq. We currently use bts/btr/btc for those under optsize see x86-64-bittest-logic.lll

RKSimon added inline comments.Jun 26 2018, 2:38 PM
test/CodeGen/X86/btc_bts_btr.ll
4

Aren't X86/X64 the wrong way around?

RKSimon added inline comments.Jun 26 2018, 2:40 PM
test/CodeGen/X86/btc_bts_btr.ll
4

Sorry - old revision!

Add more tests with explicit loads.

Should I add the RMW memory test cases that we definitely shouldn't use BTS/BTR/BTC for?

Should I add the RMW memory test cases that we definitely shouldn't use BTS/BTR/BTC for?

I would, and explicitly spell _dontfold in names so it is obvious that these are intentional.

Add the RMW test cases

lebedev.ri added inline comments.Jun 26 2018, 11:48 PM
lib/Target/X86/X86InstrCompiler.td
1811–1813

It doesn't look like the 16-bit cases matched?

# *** IR Dump After X86 DAG->DAG Instruction Selection ***:
# Machine code for function btr_16: IsSSA, TracksLiveness
Function Live Ins: $edi in %0, $esi in %1

bb.0 (%ir-block.0):
  liveins: $edi, $esi
  %1:gr32 = COPY $esi
  %0:gr32 = COPY $edi
  %2:gr8 = COPY %1.sub_8bit:gr32
  %3:gr16 = MOV16ri -2
  $cl = COPY %2:gr8
  %4:gr16 = ROL16rCL %3:gr16, implicit-def dead $eflags, implicit $cl
  %6:gr32 = IMPLICIT_DEF
  %5:gr32 = INSERT_SUBREG %6:gr32, killed %4:gr16, %subreg.sub_16bit
  %7:gr32 = AND32rr %0:gr32, killed %5:gr32, implicit-def dead $eflags
  %8:gr16 = COPY %7.sub_16bit:gr32
  $ax = COPY %8:gr16
  RET 0, $ax
test/CodeGen/X86/btc_bts_btr.ll
513–517

I wonder if something like

mov (%rdi), %edi
btr %esi, %edi

would be better still than not folding at all?

The 16-bit BTR fails to match because the 'and' got promoted to 32-bit and the rotate didn't. We need to fix the promotion of the rotate. I don't think we should try to pattern match the bit width mismatch. In reality, C type promotion rules make it likely the IR for an "unsigned short" case is already in i32 before we even get to the backend so its probably not a huge issue. So I don't think that should hold up this patch. I'm happy to add a FIXME and/or file a bug.

test/CodeGen/X86/btc_bts_btr.ll
513–517

Probably, but its not exactly easy to do. Tablegen generates the match order by ranking how many SDNodes are covered by the pattern. The regular memory pattern for and/or/xor covers more nodes so gets higher priority. To override the priority you have to add an AddedComplexity line to the pattern. But I worry that significantly bumping the priority of this pattern to override the load pattern may have other effects and require other priorities to be adjusted. I might me being overly cautious, but I'd like to keep the simple approach. gcc doesn't do this when there is a memory op either.

The 16-bit BTR fails to match because the 'and' got promoted to 32-bit and the rotate didn't. We need to fix the promotion of the rotate. I don't think we should try to pattern match the bit width mismatch. In reality, C type promotion rules make it likely the IR for an "unsigned short" case is already in i32 before we even get to the backend so its probably not a huge issue. So I don't think that should hold up this patch. I'm happy to add a FIXME and/or file a bug.

Oh right, https://godbolt.org/g/qsd6aF, only the last case isn't folded if i try it locally.

lebedev.ri accepted this revision.Jun 27 2018, 1:57 AM

OK, looks good, other than nits, but you may want to wait for a second opinion.
(Please split into two commits - the tests first)

lib/Target/X86/X86InstrCompiler.td
1822

I would love to know how it would be possible to deduplicate the cases with mask and without mask,
since that is essentially the same problem i have in D48491 :)

1833

The 16-bit BTR fails to match because the 'and' got promoted to 32-bit and the rotate didn't. We need to fix the promotion of the rotate.

Yeah, i think for now this line should be disabled, and a FIXME added/bug filled.

This revision is now accepted and ready to land.Jun 27 2018, 1:57 AM
RKSimon added inline comments.Jun 27 2018, 5:49 AM
test/CodeGen/X86/btc_bts_btr.ll
135

Separate issue but we should really have movzbl here

craig.topper added inline comments.Jun 27 2018, 8:22 AM
test/CodeGen/X86/btc_bts_btr.ll
135

FixupBWInst only promotes byte loads in loops due to code size increase.