Download Raw Diff

Details

Reviewers

RKSimon
spatel
zvi
chandlerc

Commits

rG2b2d8c5eb262: [X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the…
rL325287: [X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the…

Summary

We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision.Sep 2 2017, 6:48 PM

Harbormaster completed remote builds in B9871: Diff 113670.Sep 2 2017, 6:48 PM

craig.topper added reviewers: RKSimon, spatel, zvi, chandlerc.Sep 2 2017, 6:49 PM

craig.topper added a subscriber: llvm-commits.

The code change LGTM, but i'm not sure about the profitability. On recent Intel processors (not sure about AMD), 'or' can be executed at double throughput than 'bt'. Assuming the extra constant move being saved is hoisted out of a hot loop, are we gaining or losing?
For -Os this should be fine.

I can remove the change from TargetTransformInfo.cpp which would make the constants get hoisted by the ConstantHoisting pass. Would that help?

gcc and icc both use the bt instructions for at least simple single BB cases. I'll check some more complicated cases.

In D37418#859821, @zvi wrote:

The code change LGTM, but i'm not sure about the profitability. On recent Intel processors (not sure about AMD), 'or' can be executed at double throughput than 'bt'. Assuming the extra constant move being saved is hoisted out of a hot loop, are we gaining or losing?
For -Os this should be fine.

I suspect AMD CPUs would want the same output as Intel big cores on these. For example, Agner shows 2 uops for btc on Ryzen, so it has higher latency and less throughput than the plain logic ops. The extra uop is caused by the partial flags update of these instructions?

This is another example where we can't determine profitability in isel. So we should choose whatever we think helps the most common case there. Then, add an MI transform to alter it if we can find profit using uarch machine models and machine trace metrics. If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

I ran into this case in a snippet of hot code just Friday, so I'm glad to see someone's working on it. :) In the case I had, the register allocator was spilling a register in a hot loop just to free up room for constant. I hadn't had a chance to actually performance test the tweaked assembly, but I'd be very surprised if the btc variant wasn't faster in that case.

In D37418#859967, @spatel wrote:

If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

Another option here would be to treat this as a register allocator rematerialization problem. We have precedent for doing that for loads which can be folded into instructions, and the "fold constant into instruction" pattern we see here seems somewhat similar in spirit.

Out of the other options, I can't see a late pattern match to convert and/constant form to btc working out very well. In particular, the spilling will have already been done and reversing that seems complicated. Starting with a btc and expanding to the constant/and pattern might be a bit better, but then I worry about missed hoisting opportunities in loops with registers live over them which could have been spill/filled outside the loop.

On a vaguely related topic, has anyone looked at the possible benefit of producing single bit constants via an 8 bit move immediate and a shift immediate? If I'm remembering correctly, it should encode smaller.

In D37418#859967, @spatel wrote:

This is another example where we can't determine profitability in isel. So we should choose whatever we think helps the most common case there. Then, add an MI transform to alter it if we can find profit using uarch machine models and machine trace metrics. If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

This is a good argument, but then as @reames mentioned we could also try to handle register-pressure at spill-time by letting the target try some peephole-ish tricks to avoid spilling; not sure how well will this fits in the existing regalloc infrastructure.
FWIW, we set different scheduling strategies for 32-bit and 64-bit due to the difference in register abundance:

else if (Subtarget.is64Bit())
    setSchedulingPreference(Sched::ILP);
  else
    setSchedulingPreference(Sched::RegPressure);

Remove the X86TargetTransformInfo.cpp changes and rebase.

Still need to change to -Os only

Harbormaster completed remote builds in B13792: Diff 129674.Jan 12 2018, 11:37 AM

craig.topper mentioned this in D42615: [X86] Generate BT instrutions a bit more agressively.Jan 27 2018, 11:43 AM

craig.topper added a comment.Jan 27 2018, 4:04 PM

This comment was removed by craig.topper.

spatel mentioned this in D43201: [X86] Only reorder srl/and on last DAG combiner run.Feb 15 2018, 8:04 AM

Qualify with optsize. Add tests with/without optsize

Once more with test file

LGTM

This revision is now accepted and ready to land.Feb 15 2018, 11:14 AM

Closed by commit rL325287: [X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the… (authored by ctopper). · Explain WhyFeb 15 2018, 12:00 PM

This revision was automatically updated to reflect the committed changes.

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the upper 32-bits of a 64-bit operation.
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 134483

llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp

llvm/trunk/lib/Target/X86/X86InstrCompiler.td

llvm/trunk/lib/Target/X86/X86InstrInfo.td

This is an archive of the discontinued LLVM Phabricator instance.

[X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the upper 32-bits of a 64-bit operation.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 134483

llvm/trunk/lib/Target/X86/X86ISelDAGToDAG.cpp

llvm/trunk/lib/Target/X86/X86InstrCompiler.td

llvm/trunk/lib/Target/X86/X86InstrInfo.td

[X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the upper 32-bits of a 64-bit operation.
ClosedPublic