This is an archive of the discontinued LLVM Phabricator instance.

[X86] Use btc/btr/bts to implement xor/and/or that affects a single bit in the upper 32-bits of a 64-bit operation.
ClosedPublic

Authored by craig.topper on Sep 2 2017, 6:48 PM.

Details

Summary

We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.

Diff Detail

Event Timeline

craig.topper created this revision.Sep 2 2017, 6:48 PM
zvi edited edge metadata.Sep 2 2017, 11:13 PM

The code change LGTM, but i'm not sure about the profitability. On recent Intel processors (not sure about AMD), 'or' can be executed at double throughput than 'bt'. Assuming the extra constant move being saved is hoisted out of a hot loop, are we gaining or losing?
For -Os this should be fine.

I can remove the change from TargetTransformInfo.cpp which would make the constants get hoisted by the ConstantHoisting pass. Would that help?

gcc and icc both use the bt instructions for at least simple single BB cases. I'll check some more complicated cases.

spatel edited edge metadata.Sep 3 2017, 9:07 AM
In D37418#859821, @zvi wrote:

The code change LGTM, but i'm not sure about the profitability. On recent Intel processors (not sure about AMD), 'or' can be executed at double throughput than 'bt'. Assuming the extra constant move being saved is hoisted out of a hot loop, are we gaining or losing?
For -Os this should be fine.

I suspect AMD CPUs would want the same output as Intel big cores on these. For example, Agner shows 2 uops for btc on Ryzen, so it has higher latency and less throughput than the plain logic ops. The extra uop is caused by the partial flags update of these instructions?

This is another example where we can't determine profitability in isel. So we should choose whatever we think helps the most common case there. Then, add an MI transform to alter it if we can find profit using uarch machine models and machine trace metrics. If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

reames added a subscriber: reames.Sep 4 2017, 1:46 PM

I ran into this case in a snippet of hot code just Friday, so I'm glad to see someone's working on it. :) In the case I had, the register allocator was spilling a register in a hot loop just to free up room for constant. I hadn't had a chance to actually performance test the tweaked assembly, but I'd be very surprised if the btc variant wasn't faster in that case.

If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

Another option here would be to treat this as a register allocator rematerialization problem. We have precedent for doing that for loads which can be folded into instructions, and the "fold constant into instruction" pattern we see here seems somewhat similar in spirit.

Out of the other options, I can't see a late pattern match to convert and/constant form to btc working out very well. In particular, the spilling will have already been done and reversing that seems complicated. Starting with a btc and expanding to the constant/and pattern might be a bit better, but then I worry about missed hoisting opportunities in loops with registers live over them which could have been spill/filled outside the loop.

On a vaguely related topic, has anyone looked at the possible benefit of producing single bit constants via an 8 bit move immediate and a shift immediate? If I'm remembering correctly, it should encode smaller.

zvi added a comment.Sep 4 2017, 5:02 PM

This is another example where we can't determine profitability in isel. So we should choose whatever we think helps the most common case there. Then, add an MI transform to alter it if we can find profit using uarch machine models and machine trace metrics. If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?

This is a good argument, but then as @reames mentioned we could also try to handle register-pressure at spill-time by letting the target try some peephole-ish tricks to avoid spilling; not sure how well will this fits in the existing regalloc infrastructure.
FWIW, we set different scheduling strategies for 32-bit and 64-bit due to the difference in register abundance:

else if (Subtarget.is64Bit())
    setSchedulingPreference(Sched::ILP);
  else
    setSchedulingPreference(Sched::RegPressure);

Remove the X86TargetTransformInfo.cpp changes and rebase.

Still need to change to -Os only

This comment was removed by craig.topper.

Qualify with optsize. Add tests with/without optsize

Once more with test file

zvi accepted this revision.Feb 15 2018, 11:14 AM

LGTM

This revision is now accepted and ready to land.Feb 15 2018, 11:14 AM
This revision was automatically updated to reflect the committed changes.