We can't fold a large immediate into a 64-bit operation. But if we know we're only operating on a single bit we can use the bit instructions.
Details
Diff Detail
- Repository
- rL LLVM
- Build Status
Buildable 13792 Build 13792: arc lint + arc unit
Event Timeline
The code change LGTM, but i'm not sure about the profitability. On recent Intel processors (not sure about AMD), 'or' can be executed at double throughput than 'bt'. Assuming the extra constant move being saved is hoisted out of a hot loop, are we gaining or losing?
For -Os this should be fine.
I can remove the change from TargetTransformInfo.cpp which would make the constants get hoisted by the ConstantHoisting pass. Would that help?
gcc and icc both use the bt instructions for at least simple single BB cases. I'll check some more complicated cases.
I suspect AMD CPUs would want the same output as Intel big cores on these. For example, Agner shows 2 uops for btc on Ryzen, so it has higher latency and less throughput than the plain logic ops. The extra uop is caused by the partial flags update of these instructions?
This is another example where we can't determine profitability in isel. So we should choose whatever we think helps the most common case there. Then, add an MI transform to alter it if we can find profit using uarch machine models and machine trace metrics. If the btc variant always reduces register pressure, that could be a good reason to choose it in isel?
I ran into this case in a snippet of hot code just Friday, so I'm glad to see someone's working on it. :) In the case I had, the register allocator was spilling a register in a hot loop just to free up room for constant. I hadn't had a chance to actually performance test the tweaked assembly, but I'd be very surprised if the btc variant wasn't faster in that case.
Another option here would be to treat this as a register allocator rematerialization problem. We have precedent for doing that for loads which can be folded into instructions, and the "fold constant into instruction" pattern we see here seems somewhat similar in spirit.
Out of the other options, I can't see a late pattern match to convert and/constant form to btc working out very well. In particular, the spilling will have already been done and reversing that seems complicated. Starting with a btc and expanding to the constant/and pattern might be a bit better, but then I worry about missed hoisting opportunities in loops with registers live over them which could have been spill/filled outside the loop.
On a vaguely related topic, has anyone looked at the possible benefit of producing single bit constants via an 8 bit move immediate and a shift immediate? If I'm remembering correctly, it should encode smaller.
This is a good argument, but then as @reames mentioned we could also try to handle register-pressure at spill-time by letting the target try some peephole-ish tricks to avoid spilling; not sure how well will this fits in the existing regalloc infrastructure.
FWIW, we set different scheduling strategies for 32-bit and 64-bit due to the difference in register abundance:
else if (Subtarget.is64Bit()) setSchedulingPreference(Sched::ILP); else setSchedulingPreference(Sched::RegPressure);
Remove the X86TargetTransformInfo.cpp changes and rebase.
Still need to change to -Os only