This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't match TESTrr from (cmp (and X, Y), 0) during isel. Defer to post processing
ClosedPublic

Authored by craig.topper on Dec 18 2018, 5:15 PM.

Details

Summary

The (cmp (and X, Y) 0) pattern is greedy and ends up forming a TESTrr and consuming the and when it might be better to use one of the BMI/TBM like BLSR or BLSI.

This patch moves removes the pattern from isel and adds a post processing check to combine TESTrr+ANDrr into just a TESTrr. With this patch we are able to select the BMI/TBM instructions, but we'll also emit a TESTrr when the result is compared to 0. In many cases the peephole pass will be able to use optimizeCompareInstr to remove the TEST, but its probably not perfect.

Diff Detail

Repository
rL LLVM

Event Timeline

craig.topper created this revision.Dec 18 2018, 5:15 PM
andreadb added inline comments.Dec 19 2018, 5:42 AM
test/CodeGen/X86/bmi.ll
531–539 ↗(On Diff #178817)

This is a strange/interesting test.

If %a is zero, then %t1 is also zero.
If %a is not zero, then %t1 has exactly one bit set.

-->

Testing if %t1 is equal to 0, is equivalent to testing if %a is 0.

The only case where %t2 is TRUE, is if %a is 0.
This whole logic could be folded into a icmp + select. So we don't even need to select a BLSI.

This sequence should be optimized at IR level. I didn't test if it is what happens.

That being said. I take that the the purpose of this test was different. Probably, this test should be rewritten in a way that doesn't expose that simplification?

andreadb accepted this revision.Dec 19 2018, 6:06 AM

The change LGTM.

However, tests look a bit too fragile. If the goal is to verify the selection of BMI/TBM instructions, then tests should be made a bit more robust.
Future improvemenets may break some of those patterns; the code from some of those tests can be aggressively simplified...
I suggest to improve the tests first, and then commit this patch.
I think that we should probably raise a couple of bugs for missing simplifications. Most of those missing simplifications can be probably catched at IR level.

test/CodeGen/X86/bmi.ll
624–635 ↗(On Diff #178817)

Again. Here we may prefer POPCNT to BLSI. It tends to have better latency/throughput overall. I think it is worthy to raise a bug for this.

Speaking about these tests in general:
I think that we should make these more robust (maybe in a separate patch).

We can probably make this test more robust by changing how we check the result. For example, rather than comparing against zero, we can compare against a specific power-of-2. That would force the selection of BLSI, since we would need to know the position of that bit.

We can probably do something similar to improve the other test.

880–882 ↗(On Diff #178817)

Same.
Could be a simple test+cmov. But - again - I take that the purpose of this test is not to check if we are smart enough to fold away that sequence...

This revision is now accepted and ready to land.Dec 19 2018, 6:06 AM
craig.topper marked 3 inline comments as done.Dec 19 2018, 8:24 AM
craig.topper added inline comments.
test/CodeGen/X86/bmi.ll
531–539 ↗(On Diff #178817)

The tests were intended to test use the Z flag from the BMI instructions.

624–635 ↗(On Diff #178817)

I thought we just established that BLSI could be replaced with a compare of the input with 0. Why would we replace it with POPCNT?

craig.topper marked 2 inline comments as done.Dec 19 2018, 8:31 AM
craig.topper added inline comments.
test/CodeGen/X86/bmi.ll
624–635 ↗(On Diff #178817)

Doesn't BLSI have better throughput than POPCNT on Intel CPUs? BLSI is on 2 ports. POPCNT is only one port.

880–882 ↗(On Diff #178817)

How could this be a TEST+CMOV? ZF will be set if the input is zero or has exactly one bit set.

andreadb added inline comments.Dec 19 2018, 8:39 AM
test/CodeGen/X86/bmi.ll
624–635 ↗(On Diff #178817)

Right, forget about that comment.
A simple compare of the input is better for this case.

andreadb added inline comments.Dec 19 2018, 8:51 AM
test/CodeGen/X86/bmi.ll
531–539 ↗(On Diff #178817)

I see.

In that case, then don't worry about changing those tests.

I think it is still worthy to raise a bug about teaching how to fold that computation at IR level. We keep these tests to verify that the Z flag is actually used.

andreadb added inline comments.Dec 19 2018, 8:55 AM
test/CodeGen/X86/bmi.ll
880–882 ↗(On Diff #178817)

Ouch.. right. We are not extracting a bit here. We are resetting the lowest set bit here. So TEST+CMOV is not fine. I misread the code.

andreadb added inline comments.Dec 19 2018, 9:22 AM
test/CodeGen/X86/bmi.ll
624–635 ↗(On Diff #178817)

Interesting. On AMD family 16h and 17h, POPCNT and BLSI use the same ALU pipes. However BLSI is 2 uops, versus 1 uop for the POPCNT. So, the throughput of BLSI is half the throughput of POPCNT.
That being said, we already agreed that ideally, a simple compare of the input would have been better.

This revision was automatically updated to reflect the committed changes.