This is an archive of the discontinued LLVM Phabricator instance.

[x86] prefer comparisons against zero for and+cmp sequences
ClosedPublic

Authored by spatel on Apr 13 2016, 5:25 PM.

Details

Summary

For the sake of minimalism, this patch is x86 only, but I think that at least PPC, ARM, AArch64, and Sparc probably want to do this too.

We might want to generalize the hook and pattern recognition for a target like PPC that has a full assortment of negated logic ops (orc, nand).

Note that http://reviews.llvm.org/D18842 would cause this transform to trigger more often.

For reference, this relates to:
https://llvm.org/bugs/show_bug.cgi?id=27105
https://llvm.org/bugs/show_bug.cgi?id=27202
https://llvm.org/bugs/show_bug.cgi?id=27203
https://llvm.org/bugs/show_bug.cgi?id=27328

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 53644.Apr 13 2016, 5:25 PM
spatel retitled this revision from to [x86, ppc] prefer comparisons against zero for and+cmp sequences.
spatel updated this object.
spatel added reviewers: hfinkel, andreadb, kbsmith1.
spatel added a subscriber: llvm-commits.
kbsmith1 edited edge metadata.Apr 13 2016, 6:12 PM

See inline comments.

include/llvm/Target/TargetLowering.h
308 ↗(On Diff #53644)

I see \brief used in other comments preceding functions. Should that be here as well?

313 ↗(On Diff #53644)

This specific statement is a little hard to follow. I think it might be clearer if you say X & Y == Y implies that the bits in Y are a subset of the bits in X. Therefore, the set of bits not in X (~X) unioned (&) with
the bits in Y must be the empty set (0). So, X & Y == 0 is equivalent to ~X & Y == 0.

lib/CodeGen/CodeGenPrepare.cpp
5174 ↗(On Diff #53644)

Maybe I am missing something, or don't understand, but I don't see anything in this routine that checks to make sure this is an == comparison.

test/CodeGen/X86/bmi.ll
174 ↗(On Diff #53644)

Is this valid for ne? The comment in TargetLowering.h should discuss that. Seems like it is, but it would be worth showing that in the comment. This test should also have some tests that match the pattern, except use le, ge, lt, gt as well, as those are not legal patterns to hit the optimization.

hfinkel edited edge metadata.Apr 13 2016, 6:21 PM

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

Yes, I initially thought I'd do this as a DAGCombine, but then I realized that we'd need "isKnownToBeAPowerOfTwo()", and I don't see an SDNode equivalent for the IR version. I'm not sure where we draw the line between CGP and DAGCombine, but duplicating isKnownToBeAPowerOfTwo() scared me away.

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

Yes, I initially thought I'd do this as a DAGCombine, but then I realized that we'd need "isKnownToBeAPowerOfTwo()", and I don't see an SDNode equivalent for the IR version. I'm not sure where we draw the line between CGP and DAGCombine, but duplicating isKnownToBeAPowerOfTwo() scared me away.

Why? How do we select the instructions anyway? For x86 or ppc, would we not want to limit the applicability to constants? It needs to be a constant for rlwinm I'd imagine, and bt too?. For constants, we can check this with existing functionality.

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

Yes, I initially thought I'd do this as a DAGCombine, but then I realized that we'd need "isKnownToBeAPowerOfTwo()", and I don't see an SDNode equivalent for the IR version. I'm not sure where we draw the line between CGP and DAGCombine, but duplicating isKnownToBeAPowerOfTwo() scared me away.

Why? How do we select the instructions anyway? For x86 or ppc, would we not want to limit the applicability to constants? It needs to be a constant for rlwinm I'd imagine, and bt too?. For constants, we can check this with existing functionality.

isKnownToBeAPowerOfTwo() is ~100 lines and recursive. We can approximate it using computeKnownBits(), but I think getting the full power of the IR version would mean duplicating the code.

One of the x86 regression tests shows the case where a simple constant check won't do:

define i1 @and_cmp_const_power_of_two(i32 %x, i32 %y) {
  %shl = shl i32 1, %y
  %and = and i32 %x, %shl
  %cmp = icmp ne i32 %and, %shl
  ret i1 %cmp
}

X86TargetLowering::LowerToBT() uses computeKnownBits() to know that's a power-of-2 mask.

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

Yes, I initially thought I'd do this as a DAGCombine, but then I realized that we'd need "isKnownToBeAPowerOfTwo()", and I don't see an SDNode equivalent for the IR version. I'm not sure where we draw the line between CGP and DAGCombine, but duplicating isKnownToBeAPowerOfTwo() scared me away.

Why? How do we select the instructions anyway? For x86 or ppc, would we not want to limit the applicability to constants? It needs to be a constant for rlwinm I'd imagine, and bt too?. For constants, we can check this with existing functionality.

isKnownToBeAPowerOfTwo() is ~100 lines and recursive. We can approximate it using computeKnownBits(), but I think getting the full power of the IR version would mean duplicating the code.

One of the x86 regression tests shows the case where a simple constant check won't do:

define i1 @and_cmp_const_power_of_two(i32 %x, i32 %y) {
  %shl = shl i32 1, %y
  %and = and i32 %x, %shl
  %cmp = icmp ne i32 %and, %shl
  ret i1 %cmp
}

X86TargetLowering::LowerToBT() uses computeKnownBits() to know that's a power-of-2 mask.

Okay, but in the end, the check used has to match the capabilities of the check used in the backend for lowering. In this case, it needs to use computeKnownBits(). Regardless, in this case, as I understand it, it is not enough to know that the number is some power of 2, we need to know which power of 2.

...

X86TargetLowering::LowerToBT() uses computeKnownBits() to know that's a power-of-2 mask.

Okay, but in the end, the check used has to match the capabilities of the check used in the backend for lowering. In this case, it needs to use computeKnownBits(). Regardless, in this case, as I understand it, it is not enough to know that the number is some power of 2, we need to know which power of 2.

I don't understand the last statement. For x86, the test case demonstrates that "some power-of-2" is the requirement to avoid impeding lowering to 'bt'. For PPC, I think the requirement is the same - we could use a variable shift instruction (eg, 'rlwnm' rather than 'rlwinm').

If the PPC lowering is not doing that, wouldn't it be considered an optimization bug? I suppose this leads us to the conclusion that we really do need to duplicate isKnownToBeAPowerOfTwo() for SDNodes so everyone can use it.

Alternatively, we could wait even longer to attempt this transform - MachineCombiner? - so we avoid stepping on the other optimization. That looks simple enough for the x86 tests, but PPC would take a lot of work.

This certainly a good pattern to catch, but I'm not sure that CGP is the right place for this. We generally have things in CGP to work-around the fact that SDAG/ISel is basic-block local. This kind of thing seems much more natural as a something that should be in DAGCombine.

Yes, I initially thought I'd do this as a DAGCombine, but then I realized that we'd need "isKnownToBeAPowerOfTwo()", and I don't see an SDNode equivalent for the IR version. I'm not sure where we draw the line between CGP and DAGCombine, but duplicating isKnownToBeAPowerOfTwo() scared me away.

Why? How do we select the instructions anyway? For x86 or ppc, would we not want to limit the applicability to constants? It needs to be a constant for rlwinm I'd imagine, and bt too?. For constants, we can check this with existing functionality.

isKnownToBeAPowerOfTwo() is ~100 lines and recursive. We can approximate it using computeKnownBits(), but I think getting the full power of the IR version would mean duplicating the code.

One of the x86 regression tests shows the case where a simple constant check won't do:

define i1 @and_cmp_const_power_of_two(i32 %x, i32 %y) {
  %shl = shl i32 1, %y
  %and = and i32 %x, %shl
  %cmp = icmp ne i32 %and, %shl
  ret i1 %cmp
}

X86TargetLowering::LowerToBT() uses computeKnownBits() to know that's a power-of-2 mask.

Okay, but in the end, the check used has to match the capabilities of the check used in the backend for lowering. In this case, it needs to use computeKnownBits(). Regardless, in this case, as I understand it, it is not enough to know that the number is some power of 2, we need to know which power of 2.

...

X86TargetLowering::LowerToBT() uses computeKnownBits() to know that's a power-of-2 mask.

Okay, but in the end, the check used has to match the capabilities of the check used in the backend for lowering. In this case, it needs to use computeKnownBits(). Regardless, in this case, as I understand it, it is not enough to know that the number is some power of 2, we need to know which power of 2.

I don't understand the last statement. For x86, the test case demonstrates that "some power-of-2" is the requirement to avoid impeding lowering to 'bt'. For PPC, I think the requirement is the same - we could use a variable shift instruction (eg, 'rlwnm' rather than 'rlwinm').

If the PPC lowering is not doing that, wouldn't it be considered an optimization bug? I suppose this leads us to the conclusion that we really do need to duplicate isKnownToBeAPowerOfTwo() for SDNodes so everyone can use it.

You're right, it does not need to be a constant. However, we still need to know how to determine which power of two it will be (i.e. the shift amount), even if not a constant. That is stronger than just knowing it is some power of two. In any case, we should move this to DAGCombine and sync the associated logic with that used for instruction selection.

Alternatively, we could wait even longer to attempt this transform - MachineCombiner? - so we avoid stepping on the other optimization. That looks simple enough for the x86 tests, but PPC would take a lot of work.

Not sure why PPC would take more work than x86 (both backends currently use the MachineCombiner for other purposes). Regardless, I'd not jump here unless necessary.

Not sure why PPC would take more work than x86 (both backends currently use the MachineCombiner for other purposes). Regardless, I'd not jump here unless necessary.

In the test cases, PPC is branching, so I figure there'd be more complexity undoing that than the simple instruction replacement/deletion that x86 needs. But I agree, I don't want to sink that far down if I don't have to.

Before abandoning ship, provide inline answers to Kevin's questions.

include/llvm/Target/TargetLowering.h
308 ↗(On Diff #53644)

Doxygen's autobrief was enabled some time in the last few months, so \brief is just noise as I understand it. We can remove \brief from documentation comments for readability.

313 ↗(On Diff #53644)

This may be a case where we should just let the equation speak for itself. :)

lib/CodeGen/CodeGenPrepare.cpp
5174 ↗(On Diff #53644)

The equality check is (poorly) placed on the first line.

test/CodeGen/X86/bmi.ll
174 ↗(On Diff #53644)

Yes, the transform is valid for 'ne'. This patch is just undoing the opposite direction transform performed by InstCombine in visitICmpInstWithInstAndIntCst(). But I agree we should have tests to confirm that the transform doesn't fire for lt/gt.

Replied to inline comments.

include/llvm/Target/TargetLowering.h
308 ↗(On Diff #53644)

I didn't realize that. Not too long ago a reviewer had me add \brief in some code I was doing.

313 ↗(On Diff #53644)

That sounds fine.

lib/CodeGen/CodeGenPrepare.cpp
5174 ↗(On Diff #53644)

I see it now. I skipped right by it.

spatel updated this revision to Diff 56314.May 5 2016, 11:39 AM
spatel retitled this revision from [x86, ppc] prefer comparisons against zero for and+cmp sequences to [x86] prefer comparisons against zero for and+cmp sequences.
spatel updated this object.
spatel edited edge metadata.

Patch updated:

  1. Move the transform to the DAG vs. the earlier rev that was based in CGP. It turns out that the 'power-of-2' scenario is already handled for us here; TargetLowering has a function called valueHasExactlyOneBitSet() that is used in SimplifySetCC().
  1. I removed the PPC diffs for simplicity, but my local testing shows the same benefits as the earlier rev of the patch.
  1. I'm not sure if this transform should live in TargetLowering, but I've made it a helper of SimplifySetCC() in this draft because that's where all of the related transforms are. DAGCombiner::visitSETCC() calls DAGCombiner::SimplifySetCC() calls TargetLowering::SimplifySetCC(). It's not clear to me what the trade-offs would be if we added transforms directly to DAGCombiner::visitSETCC().

Did you add tests to check that lt/gt conditions don't get transformed?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
1336 ↗(On Diff #56314)

Although this might be your expectation, I don't think it should be an assertion. I think it should be
if (!valueHasExactlyOneBitSet(Y, DAG))

return SDVALUE();

Using an assertion could cause a compiler internal error during a compilation. There is a safe and correct return by using SDValue(), and this code shouldn't really need to know whether the transform for bit test should happen before or after it.

spatel added a comment.May 6 2016, 7:58 AM

Did you add tests to check that lt/gt conditions don't get transformed?

Oops - let me add those and update the patch. The EQ/NE check is hopefully more obvious in the code now.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
1336 ↗(On Diff #56314)

I made this an assert rather an actual condition because I'm assuming this helper function should be tightly coupled with "SimplifySetCC()" below, and I thought duplicating a call to computeKnownBits() (by way of valueHasExactlyOneBitSet()) would be considered wasteful.

The reason I've broken this into a separate function is because it felt wrong to add more code directly to SimplifySetCC() - that is already approaching 1000 lines.

spatel updated this revision to Diff 56421.May 6 2016, 8:53 AM

Patch updated:

  1. Added tests for predicates other than EQ/NE to show that we don't transform those.
  2. Changed assert of valueHasExactlyOneBitSet() to a real check (if my earlier reasoning holds, I can change this back).
kbsmith1 accepted this revision.May 6 2016, 9:09 AM
kbsmith1 edited edge metadata.

Did you add tests to check that lt/gt conditions don't get transformed?

Oops - let me add those and update the patch. The EQ/NE check is hopefully more obvious in the code now.

Thank you for adding those tests now, and yes the EQ/NE check was much more obvious when I read it this time. Thank you.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
1336 ↗(On Diff #56421)

I agree with your reasoning on breaking it out into this separate function. I think that greatly enhanced the readability. Thank you also for changing this to not be an assertion. I think that makes the code less fragile for the future.

This revision is now accepted and ready to land.May 6 2016, 9:09 AM
This revision was automatically updated to reflect the committed changes.