This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] support ZERO_EXTEND in tryBitPermutation
ClosedPublic

Authored by inouehrs on Sep 6 2017, 6:07 AM.

Details

Summary

This patch adds support for ISD::ZERO_EXTEND to PPCDAGToDAGISel::tryBitPermutation to increase the opportunity to use rotate-and-mask instructions by reordering ZEXT and ANDI.
Since tryBitPermutation stops analyzing nodes when it hits a ZEXT node while traversing SDNodes, we want to avoid having a ZEXT between two nodes that could otherwise be folded into a rotate-and-mask instruction.

For example, we allow these nodes

      t9: i32 = add t7, Constant:i32<1>
    t11: i32 = and t9, Constant:i32<255>
  t12: i64 = zero_extend t11
t14: i64 = shl t12, Constant:i64<2>

to be folded into a rotate-and-mask instruction.
Such a case often occurs in array accesses whose index involves a logical AND, e.g. array[i & 0xFF].
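For concreteness, a source pattern like the following produces the add/and/zext/shl chain shown above (the function, element type, and index width are illustrative, not taken from an actual test case):

  // Hypothetical example: the masked index is computed in i32, zero-extended
  // to i64, and shifted left by 2 to form the byte offset of a 4-byte element.
  int load_element(const int *array, unsigned i) {
    return array[(i + 1) & 0xFF];
  }

With this patch, the intervening zero_extend no longer prevents tryBitPermutation from folding the and and the shl into a single rotate-and-mask instruction.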

Diff Detail

Repository
rL LLVM

Event Timeline

inouehrs created this revision. Sep 6 2017, 6:07 AM
hfinkel edited edge metadata. Sep 6 2017, 6:24 AM

I'd prefer that we extend tryBitPermutation to look through the zext. I realize this means adding code there to deal with mismatched input/output bit widths, but that seems much less fragile, and potentially more useful, than this.

I have a particular fear of doing these kinds of transformations as target-specific DAGCombines: the output pattern could be anti-canonical, or could become so in the future, and if that's ever the case, you'll cause the optimizer to hang while DAGCombine and the target fight each other. I also don't like it when one piece of code is guessing what another piece will do. Sometimes this is unavoidable (e.g., the vectorizer guesses what the register allocator will do), but these two pieces of code are essentially at the same level, so there's no need. As a result, I'd recommend against solving the problem this way.

Hal, thank you for the advice. I will redesign based on your comment.

inouehrs updated this revision to Diff 114342. Sep 8 2017, 5:31 AM
inouehrs retitled this revision from [PowerPC] DAGCombine for better exploitation of rotate-and-mask instruction to [PowerPC] support ZERO_EXTEND in tryBitPermutation.
inouehrs edited the summary of this revision.

Reimplemented the optimization in tryBitPermutation instead of adding a new DAGCombine rule, as @hfinkel suggested.

I apologize for taking so long to get back to this. The bit-permutation selector uses a cost model to decide how to lower each permutation sequence. Preemptively lowering the zext like this during the analysis phase of the algorithm seems suboptimal (or at least ad hoc).

In BitPermutationSelector, the ValueBit type can have one of two kinds: Variable or ConstZero. To handle zext here, upon encountering the ISD::ZERO_EXTEND node, we should simply recurse by calling getValueBits(V.getOperand(0), <number of operand bits>), then take the gathered bits and extend the result with ValueBit(ValueBit::ConstZero) (similar to how ISD::AND is handled now).
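A minimal sketch of that handling, assuming the Bits/NumBits names and the pair-returning getValueBits convention of the surrounding BitPermutationSelector code (the details of the actual patch may differ):

  // Sketch only: recurse on the narrower operand, then pad the upper bits
  // with ConstZero, mirroring how the ISD::AND handling marks masked-off bits.
  case ISD::ZERO_EXTEND: {
    // Width of the value being extended, e.g. 32 for an i32 -> i64 zext.
    unsigned NumOperandBits = V.getOperand(0).getValueSizeInBits();

    // Gather the bits of the narrower operand.
    auto OpRes = getValueBits(V.getOperand(0), NumOperandBits);
    const SmallVector<ValueBit, 64> *OpBits = OpRes.second;

    // The low bits come straight from the operand; the remaining high bits
    // of the extended value are known to be zero.
    for (unsigned i = 0; i < NumOperandBits; ++i)
      Bits[i] = (*OpBits)[i];
    for (unsigned i = NumOperandBits; i < NumBits; ++i)
      Bits[i] = ValueBit(ValueBit::ConstZero);

    // Propagate whether the operand itself was "interesting" (an assumption
    // about how the caller consumes the flag).
    return std::make_pair(OpRes.first, &Bits);
  }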

I believe that the remainder of the algorithm will work as-is, except that in the code that actually selects the machine instructions (the code called from Select64 -- and perhaps Select32 if we're extending from i1, although you might avoid that by only handling i32 -> i64 zext), you'll need to check that the incoming values are all i64. If they're not, you'll want to convert them to i64 using INSERT_SUBREG (as you're doing here). You could even do this as a pre-pass over all of the BitGroups to avoid complicating the rest of the code (keeping a map to avoid inserting more insert_subregs than necessary: you want only one per incoming i32 value).
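A hypothetical helper sketching that INSERT_SUBREG conversion together with the one-extension-per-value cache (the helper name, the ExtendedValues map, and the surrounding PPCDAGToDAGISel context with CurDAG available are assumptions, not the patch's actual code):

  // Cache so each incoming i32 value is widened at most once.
  DenseMap<SDValue, SDValue> ExtendedValues;

  SDValue extendToInt64(SDValue V, const SDLoc &dl) {
    if (V.getValueType() == MVT::i64)
      return V;                            // already the right width

    auto It = ExtendedValues.find(V);
    if (It != ExtendedValues.end())
      return It->second;                   // reuse an earlier extension

    // INSERT_SUBREG(IMPLICIT_DEF:i64, V:i32, sub_32): the upper 32 bits are
    // left undefined, which is fine because the selected rotate-and-mask
    // sequence only consumes bits the analysis has accounted for.
    SDValue ImDef = SDValue(
        CurDAG->getMachineNode(TargetOpcode::IMPLICIT_DEF, dl, MVT::i64), 0);
    SDValue SubRegIdx = CurDAG->getTargetConstant(PPC::sub_32, dl, MVT::i32);
    SDValue Ext = SDValue(
        CurDAG->getMachineNode(TargetOpcode::INSERT_SUBREG, dl, MVT::i64,
                               ImDef, V, SubRegIdx), 0);
    ExtendedValues[V] = Ext;
    return Ext;
  }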

inouehrs updated this revision to Diff 117096. Sep 29 2017, 12:33 AM
inouehrs edited the summary of this revision.

@hfinkel Thank you so much for the suggestion. I reimplemented the patch based on it, and I hope this implementation is cleaner than the previous one.
So far, the patch supports only i32 to i64 zero extensions. I will add support for other conversions, such as the i1 case, if I find common patterns.

hfinkel added inline comments. Sep 29 2017, 8:26 AM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
1076 (On Diff #117096)

entend -> extend

1079 (On Diff #117096)

Why are you checking that the upper bits are zero? This is a zext node, so I'd think you should just force them all to zero (like the code for AND does for bits that are zero).

hfinkel added inline comments. Sep 29 2017, 8:37 AM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
1079 (On Diff #117096)

Or, to put it another way, if you do it this way, then you should have a comment that reads something like, "We'll look through zext nodes here, but only if they're provably redundant." If we do this, however, we should explain why.

inouehrs updated this revision to Diff 117202. Sep 29 2017, 12:35 PM
inouehrs added inline comments. Sep 29 2017, 12:40 PM
lib/Target/PowerPC/PPCISelDAGToDAG.cpp
1079 (On Diff #117096)

As you suggested, zero extension behaves like a logical AND, so I do not need to check the upper bits.
I removed the check.

hfinkel accepted this revision. Sep 29 2017, 1:07 PM

LGTM (but please run performance tests and a self-host check before committing - I can imagine us overlooking something here).

lib/Target/PowerPC/PPCISelDAGToDAG.cpp
1074 (On Diff #117202)

Line is too long.

This revision is now accepted and ready to land. Sep 29 2017, 1:07 PM
This revision was automatically updated to reflect the committed changes.

I did not see a significant change in performance on average (0.01% improvement) across five runs of SPEC CPU2006.
For individual benchmarks, the results range from a 1.74% improvement (povray) to a 1.23% degradation (xalan).
(Note that the measurements were done on a somewhat noisy cloud instance.)

All tests passed on a ppc64le/POWER8 box.