Page MenuHomePhabricator

[PowerPC] Match vec_revb builtins to P9 instructions.
ClosedPublic

Authored by jtony on May 30 2017, 1:27 PM.

Details

Summary
Power9 has instructions that will reverse the bytes within an element for all
sizes (half-word, word, double-word and quad-word). These can be used for the
vec_revb builtins in altivec.h. However, we implement these to match vector
shuffle nodes as that will cover both the builtins and vector shuffles that
occur in the SDAG through other means. 
 
 This patch is tested functionally clean on Power9 machine.

Diff Detail

Repository
rL LLVM

Event Timeline

jtony created this revision.May 30 2017, 1:27 PM
nemanjai edited edge metadata.May 30 2017, 9:49 PM

I am debating internally about suggesting a potential solution for this, but this implementation essentially misses an entire set of complementary shuffle masks. We have a number of instructions that do element-wise reordering and we now have these that do per-element byte reversal. Combining the capabilities covers a lot more shuffles - I am just not positive that these occur enough to warrant the effort.

Here's what I mean:

  • We have an instruction that will do a "rotate-left-by-word" operation on a vector (and a way to emit that instructions)
  • We now have a "reverse-bytes-within-word-elements" operation
  • We don't have a "reverse-bytes-within-each-word-and-rotate-left-by-word", which we can simply do with a 2 instruction sequence now

And of course, the same goes for all other masks we lower to a single instruction. It might be useful for each of them to detect byte-reversal as well. It is non-trivial work, but doesn't sound fundamentally all that hard. Perhaps we should re-design our handling of shuffles at some point and have a robust way to determine what we can lower to a one or two instruction sequence on any Subtarget.

lib/Target/PowerPC/PPCISelLowering.cpp
1598 ↗(On Diff #100756)

This comment is no longer adequate. There is now another parameter which appears to be the "direction and stride". Please elaborate in a comment what this function does and how to use it.
Also, if values other than 1/-1 don't make sense for the new parameter, you can add an assert.

1768 ↗(On Diff #100756)

Doesn't it just suffice at this point to ensure that each element of the shuffle mask has a value less than 16?

test/CodeGen/PowerPC/vec_revb.ll
1 ↗(On Diff #100756)

If you're breaking up the RUN lines, keep them within 80 columns.

inouehrs edited edge metadata.May 31 2017, 4:53 AM

I agree to Nemanja's comment. Since we have so many special patterns of permutation (e.g. byte reverse, rotate, shift, merge, splat, pack, unpack etc etc), we need a robust framework to optimize these patterns. Maybe we can create a table that maps a shuffle pattern to a specific instruction.

In complex case that can map onto a couple of instructions, there is a trade-off between consuming permutation pipeline resource and a vector register; e.g. "reverse-bytes-within-each-word-and-rotate-left-by-word" can be executed by one permute instruction using one additional vector register for the shuffle pattern. In hot loops, one permute approach may be a better choice than 2 instruction approach since we may be able to prepare shuffle pattern out side loop (if we have enough vector registers).

I agree to Nemanja's comment. Since we have so many special patterns of permutation (e.g. byte reverse, rotate, shift, merge, splat, pack, unpack etc etc), we need a robust framework to optimize these patterns. Maybe we can create a table that maps a shuffle pattern to a specific instruction.

In complex case that can map onto a couple of instructions, there is a trade-off between consuming permutation pipeline resource and a vector register; e.g. "reverse-bytes-within-each-word-and-rotate-left-by-word" can be executed by one permute instruction using one additional vector register for the shuffle pattern. In hot loops, one permute approach may be a better choice than 2 instruction approach since we may be able to prepare shuffle pattern out side loop (if we have enough vector registers).

Of course, we have the "Perfect Shuffle" solution that works on BE and for v4i32 only. We could extend it to work on LE and to support the new instructions. However, I'm not convinced that's the best (or fastest) way forward.

jtony marked 3 inline comments as done.May 31 2017, 12:53 PM
jtony added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
1768 ↗(On Diff #100756)

Just check the value less 16 is not sufficient here. We need to make sure the starting element index of each N Byte element is i + Width - 1. For example, for XXBRW , I expect a mask like this: shuffleVector(a, undef, 3,2,1,0,7,6,5,4, 11,10,9,8,15,14,13,12)
We check getElt(0)=3, getElt(4)=7, getElt(8)=11, getElt(12)=15
i.e. N->getMaskElt(i) != i + Width - 1, here Width == 4

jtony updated this revision to Diff 100958.May 31 2017, 7:20 PM
jtony marked an inline comment as done.

Address comments from Nemanja.

jtony added a comment.May 31 2017, 7:22 PM

I am debating internally about suggesting a potential solution for this, but this implementation essentially misses an entire set of complementary shuffle masks. We have a number of instructions that do element-wise reordering and we now have these that do per-element byte reversal. Combining the capabilities covers a lot more shuffles - I am just not positive that these occur enough to warrant the effort.

Here's what I mean:

  • We have an instruction that will do a "rotate-left-by-word" operation on a vector (and a way to emit that instructions)
  • We now have a "reverse-bytes-within-word-elements" operation
  • We don't have a "reverse-bytes-within-each-word-and-rotate-left-by-word", which we can simply do with a 2 instruction sequence now

    And of course, the same goes for all other masks we lower to a single instruction. It might be useful for each of them to detect byte-reversal as well. It is non-trivial work, but doesn't sound fundamentally all that hard. Perhaps we should re-design our handling of shuffles at some point and have a robust way to determine what we can lower to a one or two instruction sequence on any Subtarget.

Had a talk with Nemanja, this work should be done as a separate issue.

jtony edited the summary of this revision. (Show Details)Jun 2 2017, 7:36 AM
kbarton added inline comments.Jun 5 2017, 10:54 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1602 ↗(On Diff #100958)

Please remove \brief. We no longer need to use \brief now that autobrief has been enabled.

1604 ↗(On Diff #100958)

incremental/decremental -> increasing/decreasing

1609 ↗(On Diff #100958)

incremental/decremental -> increasing/decreasing

1781 ↗(On Diff #100958)

No braces required here.

7909 ↗(On Diff #100958)

I'm probably missing something basic here, but why are we always converting the return to v16i8?

test/CodeGen/PowerPC/vec_revb.ll
2 ↗(On Diff #100958)

The patterns are the same for both BE and LE, therefor you don't need separate CHECK-BE and CHECK-LE labels.

jtony updated this revision to Diff 101888.Jun 8 2017, 4:26 AM
jtony marked 6 inline comments as done.

Address Kit's comments.

jtony added inline comments.Jun 8 2017, 4:41 AM
lib/Target/PowerPC/PPCISelLowering.cpp
7909 ↗(On Diff #100958)

Base on my understanding, after legalization, the only legal vector type for vector_shuffle is v16i8, and also the return value for vector_shuffle is v16i8, if we want to replace vector_shuffle with PPCISD::XXREVERSE node, we want to keep the original return type. Otherwise, it will cause "LLVM ERROR: Cannot select" error.

If you look at the debug info, it is more clear. If I remove the bitcast to v16i8, for XXBRQ, we would do the following (note we changed the return type after DAG combine):

Legalizing: t5: v16i8 = vector_shuffle<15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0> t3, undef:v16i8
... replacing: t5: v16i8 = vector_shuffle<15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0> t3, undef:v16i8 with: t11: v1i128 = PPCISD::XXREVERSE t2

This will eventually cause:
LLVM ERROR: Cannot select: t6: v1i128 = bitcast t11

t11: v1i128 = PPCISD::XXREVERSE t2
  t2: v1i128,ch = CopyFromReg t0, Register:v1i128 %vreg0
    t1: v1i128 = Register %vreg0

In function: testXXBRQ

Therefore, we always need to cast the return value to v16i8. Correct me, if there is any improper understanding.

This revision is now accepted and ready to land.Jun 12 2017, 11:04 AM
This revision was automatically updated to reflect the committed changes.