Power9 has instructions that will reverse the bytes within an element for all sizes (half-word, word, double-word and quad-word). These can be used for the vec_revb builtins in altivec.h. However, we implement these to match vector shuffle nodes as that will cover both the builtins and vector shuffles that occur in the SDAG through other means. This patch is tested functionally clean on Power9 machine.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
I am debating internally about suggesting a potential solution for this, but this implementation essentially misses an entire set of complementary shuffle masks. We have a number of instructions that do element-wise reordering and we now have these that do per-element byte reversal. Combining the capabilities covers a lot more shuffles - I am just not positive that these occur enough to warrant the effort.
Here's what I mean:
- We have an instruction that will do a "rotate-left-by-word" operation on a vector (and a way to emit that instructions)
- We now have a "reverse-bytes-within-word-elements" operation
- We don't have a "reverse-bytes-within-each-word-and-rotate-left-by-word", which we can simply do with a 2 instruction sequence now
And of course, the same goes for all other masks we lower to a single instruction. It might be useful for each of them to detect byte-reversal as well. It is non-trivial work, but doesn't sound fundamentally all that hard. Perhaps we should re-design our handling of shuffles at some point and have a robust way to determine what we can lower to a one or two instruction sequence on any Subtarget.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1598 ↗ | (On Diff #100756) | This comment is no longer adequate. There is now another parameter which appears to be the "direction and stride". Please elaborate in a comment what this function does and how to use it. |
1768 ↗ | (On Diff #100756) | Doesn't it just suffice at this point to ensure that each element of the shuffle mask has a value less than 16? |
test/CodeGen/PowerPC/vec_revb.ll | ||
1 ↗ | (On Diff #100756) | If you're breaking up the RUN lines, keep them within 80 columns. |
I agree to Nemanja's comment. Since we have so many special patterns of permutation (e.g. byte reverse, rotate, shift, merge, splat, pack, unpack etc etc), we need a robust framework to optimize these patterns. Maybe we can create a table that maps a shuffle pattern to a specific instruction.
In complex case that can map onto a couple of instructions, there is a trade-off between consuming permutation pipeline resource and a vector register; e.g. "reverse-bytes-within-each-word-and-rotate-left-by-word" can be executed by one permute instruction using one additional vector register for the shuffle pattern. In hot loops, one permute approach may be a better choice than 2 instruction approach since we may be able to prepare shuffle pattern out side loop (if we have enough vector registers).
Of course, we have the "Perfect Shuffle" solution that works on BE and for v4i32 only. We could extend it to work on LE and to support the new instructions. However, I'm not convinced that's the best (or fastest) way forward.
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1768 ↗ | (On Diff #100756) | Just check the value less 16 is not sufficient here. We need to make sure the starting element index of each N Byte element is i + Width - 1. For example, for XXBRW , I expect a mask like this: shuffleVector(a, undef, 3,2,1,0,7,6,5,4, 11,10,9,8,15,14,13,12) |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
1602 ↗ | (On Diff #100958) | Please remove \brief. We no longer need to use \brief now that autobrief has been enabled. |
1604 ↗ | (On Diff #100958) | incremental/decremental -> increasing/decreasing |
1609 ↗ | (On Diff #100958) | incremental/decremental -> increasing/decreasing |
1781 ↗ | (On Diff #100958) | No braces required here. |
7909 ↗ | (On Diff #100958) | I'm probably missing something basic here, but why are we always converting the return to v16i8? |
test/CodeGen/PowerPC/vec_revb.ll | ||
2 ↗ | (On Diff #100958) | The patterns are the same for both BE and LE, therefor you don't need separate CHECK-BE and CHECK-LE labels. |
lib/Target/PowerPC/PPCISelLowering.cpp | ||
---|---|---|
7909 ↗ | (On Diff #100958) | Base on my understanding, after legalization, the only legal vector type for vector_shuffle is v16i8, and also the return value for vector_shuffle is v16i8, if we want to replace vector_shuffle with PPCISD::XXREVERSE node, we want to keep the original return type. Otherwise, it will cause "LLVM ERROR: Cannot select" error. If you look at the debug info, it is more clear. If I remove the bitcast to v16i8, for XXBRQ, we would do the following (note we changed the return type after DAG combine): Legalizing: t5: v16i8 = vector_shuffle<15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0> t3, undef:v16i8 This will eventually cause: t11: v1i128 = PPCISD::XXREVERSE t2 t2: v1i128,ch = CopyFromReg t0, Register:v1i128 %vreg0 t1: v1i128 = Register %vreg0 In function: testXXBRQ Therefore, we always need to cast the return value to v16i8. Correct me, if there is any improper understanding. |