[PowerPC] Use rldimi/rlwimi instructions to optimize build_vector
Needs Review · Public

Authored by qiucf on Jan 11 2021, 9:24 PM.

Details

Reviewers
nemanjai
jsji
shchenz
Group Reviewers
Restricted Project
Summary

Leverage rldimi/rlwimi instructions to generate better code for BUILD_VECTOR:

  • For v16i8, build four words, each as (i8 << 24) | (i8 << 16) | (i8 << 8) | i8, and construct the vector from them.
  • For v8i16, build four words, each as (i16 << 16) | i16, and construct the vector from them.
  • We already have patterns for v4i32 and v2i64 construction.
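
The packing arithmetic in the list above can be sketched in plain C++. This is only an illustration of the bit layout, not code from the patch; the helper names packBytes and packHalves are hypothetical, and the sketch says nothing about which vector lane each element ends up in.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): pack four i8 values into one
// 32-bit word as (a << 24) | (b << 16) | (c << 8) | d.
constexpr std::uint32_t packBytes(std::uint8_t a, std::uint8_t b,
                                  std::uint8_t c, std::uint8_t d) {
  return (std::uint32_t{a} << 24) | (std::uint32_t{b} << 16) |
         (std::uint32_t{c} << 8) | std::uint32_t{d};
}

// Hypothetical helper (not from the patch): pack two i16 values into one
// 32-bit word as (a << 16) | b.
constexpr std::uint32_t packHalves(std::uint16_t a, std::uint16_t b) {
  return (std::uint32_t{a} << 16) | std::uint32_t{b};
}

// Compile-time checks of the layout.
static_assert(packBytes(0x11, 0x22, 0x33, 0x44) == 0x11223344u, "byte packing");
static_assert(packHalves(0xAABB, 0xCCDD) == 0xAABBCCDDu, "halfword packing");
```

Four such words cover all 16 bytes (or all 8 halfwords) of the vector, which is why both element types reduce to the same "four groups" shape.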

Event Timeline

qiucf created this revision. Jan 11 2021, 9:24 PM
qiucf requested review of this revision. Jan 11 2021, 9:24 PM
Herald added a project: Restricted Project. Jan 11 2021, 9:24 PM

If all the values are in GPRs, the code produced with this patch:

mtvsrdd 34, 4, 3
mtvsrdd 35, 6, 5
vpkudum 2, 3, 2
mtvsrdd 35, 8, 7
mtvsrdd 36, 10, 9
vpkudum 3, 4, 3
vpkuwum 2, 3, 2

is certainly better than the naive code we currently produce. However, I don't think we should do the merging/packing in the vector domain: at least on P9, we get half the dispatch width, and the permute operations potentially have a higher latency. This approach may also increase vector register pressure, which is probably not ideal. For the basic case (where all values are in GPRs), I think we should simply add a pattern in the .td file that does something like this (similar to what we did for the wider elements):

rlwimi 3, 4, ...  # merge r3 and r4
rlwimi 5, 6, ...  # merge r5 and r6
rlwimi 7, 8, ...  # merge r7 and r8
rlwimi 9, 10, ... # merge r9 and r10
rldimi 3, 5, ...  # merge r3, r4, r5, r6
rldimi 7, 9, ...  # merge r7, r8, r9, r10
mtvsrdd 34, 3, 7
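
As a rough model of the GPR-side merge sequence above, here is a C++ sketch for the v8i16 case. The operands elided as "..." in the listing are left as they are; the sketch assumes the values that appear in the TableGen snippet later in this thread (rlwimi with 16, 0, 15 and rldimi with 32, 0). The function names and the V128 struct are hypothetical, and lane order and endianness are deliberately not modeled.

```cpp
#include <cstdint>

// Model of "rlwimi rA, rS, 16, 0, 15": insert the low halfword of rS into
// bits 16..31 of rA, leaving every other bit of rA unchanged.
constexpr std::uint64_t rlwimi16(std::uint64_t ra, std::uint64_t rs) {
  return (ra & ~0xFFFF0000ull) | ((rs & 0xFFFFull) << 16);
}

// Model of "rldimi rA, rS, 32, 0": insert the low word of rS into the
// high word of rA, keeping the low word of rA.
constexpr std::uint64_t rldimi32(std::uint64_t ra, std::uint64_t rs) {
  return (rs << 32) | (ra & 0xFFFFFFFFull);
}

// Hypothetical 128-bit result: mtvsrdd places RA in doubleword 0 and RB in
// doubleword 1 of the target VSR.
struct V128 {
  std::uint64_t dword0;
  std::uint64_t dword1;
};

// r3..r10 model the eight GPRs holding the i16 elements.
constexpr V128 buildV8i16(std::uint64_t r3, std::uint64_t r4, std::uint64_t r5,
                          std::uint64_t r6, std::uint64_t r7, std::uint64_t r8,
                          std::uint64_t r9, std::uint64_t r10) {
  return V128{
      rldimi32(rlwimi16(r3, r4), rlwimi16(r5, r6)),    // r3..r6 merged
      rldimi32(rlwimi16(r7, r8), rlwimi16(r9, r10))};  // r7..r10 merged
}
```

For inputs 1 through 8 this model yields 0x0004000300020001 in doubleword 0 and 0x0008000700060005 in doubleword 1. Because each insert overwrites the upper bits of its destination, stale upper bits in any-extended i16 inputs do not leak into the result.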

For 32-bit mode, we can't really merge to doublewords in GPRs, but I think the values can be moved to VSRs after the word merges and then merged with a single vpkuwum.

qiucf planned changes to this revision. May 30 2021, 8:25 PM

Will use rldimi/rlwimi instructions to build vector.

qiucf updated this revision to Diff 351090. Jun 10 2021, 1:41 AM
qiucf retitled this revision from "[PowerPC] Use mtvsrdd+vpku instructions to optimize build_vector" to "[PowerPC] Use rldimi/rlwimi instructions to optimize build_vector".
qiucf edited the summary of this revision.
qiucf edited reviewers, added: shchenz; removed: steven.zhang, bsaleil.
shchenz added inline comments. Jun 15 2021, 3:21 AM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
9038

Out of curiosity, if we already have patterns for v4i32 and v2i64, should we also handle v16i8 and v8i16 there?

9065

No need for the else; return directly.

llvm/test/CodeGen/PowerPC/pre-inc-disable.ll
343

Why are so many instructions eliminated from the entry block? Are they moved to the for.body block?
If so, and if for.body were a real loop body (currently it is not; maybe we can change the IR to make it one), would this increase the loop size?

qiucf updated this revision to Diff 352899. Jun 17 2021, 7:20 PM
qiucf marked an inline comment as done.
qiucf updated this revision to Diff 366232. Aug 13 2021, 3:57 AM

Update tests.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
9038

It seems we can't write a nested pattern for instructions like rldimi, which both read and write their first operand:

def : Pat<(v8i16 (build_vector i16:$A, i16:$B, i16:$C, i16:$D,
                               i16:$E, i16:$F, i16:$G, i16:$H)),
          (MTVSRDD
            (RLDIMI
              (RLWIMI8 AnyExts16.C, AnyExts16.D, 16, 0, 15),
              (RLWIMI8 AnyExts16.A, AnyExts16.B, 16, 0, 15), 32, 0),
            (RLDIMI
              (RLWIMI8 AnyExts16.G, AnyExts16.H, 16, 0, 15),
              (RLWIMI8 AnyExts16.E, AnyExts16.F, 16, 0, 15), 32, 0))>;

Also, these patterns would be complex for v16i8.

llvm/test/CodeGen/PowerPC/pre-inc-disable.ll
343

Some of this code should be dead. I ran opt on it and the loop went away; with the current llc, the dead instructions are then removed as well.

qiucf updated this revision to Diff 378914. Oct 12 2021, 1:18 AM

Update testcase