[PowerPC] Use mtvsrdd+vpku instructions to optimize build_vector
Needs Review · Public

Authored by qiucf on Jan 11 2021, 9:24 PM.

Details

Reviewers
nemanjai
steven.zhang
jsji
bsaleil
Group Reviewers
Restricted Project
Summary

mtvsrdd, introduced in ISA 3.0, moves two GPRs into a vector register in a single instruction, so we can use it to reduce the number of instructions needed to build a vector from scalar elements. Take v8i16 as an example (u for undef, other letters for elements):

u u u a  <-- original elements
u u u b
...

u u u a u u u b  <-- mtvsrdd
u u u c u u u d
...

u a u b u c u d  <-- vpkudum
u e u f u g u h

a b c d e f g h  <-- vpkuwum

In theory, this applies to vectors from v2i64 to v16i8. However, rldimi+vpkudum produces better code for v4i32.
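The v8i16 diagram above can be checked with a small Python model. This is only a sketch: it assumes big-endian element numbering (doubleword 0 is the most significant, as in the diagram), represents a vector register as a list of two 64-bit doublewords, and the helper names `to_words`, `from_words`, and `build_v8i16` are made up for illustration rather than taken from the patch.

```python
MASK64 = (1 << 64) - 1

def mtvsrdd(ra, rb):
    """Model mtvsrdd: doubleword 0 <- ra, doubleword 1 <- rb."""
    return [ra & MASK64, rb & MASK64]

def to_words(v):
    """Split a vector (two doublewords) into four 32-bit words."""
    return [(dw >> shift) & 0xFFFFFFFF for dw in v for shift in (32, 0)]

def from_words(words):
    """Join four 32-bit words back into two doublewords."""
    return [(words[0] << 32) | words[1], (words[2] << 32) | words[3]]

def vpkudum(va, vb):
    """Model vpkudum: keep the low 32 bits of each doubleword of va || vb."""
    return from_words([dw & 0xFFFFFFFF for dw in va + vb])

def vpkuwum(va, vb):
    """Model vpkuwum: keep the low 16 bits of each word of va || vb.

    The result is returned as a list of eight halfwords for readability.
    """
    return [w & 0xFFFF for w in to_words(va) + to_words(vb)]

def build_v8i16(gprs):
    """Seven modeled instructions: four mtvsrdd, two vpkudum, one vpkuwum."""
    a, b, c, d, e, f, g, h = gprs
    abcd = vpkudum(mtvsrdd(a, b), mtvsrdd(c, d))  # u a u b u c u d
    efgh = vpkudum(mtvsrdd(e, f), mtvsrdd(g, h))  # u e u f u g u h
    return vpkuwum(abcd, efgh)                    # a b c d e f g h

# The upper 48 bits of each GPR hold garbage, standing in for "u" (undef).
gprs = [0xDEADBEEFCAFE0000 | n for n in range(1, 9)]
result = build_v8i16(gprs)
```

In this model the garbage upper bits never reach the result, because each pack step keeps only the low half of every source element.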

Event Timeline

qiucf created this revision. Jan 11 2021, 9:24 PM
qiucf requested review of this revision. Jan 11 2021, 9:24 PM
Herald added a project: Restricted Project. Jan 11 2021, 9:24 PM

If all the values are in GPRs, the code produced with this patch:

mtvsrdd 34, 4, 3
mtvsrdd 35, 6, 5
vpkudum 2, 3, 2
mtvsrdd 35, 8, 7
mtvsrdd 36, 10, 9
vpkudum 3, 4, 3
vpkuwum 2, 3, 2

is certainly better than the naive code we currently produce. However, I don't think we should do the merging/packing in the vector domain: at least on P9, we get half the dispatch width, and the permute operations potentially have a higher latency. This approach may also increase vector register pressure, which is probably not ideal. For the basic case (where all values are in GPRs), I think we should simply add a pattern in the .td file that does something like this (similar to what we did for the wider elements):

rlwimi 3, 4, ...  # merge r3 and r4
rlwimi 5, 6, ...  # merge r5 and r6
rlwimi 7, 8, ...  # merge r7 and r8
rlwimi 9, 10, ... # merge r9 and r10
rldimi 3, 5, ...  # merge r3, r4, r5, r6
rldimi 7, 9, ...  # merge r7, r8, r9, r10
mtvsrdd 34, 3, 7

For 32-bit mode, we can't really merge to doublewords in GPRs, but I think the values can be moved to VSRs after the word merges and then merged with a single vpkuwum.
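As a rough illustration of the 64-bit GPR merging sequence suggested above, here is a Python model of the data flow. It is a sketch under stated assumptions: `insert_field` is a hypothetical helper standing in for the rotate-and-insert behavior of rlwimi/rldimi, the shift and width values are chosen only for this illustration (the operands elided in the listing above are not reconstructed here), and the element order assumes big-endian numbering.

```python
MASK64 = (1 << 64) - 1

def insert_field(dst, src, shift, width):
    """Hypothetical helper: replace bits [shift, shift + width) of dst
    with the low `width` bits of src, leaving the other bits of dst alone."""
    mask = ((1 << width) - 1) << shift
    return ((dst & ~mask) | ((src << shift) & mask)) & MASK64

def build_v8i16_in_gprs(gprs):
    """Seven modeled operations: four halfword merges, two word merges, one move."""
    a, b, c, d, e, f, g, h = gprs
    # Four halfword merges (the role rlwimi plays in the suggestion).
    ab = insert_field(b, a, 16, 16)
    cd = insert_field(d, c, 16, 16)
    ef = insert_field(f, e, 16, 16)
    gh = insert_field(h, g, 16, 16)
    # Two word merges (the role rldimi plays); these also overwrite any
    # garbage left in the upper 32 bits of the destination.
    abcd = insert_field(cd, ab, 32, 32)
    efgh = insert_field(gh, ef, 32, 32)
    # One move into the vector register (mtvsrdd): two doublewords.
    return [abcd, efgh]

def halfwords(v):
    """View a vector (two doublewords) as eight 16-bit elements."""
    return [(dw >> shift) & 0xFFFF for dw in v for shift in (48, 32, 16, 0)]

# The upper 48 bits of each GPR hold garbage, standing in for undef bits.
gprs = [0xDEADBEEFCAFE0000 | n for n in range(1, 9)]
vec = build_v8i16_in_gprs(gprs)
elements = halfwords(vec)
```

The operation count matches the vector-domain sequence (seven in both cases), but in this model only the final move touches a vector register, which is the point of the suggestion.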