This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Change VSRpRC allocation order
ClosedPublic

Authored by qiucf on Jun 24 2021, 6:19 AM.

Details

Reviewers
nemanjai
shchenz
jsji
Group Reviewers
Restricted Project
Commits
rGa08fc1361aa3: [PowerPC] Change VSRpRC allocation order
Summary

On PowerPC, VSRpRC represents pairs of even and odd VSX registers ((vs0, vs1), (vs2, vs3), ...). VRRC covers the latter 32 registers of VSRC. So in some cases, when incoming VRRC arguments are handled through VSRpRC, extra copies are produced.

This patch changes the allocation order of VSRpRC to put the pairs formed from the latter 32 registers first. This should not affect ACC/UACC, since those are overlaid only on the first 32 VSX registers.
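For illustration, the change amounts to listing the pairs backed by the latter 32 (Altivec) registers ahead of the low VSX pairs in PPCRegisterInfo.td. A minimal sketch, assuming the pair registers are named VSRp0-VSRp31 as in the PPC backend (not the exact committed diff):

def VSRpRC : RegisterClass<"PPC", [v256i1], 128,
                           (add (sequence "VSRp%u", 16, 31),   // pairs over vs32-vs63 (Altivec)
                                (sequence "VSRp%u", 0, 15))>;  // pairs over vs0-vs31 (overlap ACC/UACC)

Since ACC/UACC are overlaid only on vs0-vs31, listing the Altivec-backed pairs first keeps the accumulator registers free for longer.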

Diff Detail

Event Timeline

qiucf created this revision. Jun 24 2021, 6:19 AM
qiucf requested review of this revision. Jun 24 2021, 6:19 AM
Herald added a project: Restricted Project. Jun 24 2021, 6:19 AM
nemanjai accepted this revision. Jun 24 2021, 7:03 AM

Thanks for fixing this. I think we should go ahead with this fix as it is certainly an improvement. However, there is still some work to do here. I think it would be useful to test a change where we allocate all volatile registers first for VSFRC, VSSRC, VSRC, VSRpRC (namely, we allocate vs34-vs37, vs32-vs33, vs38-vs51, vs0-vs13, vs63-vs52, vs31-vs14). Of course, the relative order of the low 32 vs. high 32 might need to be different for the different register classes, but in any case, perhaps go through all the volatiles before we dig into the non-volatiles.
Of course, it would be good if the register coalescer was smart enough that the order doesn't matter, but this is certainly an easier fix.
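For illustration, a volatile-first order for one of these classes, say VSFRC, might look roughly like the sketch below. This is illustrative only; the register ranges follow the list above, and the actual value types and register names should match whatever PPCRegisterInfo.td already defines:

def VSFRC : RegisterClass<"PPC", [f64], 64,
                          (add (sequence "VF%u", 2, 5),      // vs34-vs37 (volatile)
                               VF0, VF1,                     // vs32-vs33 (volatile)
                               (sequence "VF%u", 6, 19),     // vs38-vs51 (volatile)
                               (sequence "F%u", 0, 13),      // vs0-vs13  (volatile)
                               (sequence "VF%u", 31, 20),    // vs63-vs52 (non-volatile)
                               (sequence "F%u", 31, 14))>;   // vs31-vs14 (non-volatile)

The same idea would apply to VSSRC, VSRC and VSRpRC, with the low/high split adjusted per class as noted above.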

In any case, this particular change can go in after the nits are addressed.

llvm/lib/Target/PowerPC/PPCRegisterInfo.td
478

Please explain the strange allocation order with something like:

// Allocate in the same order as the underlying VSX/Altivec registers,
// placing the Altivec ones first in order to reduce interference
// with accumulator registers (which are overlaid over the 32 low
// VSX registers). This will reduce the number of copies when loading
// accumulator registers, which is a very common use for paired
// VSX registers.
llvm/test/CodeGen/PowerPC/mma-outer-product.ll
81

Any idea why this sticks around? This and line 85 seem to be dead copies. Similar artifacts appear in other tests.

llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
13–32

This shows a side-effect of this change that should be mentioned in the commit message and maybe the code. Namely, if we end up allocating non-volatile registers, the first 12 underlying registers we allocate are 16 bytes rather than 8 bytes (since they're non-volatile Altivec registers rather than FP registers). This increases the size of stack frames for such functions.
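For a rough sense of the cost: saving 12 non-volatile Altivec registers takes 12 × 16 = 192 bytes of callee-save area (plus any alignment padding for the vector save area), versus 12 × 8 = 96 bytes for 12 non-volatile FP registers.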

This revision is now accepted and ready to land. Jun 24 2021, 7:03 AM
jsji added inline comments. Jun 24 2021, 8:07 AM
llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
13–32

Maybe we should provide an alternative allocation order if the user cares about stack frame size? Or even better, if we can detect the usage of ACC or other potential conflicts before choosing this alternative allocation order? But yeah, that can be a FIXME and a follow-up.
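For illustration, TableGen's AltOrders/AltOrderSelect hooks could express such an alternative order. The sketch below is hypothetical, and the selection predicate in particular is just a placeholder for whatever ACC-usage heuristic we would end up with:

def VSRpRC : RegisterClass<"PPC", [v256i1], 128,
                           (add (sequence "VSRp%u", 16, 31),
                                (sequence "VSRp%u", 0, 15))> {
  // Alternative order that prefers the low (ACC/UACC-overlapping) pairs.
  let AltOrders = [(add (sequence "VSRp%u", 0, 15),
                        (sequence "VSRp%u", 16, 31))];
  // Hypothetical selection: fall back to the low-pairs-first order when MMA
  // (and hence ACC/UACC) is not available on the subtarget.
  let AltOrderSelect = [{
    return MF.getSubtarget<PPCSubtarget>().hasMMA() ? 0 : 1;
  }];
}

Detecting actual ACC/UACC usage per function would need something smarter than a subtarget check, so this is at best a starting point.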

shchenz added inline comments. Jun 24 2021, 7:02 PM
llvm/test/CodeGen/PowerPC/mma-intrinsics.ll
27–31

Why do we need to copy to v3? I think we could also use v2 for the following 4 xxlor instructions?

llvm/test/CodeGen/PowerPC/mma-outer-product.ll
81

+1, there are some unnecessary vmr instructions after this change. Maybe we need to have a look at why. Before the change, there were no such vector copies.

llvm/test/CodeGen/PowerPC/paired-vector-intrinsics.ll
23–25

unnecessary vmr?

This revision was automatically updated to reflect the committed changes.
qiucf marked an inline comment as done. Jun 27 2021, 10:54 PM
qiucf added inline comments.
llvm/test/CodeGen/PowerPC/mma-intrinsics.ll
27–31

Yes. This extra copy comes from the MIR below:

undef %5.sub_vsx1:vsrprc = COPY $v2 <--
%0:g8rc_and_g8rc_nox0 = COPY $x3
%5.sub_vsx0:vsrprc = COPY %5.sub_vsx1:vsrprc
%5:vsrprc = KILL_PAIR %5:vsrprc(tied-def 0)
undef %8.sub_pair0:uaccrc = COPY %5:vsrprc
%8.sub_pair1:uaccrc = COPY %5:vsrprc
%11:accrc = BUILD_UACC %8:uaccrc
%11:accrc = XXMTACC %11:accrc(tied-def 0)
%11:accrc = XXMFACC %11:accrc(tied-def 0)

Here the copy cannot be eliminated, since we specified that the assigned register is sub_vsx1 (a VSX register with an odd number). We need to fix how MMA intrinsics are expanded.

llvm/test/CodeGen/PowerPC/mma-outer-product.ll
81

This copy is not due to this patch; it was just scheduled earlier. pmxvf64gernn acc0, vsp32, v2, 0, 0 uses it. But yes, it can be eliminated. I'll follow up on it.

llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
13–32

Yes. Maybe we can provide an option to control it first and then work on resolving it properly.