Page MenuHomePhabricator

[PowerPC] Improve float vector gather codegen
ClosedPublic

Authored by kamaub on Jun 5 2019, 7:55 AM.

Details

Summary

This patch aims to improve the code generation for float vector gather on POWER9. Patterns have been implemented to utilize instructions that deliver improved performance. This decreases overall latency from 16 to 12 cycles.

  • Before Patch
lfs 0, 0(3)
lfs 2, 0(5)
lfs 1, 0(4)
xxmrghd 0, 2, 0
lfs 3, 0(6)
xvcvdpsp 34, 0
xxmrghd 0, 3, 1
xvcvdpsp 35, 0
vmrgew 2, 3, 2
  • After Patch (using POWER9 instructions)
lfiwzx 0, 0, 6
lfiwzx 1, 0, 5
xxmrghw 0, 0, 1
lfiwzx 1, 0, 4
lfiwzx 2, 0, 3
xxmrghw 1, 1, 2
xxmrgld 34, 0, 1

Diff Detail

Event Timeline

kamaub created this revision.Jun 5 2019, 7:55 AM
kamaub edited the summary of this revision. (Show Details)Jun 6 2019, 12:27 PM
jsji added a reviewer: Restricted Project.Aug 28 2019, 11:57 AM
stefanp accepted this revision.Aug 30 2019, 2:11 PM
stefanp added a subscriber: stefanp.

Overall this looks good. I have a couple of very minor comments that can be fixed on commit.
LGTM.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4179

nit:
Line length.

llvm/test/CodeGen/PowerPC/float-vector-gather.ll
4

Please add a note at the start here to say what you are testing.

This revision is now accepted and ready to land.Aug 30 2019, 2:11 PM
jsji requested changes to this revision.Aug 30 2019, 8:57 PM

My understanding is that the cycle saving come from avoiding unnecessary SP->DP, then DP->SP conversion.
But not the difference merge sequence.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
2539

Why these dag belongs to AlignValues? Why not MrgFP?

4174–4185

What about BigEndian?

4178

What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then vmrgew?

vmrgew is 2 cycle ALU instruction, should still be better than 3 cycler xxpermdi here.

llvm/test/CodeGen/PowerPC/float-vector-gather.ll
3

Add test for Big endian as well please.

This revision now requires changes to proceed.Aug 30 2019, 8:57 PM
kamaub removed a reviewer: adalava.
nemanjai requested changes to this revision.Sep 17 2019, 10:05 AM

Requesting changes because there is no BE support.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4174–4185

Yes, by all means, we need BE support as well.

4178

We should favour the larger register file available to XXPERMDI here rather than VMRGEW.
Besides, where does the information about XXPERMDI taking 3 cycles come from? It is not listed in the UM and a similar instruction (XXSEL is a 2 cycle ALU instruction as well).

jsji added inline comments.Sep 17 2019, 10:27 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

Yes, if the cycles is the same, then we should favor larger reg files.
But looks like they are not the same to me.

Unfortunately, UM is missing detail cycle information about xxpermdi,
but since it is a permute instruction, all PM instructions are 3 cycles.
xxsel is *ALU* instruction, hence it is 2 cycles.

And we are modeling that in our scheduling info as well.

$ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
// Three Cycle PM operation. Only one PM unit per superslice so we use the whole
// superslice. That includes both exec pipelines (EXECO, EXECE) and one
// dispatch.
def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
nemanjai added inline comments.Sep 24 2019, 3:14 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler.

However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.

In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.

jsji added inline comments.Sep 24 2019, 7:43 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

But I do not think it is a given

Yes, agree, we shouldn't. I will try to confirm with HW guys.

especially since PM instructions typically range in latency between 2 and 3 cycles.

As far as I know, all PM instructions are 3 cycles, I never saw a 2 cycle one, did you? If so, let me know, I should fix it in scheduler model.

In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.

Yes, likely...
It depends on the reg pressure and how likely we may increase reg dependency or cause additional reg spill.
It might not be a good choice most of the time when the reg pressure is not that big.

Anyhow, I believe you already have some example and important performance data in mind to support your argument.

But can we at least add some comments here to describe why we make such choice here -- why we prefer
access to a wider register set than shaving 1 cycle here.

jsji added inline comments.Sep 24 2019, 8:42 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given...

Yes, agree, we shouldn't. I will try to confirm with HW guys.

Confirmed with HW team, comparing to vmrgew, xxpermdi is PM-routed and therefore has the longer latency

nemanjai added inline comments.Sep 25 2019, 4:59 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

I think this discussion was quite valuable, so thanks for bringing it up.
And I do agree that we should add a comment in the code. Perhaps the following comment:

// Using VMRGEW to assemble the final vector would be a lower latency
// solution. However, we choose to go with the slightly higher latency
// XXPERMDI for 2 reasons:
// 1. This is likely to occur in unrolled loops where regpressure is high, so we
//    want to use the latter as it has access to all 64 VSX registers.
// 2. Using Altivec instructions in this sequence would likely cause the
//    allocation of Altivec registers even for the loads which in turn would
//    force the use of LXSIWZX for the loads, adding a cycle of latency to
//    each of the loads which would otherwise be able to use LFIWZX.
jsji added inline comments.Sep 25 2019, 7:36 AM
llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178

The comments looks great. Thanks!

kamaub updated this revision to Diff 225422.Oct 17 2019, 6:46 AM
kamaub edited the summary of this revision. (Show Details)

Updating the patch to improve float vector gather codegen

This patch updates how the improve float vector gather patch works to be more
readable and support Big Endian cases.

kamaub marked 14 inline comments as done.EditedOct 18 2019, 7:42 AM

Thank you all for your comments, I have addressed them.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4179

I'm sorry, not sure where the problem is, I'm fairly certain this is less than 80 characters

jsji accepted this revision as: jsji.Oct 18 2019, 2:59 PM

LGTM. It would be better to pre-commit the testcase and rebase, so that the diff show only the change due to this patch.

amyk added a subscriber: amyk.Oct 18 2019, 3:31 PM
amyk added inline comments.
llvm/test/CodeGen/PowerPC/float-vector-gather.ll
12

I think C code from which this IR test case was generated from would sound more clear.

nemanjai accepted this revision.Oct 21 2019, 5:19 PM

LGTM. +1 for pre-committing the test case.

llvm/test/CodeGen/PowerPC/float-vector-gather.ll
12

I don't agree - let's not change it to end the sentence on a preposition :). But do add the missing letter and change generate to generated.

This revision is now accepted and ready to land.Oct 21 2019, 5:19 PM
kamaub updated this revision to Diff 226482.Oct 25 2019, 12:26 PM
  • Minor spelling correction to test case comment
kamaub marked 2 inline comments as done.Oct 25 2019, 12:27 PM

Thank you, comment addressed.

LGTM. It would be better to pre-commit the testcase and rebase, so that the diff show only the change due to this patch.

LGTM. +1 for pre-committing the test case.

Thank you @jsji and @nemanjai I have made a pre-commit testcase and attached it as a parent differential, I will rebase this patch shortly

kamaub marked an inline comment as done.Oct 25 2019, 12:30 PM
kamaub updated this revision to Diff 229620.Nov 15 2019, 12:47 PM
  • Minor spelling change to test case.
kamaub updated this revision to Diff 229897.EditedNov 18 2019, 12:26 PM
-Rebasing this patch to reflect the pre-commiting of the test case to master.
This revision was automatically updated to reflect the committed changes.