Details

Reviewers

power-llvm-team
hfinkel
echristo
stefanp
jsji
nemanjai

Group Reviewers

Restricted Project

Commits

rG6512473ceef2: [PowerPC] Improve float vector gather codegen

Summary

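This patch improves code generation for float vector gather on POWER9. New patterns load each element as a 32-bit word with lfiwzx and assemble the vector with xxmrghw and xxmrgld, instead of loading with lfs, converting back to single precision with xvcvdpsp, and merging with vmrgew. This decreases overall latency from 16 to 12 cycles.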
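For context, source code of the kind that reaches these patterns can be sketched in C as below. This is an illustration only, not code from the patch: the `v4f` typedef is a hypothetical name, and the GCC/Clang `vector_size` extension is used so the sketch compiles on any target. The function name mirrors the one in the IR test case.

```c
/* Sketch only: a four-element float gather. The typedef name v4f is
   hypothetical; vector_size is a GCC/Clang extension. */
typedef float v4f __attribute__((vector_size(16)));

v4f vector_gatherf(const float *a, const float *b,
                   const float *c, const float *d) {
  /* Every element comes from its own address, so the vector is built
     from four independent scalar loads. */
  v4f v = { *a, *b, *c, *d };
  return v;
}
```

Compiled for a POWER9 target, a function of this shape is expected to produce four scalar loads followed by a short merge sequence, which is the code the listings in this summary compare.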

Before Patch

lfs 0, 0(3)
lfs 2, 0(5)
lfs 1, 0(4)
xxmrghd 0, 2, 0
lfs 3, 0(6)
xvcvdpsp 34, 0
xxmrghd 0, 3, 1
xvcvdpsp 35, 0
vmrgew 2, 3, 2
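The data flow of the sequence above can be modeled in a few lines of C. This is a rough sketch based on my reading of the instruction semantics, not code from the patch: the type and function names are hypothetical, elements are numbered in big-endian order (index 0 is the most significant), and lanes that are architecturally undefined are simply modeled as zero.

```c
/* Hypothetical register models: d[0] / w[0] are element 0 in
   big-endian numbering. */
typedef struct { double d[2]; } v2f64;
typedef struct { float  w[4]; } v4f32;

/* lfs: load a float and widen it to double in doubleword 0. */
static v2f64 lfs(const float *p) {
  v2f64 r = {{ (double)*p, 0.0 }};
  return r;
}

/* xxmrghd: doubleword 0 of each source. */
static v2f64 xxmrghd(v2f64 a, v2f64 b) {
  v2f64 r = {{ a.d[0], b.d[0] }};
  return r;
}

/* xvcvdpsp: narrow both doubles; results land in words 0 and 2.
   Words 1 and 3 are modeled as zero (an assumption of this sketch). */
static v4f32 xvcvdpsp(v2f64 a) {
  v4f32 r = {{ (float)a.d[0], 0.0f, (float)a.d[1], 0.0f }};
  return r;
}

/* vmrgew: interleave the even words of the two sources. */
static v4f32 vmrgew(v4f32 a, v4f32 b) {
  v4f32 r = {{ a.w[0], b.w[0], a.w[2], b.w[2] }};
  return r;
}

/* Mirrors the little-endian listing: r3..r6 hold a..d. */
static v4f32 gather_before_le(const float *a, const float *b,
                              const float *c, const float *d) {
  v4f32 ca = xvcvdpsp(xxmrghd(lfs(c), lfs(a)));  /* {c, -, a, -} */
  v4f32 db = xvcvdpsp(xxmrghd(lfs(d), lfs(b)));  /* {d, -, b, -} */
  return vmrgew(db, ca);                         /* {d, c, b, a} */
}
```

In big-endian numbering the result is {d, c, b, a}, which is {a, b, c, d} in little-endian element order. Each value is widened to double by the load and narrowed again by xvcvdpsp, which is the round trip the patch removes.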

After Patch (using POWER9 instructions)

lfiwzx 0, 0, 6
lfiwzx 1, 0, 5
xxmrghw 0, 0, 1
lfiwzx 1, 0, 4
lfiwzx 2, 0, 3
xxmrghw 1, 1, 2
xxmrgld 34, 0, 1
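To see why the final merge takes the low doublewords, the new sequence can be modeled in C as well. Again this is only a sketch under stated assumptions: `vsr_t` and the helper names are hypothetical, words are numbered in big-endian order, and doubleword 1 after the load is modeled as zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit VSR: w[0] is word 0 (most
   significant) in big-endian numbering. */
typedef struct { uint32_t w[4]; } vsr_t;

/* lfiwzx: load a 32-bit word zero-extended into doubleword 0, so the
   raw float bits sit in word 1. Doubleword 1 is modeled as zero. */
static vsr_t lfiwzx(const float *p) {
  vsr_t r = {{ 0, 0, 0, 0 }};
  memcpy(&r.w[1], p, sizeof(uint32_t));
  return r;
}

/* xxmrghw: interleave words 0 and 1 of the two sources. */
static vsr_t xxmrghw(vsr_t a, vsr_t b) {
  vsr_t r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
  return r;
}

/* xxmrgld (xxpermdi with DM = 3): doubleword 1 of each source. */
static vsr_t xxmrgld(vsr_t a, vsr_t b) {
  vsr_t r = {{ a.w[2], a.w[3], b.w[2], b.w[3] }};
  return r;
}

/* Mirrors the little-endian listing: r3..r6 hold a..d. */
static vsr_t gather_after_le(const float *a, const float *b,
                             const float *c, const float *d) {
  vsr_t dc = xxmrghw(lfiwzx(d), lfiwzx(c));  /* {0, 0, d, c} */
  vsr_t ba = xxmrghw(lfiwzx(b), lfiwzx(a));  /* {0, 0, b, a} */
  return xxmrgld(dc, ba);                    /* {d, c, b, a} */
}
```

After xxmrghw the two useful words sit in doubleword 1 of each intermediate register, so selecting doubleword 1 from both yields the full vector. No floating-point conversion occurs at any step; the float bit patterns are only moved.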

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kamaub created this revision. Jun 5 2019, 7:55 AM

Herald added subscribers: llvm-commits, jsji, kbarton and 2 others. Jun 5 2019, 7:55 AM

kamaub edited the summary of this revision. Jun 6 2019, 12:27 PM

jsji added a reviewer: Restricted Project. Aug 28 2019, 11:57 AM

Herald added subscribers: shchenz, wuzish, MaskRay. Aug 28 2019, 11:57 AM

Overall this looks good. I have a couple of very minor comments that can be fixed on commit.
LGTM.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4179	nit: Line length.
llvm/test/CodeGen/PowerPC/float-vector-gather.ll
4	Please add a note at the start here to say what you are testing.

This revision is now accepted and ready to land. Aug 30 2019, 2:11 PM

My understanding is that the cycle savings come from avoiding the unnecessary SP->DP and then DP->SP conversions,
not from the different merge sequence.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
2539	Why do these dags belong to `AlignValues`? Why not `MrgFP`?
4174–4185	What about BigEndian?
4178	What is the benefit of merging 'AB', 'CD', instead of the original 'AC', 'BD' followed by `vmrgew`? `vmrgew` is a 2-cycle ALU instruction, so it should still be better than the 3-cycle `xxpermdi` here.
llvm/test/CodeGen/PowerPC/float-vector-gather.ll
3	Add test for Big endian as well please.

This revision now requires changes to proceed. Aug 30 2019, 8:57 PM

kamaub added a reviewer: adalava. Sep 17 2019, 8:52 AM

kamaub removed a reviewer: adalava.

Requesting changes because there is no BE support.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4174–4185	Yes, by all means, we need BE support as well.
4178	We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`. Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM, and a similar instruction (`XXSEL`) is a 2 cycle ALU instruction as well.

jsji added inline comments. Sep 17 2019, 10:27 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178	Yes, if the cycle counts are the same, then we should favor the larger register file. But they do not look the same to me. Unfortunately, the UM is missing detailed cycle information about `xxpermdi`, but it is a permute instruction, and all PM instructions are 3 cycles. `xxsel` is an ALU instruction, hence it is 2 cycles. We model that in our scheduling info as well:
$ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100 | grep def -B 3
// Three Cycle PM operation. Only one PM unit per superslice so we use the whole
// superslice. That includes both exec pipelines (EXECO, EXECE) and one
// dispatch.
def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],

nemanjai added inline comments. Sep 24 2019, 3:14 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178	I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler. However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more. In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.

jsji added inline comments. Sep 24 2019, 7:43 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178	> But I do not think it is a given
Yes, agree, we shouldn't. I will try to confirm with HW guys.
> especially since PM instructions typically range in latency between 2 and 3 cycles.
As far as I know, all PM instructions are 3 cycles; I never saw a 2 cycle one, did you? If so, let me know, and I will fix it in the scheduler model.
> In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
Yes, likely... It depends on the register pressure and on how likely we are to increase register dependencies or cause additional spills. It might not be a good choice most of the time, when the register pressure is not that high. Anyhow, I believe you already have some examples and important performance data in mind to support your argument. But can we at least add a comment here to describe why we make this choice: why we prefer access to a wider register set over shaving 1 cycle?

jsji added inline comments. Sep 24 2019, 8:42 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178	>> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given...
> Yes, agree, we shouldn't. I will try to confirm with HW guys.
Confirmed with the HW team: compared to `vmrgew`, `xxpermdi is PM-routed and therefore has the longer latency`.

nemanjai added inline comments. Sep 25 2019, 4:59 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td

4178

I think this discussion was quite valuable, so thanks for bringing it up.
And I do agree that we should add a comment in the code. Perhaps the following comment:

// Using VMRGEW to assemble the final vector would be a lower latency
// solution. However, we choose to go with the slightly higher latency
// XXPERMDI for 2 reasons:
// 1. This is likely to occur in unrolled loops where regpressure is high, so we
//    want to use the latter as it has access to all 64 VSX registers.
// 2. Using Altivec instructions in this sequence would likely cause the
//    allocation of Altivec registers even for the loads which in turn would
//    force the use of LXSIWZX for the loads, adding a cycle of latency to
//    each of the loads which would otherwise be able to use LFIWZX.

jsji added inline comments. Sep 25 2019, 7:36 AM

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4178	The comment looks great. Thanks!

Updating the patch to improve float vector gather codegen

This update makes the float vector gather patterns more readable and adds
support for Big Endian.

Harbormaster completed remote builds in B39728: Diff 225422. Oct 17 2019, 6:52 AM

Thank you all for your comments, I have addressed them.

llvm/lib/Target/PowerPC/PPCInstrVSX.td
4179	I'm sorry, not sure where the problem is, I'm fairly certain this is less than 80 characters

LGTM. It would be better to pre-commit the test case and rebase, so that the diff shows only the change due to this patch.

amyk added a subscriber: amyk. Oct 18 2019, 3:31 PM

amyk added inline comments.

llvm/test/CodeGen/PowerPC/float-vector-gather.ll
12	I think `C code from which this IR test case was generated from` would sound more clear.

LGTM. +1 for pre-committing the test case.

llvm/test/CodeGen/PowerPC/float-vector-gather.ll
12	I don't agree - let's not change it to end the sentence on a preposition :). But do add the missing letter and change `generate` to `generated`.

This revision is now accepted and ready to land. Oct 21 2019, 5:19 PM

kamaub added a parent revision: D69443: [PowerPC] Test case for vector float gather on ppc64le and ppc64. Oct 25 2019, 12:21 PM

Minor spelling correction to test case comment

Thank you, comment addressed.

In D62908#1715211, @jsji wrote:

LGTM. It would be better to pre-commit the test case and rebase, so that the diff shows only the change due to this patch.

In D62908#1717475, @nemanjai wrote:

LGTM. +1 for pre-committing the test case.

Thank you @jsji and @nemanjai. I have made a pre-commit test case and attached it as a parent differential; I will rebase this patch shortly.

Harbormaster completed remote builds in B40074: Diff 226482. Oct 25 2019, 12:29 PM

kamaub marked an inline comment as done. Oct 25 2019, 12:30 PM

Minor spelling change to test case.

Harbormaster completed remote builds in B41054: Diff 229620. Nov 15 2019, 12:54 PM

Rebasing this patch to reflect the pre-committing of the test case to master.

Harbormaster completed remote builds in B41131: Diff 229897. Nov 18 2019, 12:30 PM

Closed by commit rG6512473ceef2: [PowerPC] Improve float vector gather codegen (authored by stefanp). Nov 18 2019, 1:55 PM

This revision was automatically updated to reflect the committed changes.

Diff 229919

llvm/lib/Target/PowerPC/PPCInstrVSX.td

[... 2,530 lines not shown ...]
// Materialize a zero-vector of long long		// Materialize a zero-vector of long long
def : Pat<(v2i64 immAllZerosV),		def : Pat<(v2i64 immAllZerosV),
(v2i64 (XXLXORz))>;		(v2i64 (XXLXORz))>;
}		}

def AlignValues {		def AlignValues {
dag F32_TO_BE_WORD1 = (v4f32 (XXSLDWI (XSCVDPSPN $B), (XSCVDPSPN $B), 3));		dag F32_TO_BE_WORD1 = (v4f32 (XXSLDWI (XSCVDPSPN $B), (XSCVDPSPN $B), 3));
dag I32_TO_BE_WORD1 = (COPY_TO_REGCLASS (MTVSRWZ $B), VSRC);		dag I32_TO_BE_WORD1 = (COPY_TO_REGCLASS (MTVSRWZ $B), VSRC);
}		}
		jsji: Why these dag belongs to `AlignValues`? Why not `MrgFP`?

// The following VSX instructions were introduced in Power ISA 3.0		// The following VSX instructions were introduced in Power ISA 3.0
def HasP9Vector : Predicate<"PPCSubTarget->hasP9Vector()">;		def HasP9Vector : Predicate<"PPCSubTarget->hasP9Vector()">;
let AddedComplexity = 400, Predicates = [HasP9Vector] in {		let AddedComplexity = 400, Predicates = [HasP9Vector] in {

// [PO VRT XO VRB XO /]		// [PO VRT XO VRB XO /]
class X_VT5_XO5_VB5<bits<6> opcode, bits<5> xo2, bits<10> xo, string opc,		class X_VT5_XO5_VB5<bits<6> opcode, bits<5> xo2, bits<10> xo, string opc,
list<dag> pattern>		list<dag> pattern>
[... 1,371 lines not shown ...]
}		}
def DblToLongLoad {		def DblToLongLoad {
dag A = (i64 (PPCmfvsr (PPCfctidz (f64 (load xoaddr:$A)))));		dag A = (i64 (PPCmfvsr (PPCfctidz (f64 (load xoaddr:$A)))));
}		}
def DblToULongLoad {		def DblToULongLoad {
dag A = (i64 (PPCmfvsr (PPCfctiduz (f64 (load xoaddr:$A)))));		dag A = (i64 (PPCmfvsr (PPCfctiduz (f64 (load xoaddr:$A)))));
}		}

		// FP load dags (for f32 -> v4f32)
		def LoadFP {
		dag A = (f32 (load xoaddr:$A));
		dag B = (f32 (load xoaddr:$B));
		dag C = (f32 (load xoaddr:$C));
		dag D = (f32 (load xoaddr:$D));
		}

// FP merge dags (for f32 -> v4f32)		// FP merge dags (for f32 -> v4f32)
def MrgFP {		def MrgFP {
		dag LD32A = (COPY_TO_REGCLASS (LIWZX xoaddr:$A), VSRC);
		dag LD32B = (COPY_TO_REGCLASS (LIWZX xoaddr:$B), VSRC);
		dag LD32C = (COPY_TO_REGCLASS (LIWZX xoaddr:$C), VSRC);
		dag LD32D = (COPY_TO_REGCLASS (LIWZX xoaddr:$D), VSRC);
dag AC = (XVCVDPSP (XXPERMDI (COPY_TO_REGCLASS $A, VSRC),		dag AC = (XVCVDPSP (XXPERMDI (COPY_TO_REGCLASS $A, VSRC),
(COPY_TO_REGCLASS $C, VSRC), 0));		(COPY_TO_REGCLASS $C, VSRC), 0));
dag BD = (XVCVDPSP (XXPERMDI (COPY_TO_REGCLASS $B, VSRC),		dag BD = (XVCVDPSP (XXPERMDI (COPY_TO_REGCLASS $B, VSRC),
(COPY_TO_REGCLASS $D, VSRC), 0));		(COPY_TO_REGCLASS $D, VSRC), 0));
dag ABhToFlt = (XVCVDPSP (XXPERMDI $A, $B, 0));		dag ABhToFlt = (XVCVDPSP (XXPERMDI $A, $B, 0));
dag ABlToFlt = (XVCVDPSP (XXPERMDI $A, $B, 3));		dag ABlToFlt = (XVCVDPSP (XXPERMDI $A, $B, 3));
dag BAhToFlt = (XVCVDPSP (XXPERMDI $B, $A, 0));		dag BAhToFlt = (XVCVDPSP (XXPERMDI $B, $A, 0));
dag BAlToFlt = (XVCVDPSP (XXPERMDI $B, $A, 3));		dag BAlToFlt = (XVCVDPSP (XXPERMDI $B, $A, 3));
[... 139 lines not shown ...]	let AddedComplexity = 400 in {
}		}

// Big endian, available on all targets with VSX		// Big endian, available on all targets with VSX
let Predicates = [IsBigEndian, HasVSX] in {		let Predicates = [IsBigEndian, HasVSX] in {
def : Pat<(v2f64 (build_vector f64:$A, f64:$B)),		def : Pat<(v2f64 (build_vector f64:$A, f64:$B)),
(v2f64 (XXPERMDI		(v2f64 (XXPERMDI
(COPY_TO_REGCLASS $A, VSRC),		(COPY_TO_REGCLASS $A, VSRC),
(COPY_TO_REGCLASS $B, VSRC), 0))>;		(COPY_TO_REGCLASS $B, VSRC), 0))>;
		// Using VMRGEW to assemble the final vector would be a lower latency
		// solution. However, we choose to go with the slightly higher latency
		// XXPERMDI for 2 reasons:
		// 1. This is likely to occur in unrolled loops where regpressure is high,
		// so we want to use the latter as it has access to all 64 VSX registers.
		// 2. Using Altivec instructions in this sequence would likely cause the
		// allocation of Altivec registers even for the loads which in turn would
		// force the use of LXSIWZX for the loads, adding a cycle of latency to
		// each of the loads which would otherwise be able to use LFIWZX.
		def : Pat<(v4f32 (build_vector LoadFP.A, LoadFP.B, LoadFP.C, LoadFP.D)),
		(v4f32 (XXPERMDI (XXMRGHW MrgFP.LD32A, MrgFP.LD32B),
		(XXMRGHW MrgFP.LD32C, MrgFP.LD32D), 3))>;
def : Pat<(v4f32 (build_vector f32:$A, f32:$B, f32:$C, f32:$D)),		def : Pat<(v4f32 (build_vector f32:$A, f32:$B, f32:$C, f32:$D)),
(VMRGEW MrgFP.AC, MrgFP.BD)>;		(VMRGEW MrgFP.AC, MrgFP.BD)>;
def : Pat<(v4f32 (build_vector DblToFlt.A0, DblToFlt.A1,		def : Pat<(v4f32 (build_vector DblToFlt.A0, DblToFlt.A1,
DblToFlt.B0, DblToFlt.B1)),		DblToFlt.B0, DblToFlt.B1)),
(v4f32 (VMRGEW MrgFP.ABhToFlt, MrgFP.ABlToFlt))>;		(v4f32 (VMRGEW MrgFP.ABhToFlt, MrgFP.ABlToFlt))>;

// Convert 4 doubles to a vector of ints.		// Convert 4 doubles to a vector of ints.
def : Pat<(v4i32 (build_vector DblToInt.A, DblToInt.B,		def : Pat<(v4i32 (build_vector DblToInt.A, DblToInt.B,
[... 50 lines not shown ...]	let AddedComplexity = 400 in {
}		}

let Predicates = [IsLittleEndian, HasVSX] in {		let Predicates = [IsLittleEndian, HasVSX] in {
// Little endian, available on all targets with VSX		// Little endian, available on all targets with VSX
def : Pat<(v2f64 (build_vector f64:$A, f64:$B)),		def : Pat<(v2f64 (build_vector f64:$A, f64:$B)),
(v2f64 (XXPERMDI		(v2f64 (XXPERMDI
(COPY_TO_REGCLASS $B, VSRC),		(COPY_TO_REGCLASS $B, VSRC),
(COPY_TO_REGCLASS $A, VSRC), 0))>;		(COPY_TO_REGCLASS $A, VSRC), 0))>;
		// Using VMRGEW to assemble the final vector would be a lower latency
		// solution. However, we choose to go with the slightly higher latency
		// XXPERMDI for 2 reasons:
		// 1. This is likely to occur in unrolled loops where regpressure is high,
		// so we want to use the latter as it has access to all 64 VSX registers.
		jsji: What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
		nemanjai: We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
		jsji: Yes, if the cycles is the same, then we should favor larger reg files. But looks like they are…
		nemanjai: I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I…
		jsji: > But I do not think it is a given Yes, agree, we shouldn't. I will try to confirm with HW…
		jsji: >> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I…
		nemanjai: I think this discussion was quite valuable, so thanks for bringing it up. And I do agree that…
		jsji: The comments looks great. Thanks!
		// 2. Using Altivec instructions in this sequence would likely cause the
		stefanp: nit: Line length.
		kamaub: I'm sorry, not sure where the problem is, I'm fairly certain this is less than 80 characters
		// allocation of Altivec registers even for the loads which in turn would
		// force the use of LXSIWZX for the loads, adding a cycle of latency to
		// each of the loads which would otherwise be able to use LFIWZX.
		def : Pat<(v4f32 (build_vector LoadFP.A, LoadFP.B, LoadFP.C, LoadFP.D)),
		(v4f32 (XXPERMDI (XXMRGHW MrgFP.LD32D, MrgFP.LD32C),
		(XXMRGHW MrgFP.LD32B, MrgFP.LD32A), 3))>;
		jsji: What about BigEndian?
		nemanjai: Yes, by all means, we need BE support as well.
def : Pat<(v4f32 (build_vector f32:$D, f32:$C, f32:$B, f32:$A)),		def : Pat<(v4f32 (build_vector f32:$D, f32:$C, f32:$B, f32:$A)),
(VMRGEW MrgFP.AC, MrgFP.BD)>;		(VMRGEW MrgFP.AC, MrgFP.BD)>;
def : Pat<(v4f32 (build_vector DblToFlt.A0, DblToFlt.A1,		def : Pat<(v4f32 (build_vector DblToFlt.A0, DblToFlt.A1,
DblToFlt.B0, DblToFlt.B1)),		DblToFlt.B0, DblToFlt.B1)),
(v4f32 (VMRGEW MrgFP.BAhToFlt, MrgFP.BAlToFlt))>;		(v4f32 (VMRGEW MrgFP.BAhToFlt, MrgFP.BAlToFlt))>;

// Convert 4 doubles to a vector of ints.		// Convert 4 doubles to a vector of ints.
def : Pat<(v4i32 (build_vector DblToInt.A, DblToInt.B,		def : Pat<(v4i32 (build_vector DblToInt.A, DblToInt.B,
[... 192 lines not shown ...]

llvm/test/CodeGen/PowerPC/float-vector-gather.ll

	; NOTE: This test ensures that for both Big and Little Endian cases a set of			; NOTE: This test ensures that for both Big and Little Endian cases a set of
	; NOTE: 4 floats is gathered into a v4f32 register using xxmrghd, xvcvdpsp,			; NOTE: 4 floats is gathered into a v4f32 register using xxmrghw and xxmrgld
	; NOTE: and vmrgew.
	; RUN: llc -verify-machineinstrs -mcpu=pwr9 -ppc-vsr-nums-as-vr \			; RUN: llc -verify-machineinstrs -mcpu=pwr9 -ppc-vsr-nums-as-vr \
				jsji: Add test for Big endian as well please.
	; RUN: -ppc-asm-full-reg-names -mtriple=powerpc64le-unknown-linux-gnu < %s \			; RUN: -ppc-asm-full-reg-names -mtriple=powerpc64le-unknown-linux-gnu < %s \
				stefanp: Please add a note at the start here to say what you are testing.
	; RUN: | FileCheck %s -check-prefix=CHECK-LE			; RUN: | FileCheck %s -check-prefix=CHECK-LE
	; RUN: llc -verify-machineinstrs -mcpu=pwr9 -ppc-vsr-nums-as-vr \			; RUN: llc -verify-machineinstrs -mcpu=pwr9 -ppc-vsr-nums-as-vr \
	; RUN: -ppc-asm-full-reg-names -mtriple=powerpc64-unknown-linux-gnu < %s \			; RUN: -ppc-asm-full-reg-names -mtriple=powerpc64-unknown-linux-gnu < %s \
	; RUN: | FileCheck %s -check-prefix=CHECK-BE			; RUN: | FileCheck %s -check-prefix=CHECK-BE
	define dso_local <4 x float> @vector_gatherf(float* nocapture readonly %a,			define dso_local <4 x float> @vector_gatherf(float* nocapture readonly %a,
	float* nocapture readonly %b, float* nocapture readonly %c,			float* nocapture readonly %b, float* nocapture readonly %c,
	float* nocapture readonly %d) {			float* nocapture readonly %d) {
	; C code from which this IR test case was generated:			; C code from which this IR test case was generated:
				amyk: I think `C code from which this IR test case was generated from` would sound more clear.
				nemanjai: I don't agree - let's not change it to end the sentence on a preposition :). But do add the…
	; vector float test(float a, float b, float c, float d) {			; vector float test(float a, float b, float c, float d) {
	; return (vector float) { a, b, c, d };			; return (vector float) { a, b, c, d };
	; }			; }
	; CHECK-LE-LABEL: vector_gatherf:			; CHECK-LE-LABEL: vector_gatherf:
	; CHECK-LE: # %bb.0: # %entry			; CHECK-LE: # %bb.0: # %entry
	; CHECK-LE-DAG: lfs f[[REG0:[0-9]+]], 0(r3)			; CHECK-LE-DAG: lfiwzx f[[REG0:[0-9]+]], 0, r6
	; CHECK-LE-DAG: lfs f[[REG1:[0-9]+]], 0(r4)			; CHECK-LE-DAG: lfiwzx f[[REG1:[0-9]+]], 0, r5
	; CHECK-LE-DAG: lfs f[[REG2:[0-9]+]], 0(r5)			; CHECK-LE-DAG: lfiwzx f[[REG2:[0-9]+]], 0, r4
	; CHECK-LE-DAG: lfs f[[REG3:[0-9]+]], 0(r6)			; CHECK-LE-DAG: lfiwzx f[[REG3:[0-9]+]], 0, r3
	; CHECK-LE-DAG: xxmrghd vs[[REG4:[0-9]+]], vs[[REG2]], vs[[REG0]]			; CHECK-LE-DAG: xxmrghw vs[[REG0]], vs[[REG0]], vs[[REG1]]
	; CHECK-LE-NEXT: xvcvdpsp v[[VREG2:[0-9]+]], vs[[REG4]]			; CHECK-LE-DAG: xxmrghw vs[[REG4:[0-9]+]], vs[[REG2]], vs[[REG3]]
	; CHECK-LE-NEXT: xxmrghd vs[[REG5:[0-9]+]], vs[[REG3]], vs[[REG1]]			; CHECK-LE-NEXT: xxmrgld v[[REG:[0-9]+]], vs[[REG0]], vs[[REG4]]
	; CHECK-LE-NEXT: xvcvdpsp v[[VREG3:[0-9]+]], vs[[REG5]]
	; CHECK-LE-NEXT: vmrgew v[[VREG:[0-9]+]], v[[VREG3]], v[[VREG2]]
	; CHECK-LE-NEXT: blr			; CHECK-LE-NEXT: blr

	; CHECK-BE-LABEL: vector_gatherf:			; CHECK-BE-LABEL: vector_gatherf:
	; CHECK-BE: # %bb.0: # %entry			; CHECK-BE: # %bb.0: # %entry
	; CHECK-BE-DAG: lfs f[[REG0:[0-9]+]], 0(r3)			; CHECK-BE-DAG: lfiwzx f[[REG0:[0-9]+]], 0, r3
	; CHECK-BE-DAG: lfs f[[REG1:[0-9]+]], 0(r4)			; CHECK-BE-DAG: lfiwzx f[[REG1:[0-9]+]], 0, r4
	; CHECK-BE-DAG: lfs f[[REG2:[0-9]+]], 0(r5)			; CHECK-BE-DAG: lfiwzx f[[REG2:[0-9]+]], 0, r5
	; CHECK-BE-DAG: lfs f[[REG3:[0-9]+]], 0(r6)			; CHECK-BE-DAG: lfiwzx f[[REG3:[0-9]+]], 0, r6
	; CHECK-BE-DAG: xxmrghd vs[[REG4:[0-9]+]], vs[[REG0]], vs[[REG2]]			; CHECK-BE-DAG: xxmrghw vs[[REG0]], vs[[REG0]], vs[[REG1]]
	; CHECK-BE-DAG: xxmrghd vs[[REG5:[0-9]+]], vs[[REG1]], vs[[REG3]]			; CHECK-BE-DAG: xxmrghw vs[[REG4:[0-9]+]], vs[[REG2]], vs[[REG3]]
	; CHECK-BE-NEXT: xvcvdpsp v[[VREG2:[0-9]+]], vs[[REG5]]			; CHECK-BE-NEXT: xxmrgld v[[REG:[0-9]+]], vs[[REG0]], vs[[REG4]]
	; CHECK-BE-NEXT: xvcvdpsp v[[VREG3:[0-9]+]], vs[[REG4]]
	; CHECK-BE-NEXT: vmrgew v[[VREG:[0-9]+]], v[[VREG3]], v[[VREG2]]
	; CHECK-BE-NEXT: blr			; CHECK-BE-NEXT: blr
	entry:			entry:
	%0 = load float, float* %a, align 4			%0 = load float, float* %a, align 4
	%vecinit = insertelement <4 x float> undef, float %0, i32 0			%vecinit = insertelement <4 x float> undef, float %0, i32 0
	%1 = load float, float* %b, align 4			%1 = load float, float* %b, align 4
	%vecinit1 = insertelement <4 x float> %vecinit, float %1, i32 1			%vecinit1 = insertelement <4 x float> %vecinit, float %1, i32 1
	%2 = load float, float* %c, align 4			%2 = load float, float* %c, align 4
	%vecinit2 = insertelement <4 x float> %vecinit1, float %2, i32 2			%vecinit2 = insertelement <4 x float> %vecinit1, float %2, i32 2
	%3 = load float, float* %d, align 4			%3 = load float, float* %d, align 4
	%vecinit3 = insertelement <4 x float> %vecinit2, float %3, i32 3			%vecinit3 = insertelement <4 x float> %vecinit2, float %3, i32 3
	ret <4 x float> %vecinit3			ret <4 x float> %vecinit3
	}			}

This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Improve float vector gather codegen
ClosedPublic
