This is an archive of the discontinued LLVM Phabricator instance.


Differential D50004

[PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers for P9
Closed, Public

Authored by amyk on Jul 30 2018, 12:37 PM.


Details

Reviewers

nemanjai
echristo
hfinkel

Commits

rGf3846067991e: [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar…
rL340643: [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar…

Summary

This patch uses the xscpsgndp instruction to copy floating point scalar registers, instead of the xxlor (specifically XXLORf) instruction that is currently used. The change applies to P9 only; pre-P9 targets will still use xxlor.

This patch includes:

The change in instruction opcode to utilize xscpsgndp instead of xxlor when copying the floating point scalar registers on P9
An update of test cases to reflect this behaviour (specifically for P9), while still using xxlor pre-P9
An update to test cases to include the -ppc-vsr-nums-as-vr -ppc-asm-full-reg-names llc options
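The opcode choice described in the list above can be sketched in isolation. This is a minimal standalone model, not LLVM code: the `CopyOpcode` enum and the `selectScalarFPCopy` and `emitCopy` helpers are hypothetical names introduced for illustration, and the `hasP9Vector` flag stands in for the subtarget query used by the patch.

```cpp
#include <string>

// Hypothetical stand-ins for the two PPC opcodes; not LLVM's real enum.
enum class CopyOpcode { XXLORf, XSCPSGNDP };

// Mirrors the patched ternary: targets with P9 vector support get
// xscpsgndp, everything older keeps xxlor.
constexpr CopyOpcode selectScalarFPCopy(bool hasP9Vector) {
  return hasP9Vector ? CopyOpcode::XSCPSGNDP : CopyOpcode::XXLORf;
}

// Renders a register-to-register copy as assembly-like text. Both
// instructions express a copy by repeating the source register.
std::string emitCopy(bool hasP9Vector, const std::string &dst,
                     const std::string &src) {
  const char *mnemonic =
      selectScalarFPCopy(hasP9Vector) == CopyOpcode::XSCPSGNDP ? "xscpsgndp"
                                                               : "xxlor";
  return std::string(mnemonic) + " " + dst + ", " + src + ", " + src;
}
```

For example, `emitCopy(true, "v2", "f1")` yields the P9 form checked by the updated tests, while `emitCopy(false, "v2", "f1")` yields the pre-P9 form.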

Diff Detail

Repository: rL LLVM

Event Timeline

amyk created this revision. Jul 30 2018, 12:37 PM

Herald added subscribers: hiraditya, qcolombet. Jul 30 2018, 12:37 PM

amyk edited the summary of this revision. Jul 31 2018, 3:25 AM

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

In D50004#1182972, @inouehrs wrote:

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

Can you clarify this please? Where is this information coming from? According to the UM, XXLOR takes up a whole superslice whereas XSCPSGNDP takes up a single slice so we can dispatch 2 of the former per cycle and 4 of the latter. And the "Pipe Busy Cycles" field for both is 1.
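As a rough illustration of the dispatch argument above, the following sketch compares how many cycles it would take to dispatch a batch of independent copies. It is a deliberately simplified model: the per-cycle limits (2 for XXLOR, 4 for XSCPSGNDP) are taken from the comment, the constant and function names are hypothetical, and every other pipeline constraint is ignored.

```cpp
#include <cstdint>

// Per-cycle dispatch limits on POWER9 as quoted in the comment above.
constexpr std::uint64_t kXxlorPerCycle = 2;     // occupies a whole superslice
constexpr std::uint64_t kXscpsgndpPerCycle = 4; // occupies a single slice

// Cycles needed to dispatch `copies` independent copies when at most
// `perCycle` of them can be dispatched each cycle (ceiling division).
constexpr std::uint64_t dispatchCycles(std::uint64_t copies,
                                       std::uint64_t perCycle) {
  return (copies + perCycle - 1) / perCycle;
}
```

Under these assumptions, eight independent copies need `dispatchCycles(8, kXxlorPerCycle)` = 4 cycles with XXLOR and `dispatchCycles(8, kXscpsgndpPerCycle)` = 2 cycles with XSCPSGNDP.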

In D50004#1182972, @inouehrs wrote:

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

Yes, I agree that we should limit this to Power9. Does the comment about normalization only pertain to ISA 2.07? The text from ISA 3.0 is:

Bit 0 of VSR[XT] is set to the contents of bit 0 of VSR[XA].
Bits 1:63 of VSR[XT] are set to the contents of bits 1:63 of VSR[XB].
The contents of doubleword element 1 of VSR[XT] are undefined.

There is no mention of normalization.
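The quoted ISA 3.0 text can be modelled on a single 64-bit doubleword. The sketch below is only an illustration of that wording, not a description of the hardware: `xscpsgndpModel` and `bitsOf` are hypothetical helper names, and doubleword element 1 (which the ISA leaves undefined) is not modelled. Note that Power numbers bits from the most significant end, so "bit 0" is the sign bit.

```cpp
#include <cstdint>
#include <cstring>

// Power numbers bits from the most significant end, so "bit 0" is the
// IEEE-754 sign bit of the double-precision value.
constexpr std::uint64_t kSignBit = 0x8000000000000000ULL;

// Model of doubleword 0 of the result: bit 0 comes from XA, bits 1:63
// come from XB. Nothing else is done to the value.
constexpr std::uint64_t xscpsgndpModel(std::uint64_t xa, std::uint64_t xb) {
  return (xa & kSignBit) | (xb & ~kSignBit);
}

// Reinterprets a double as its raw 64-bit pattern.
inline std::uint64_t bitsOf(double value) {
  std::uint64_t bits;
  std::memcpy(&bits, &value, sizeof bits);
  return bits;
}
```

When both source operands name the same register, as in a copy, this model returns the input bit pattern unchanged, including denormal patterns such as `0x0000000000000001`.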

Does the comment about normalization only pertain to ISA 2.07?

As far as I can tell from browsing the documents, neither ISA 2.07 nor ISA 3.0 mentions normalization by xscpsgndp.
The P8 UM says xscpsgndp, xvcpsgndp and fmr are normalizing instructions. The P9 UM says nothing.
Since fmr is also a normalizing instruction, I feel it is acceptable to use xscpsgndp for copying registers.

In D50004#1185697, @nemanjai wrote:

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

Can you clarify this please? Where is this information coming from? According to the UM, XXLOR takes up a whole superslice whereas XSCPSGNDP takes up a single slice so we can dispatch 2 of the former per cycle and 4 of the latter. And the "Pipe Busy Cycles" field for both is 1.

My fault, sorry, I saw the data from the wrong column "max ops per cycle". You are right, both are 1 cycle for busy pipe, XSCPSGNDP should outperform on Power9 then.

amyk retitled this revision from [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers to [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers for P9. Aug 3 2018, 4:52 AM

amyk edited the summary of this revision.

This revision primarily addresses enabling xscpsgndp for P9 only, as xxlor is still preferable to use for copying floating point scalars pre-P9.

In D50004#1185846, @inouehrs wrote:

Does the comment about normalization only pertain to ISA 2.07?

As far as I can tell from browsing the documents, neither ISA 2.07 nor ISA 3.0 mentions normalization by xscpsgndp.
The P8 UM says xscpsgndp, xvcpsgndp and fmr are normalizing instructions. The P9 UM says nothing.
Since fmr is also a normalizing instruction, I feel it is acceptable to use xscpsgndp for copying registers.

P9 UM does clarify the difference:

4.3.6 Handling of Denormal Single-Precision Values in Double-Precision Format
Unlike previous generation processors, such as the POWER8 processor, the POWER9 processor is capable of handling denormal single-precision values as inputs for all subsequent instructions. Whereas, in some cases, the POWER8 processor takes a soft-patch interrupt to allow the interrupt handler to reformat the input operands to a double-precision format and then re-execute the instruction, the POWER9 processor simply executes normally regardless of how that number was produced.

In D50004#1187230, @amyk wrote:

This revision primarily addresses enabling xscpsgndp for P9 only, as xxlor is still preferable to use for copying floating point scalars pre-P9.

Not sure how much impact this will have, but maybe we should still consider using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

In D50004#1187625, @jsji wrote:

<snip>

Not sure how much impact this will have, but maybe we should still consider using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

I don't think there are any conditions under which we will emit an xxlor that will be eligible for this. That may be a good candidate to peephole and/or fuse together.
Example:

vector double test(double a, vector double b, vector double c, vector double *s) {
  vector double n = (vector double)a;
  *s = n + c * b;
  return n;
}

Is about as close as you can get, but will produce the following on Power9:

xxspltd vs0, vs1, 0
xxlor vs1, vs0, vs0
xvmaddadp vs1, vs35, vs34
xxlor vs34, vs0, vs0
stxv vs1, 0(r9)

Ultimately, the target of the copy will always be used as an input to the destructive operation. If we want to exploit this optimization in the HW, we'd have to forward the source of the copy (and eliminate the second copy in this case as well). But if we're consciously transforming the code to exploit this, the instruction we use for the COPY is immaterial (we can always transform it to XXLOR at the time).

In D50004#1192390, @nemanjai wrote:
In D50004#1187625, @jsji wrote:

<snip>

Not sure how much impact this will have, but maybe we should still consider using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

I don't think there are any conditions under which we will emit an xxlor that will be eligible for this. That may be a good candidate to peephole and/or fuse together.
Example:

vector double test(double a, vector double b, vector double c, vector double *s) {
  vector double n = (vector double)a;
  *s = n + c * b;
  return n;
}

Is about as close as you can get, but will produce the following on Power9:

xxspltd vs0, vs1, 0
xxlor vs1, vs0, vs0
xvmaddadp vs1, vs35, vs34
xxlor vs34, vs0, vs0
stxv vs1, 0(r9)

Ultimately, the target of the copy will always be used as an input to the destructive operation. If we want to exploit this optimization in the HW, we'd have to forward the source of the copy (and eliminate the second copy in this case as well). But if we're consciously transforming the code to exploit this, the instruction we use for the COPY is immaterial (we can always transform it to XXLOR at the time).

FYI.

An ugly example showing that there are situations where this change affects destructive operations.

$ cat t1.c

double test(double n0, double n1, double n2, double n3, double n4, double n5, double n6, double n7, double n8, double n9,
double n10, double n11, double n12, double n13, double n14, double n15, double n16, double n17, double n18, double n19,
double n20, double n21, double n22, double n23, double n24, double n25, double n26, double n27, double n28, double n29,
double n30, double n31, double n32, double n33,
 double c, double b, double *s) {
  *s = n0 + c * b;
  *s += n1 + *s * b;
  *s += n2 + *s * b;
  *s += n3 + *s * b;
  *s += n4 + *s * b;
  *s += n5+ *s * b;
  *s += n6+ *s * b;
  *s += n7+ *s * b;
  *s += n8+ *s * b;
  *s += n9+ *s * b;
  *s += n10 + *s * b;
  *s += n11 + *s * b;
  *s += n12 + *s * b;
  *s += n13 + *s * b;
  *s += n14 + *s * b;
  *s += n15+ *s * b;
  *s += n16+ *s * b;
  *s += n17+ *s * b;
  *s += n18+ *s * b;
  *s += n19+ *s * b;
  *s += n20 + *s * b;
  *s += n21 + *s * b;
  *s += n22 + *s * b;
  *s += n23 + *s * b;
  *s += n24 + *s * b;
  *s += n25+ *s * b;
  *s += n26+ *s * b;
  *s += n27+ *s * b;
  *s += n28+ *s * b;
  *s += n29+ *s * b;
  *s += n30 + *s * b;
  *s += n31 + *s * b;
  *s += n32 + *s * b;
  *s += n33 + *s * b;
  return n0;
}

clang -S -mcpu=pwr9 -O2 -ffast-math t1.c -mllvm -ppc-vsr-nums-as-vr -mllvm -ppc-asm-full-reg-names -mllvm -enable-post-misched=false

diff of assembly before change and after change:

$ diff -Naur before.s after.s 
--- before.s    2018-08-09 13:52:24.785246846 -0400
+++ after.s     2018-08-09 13:59:38.815493708 -0400
@@ -9,7 +9,7 @@
 # %bb.0:                                # %entry
        lfd f0, 304(r1)
        lxsd v2, 312(r1)
-       xxlor v3, f1, f1
+      xscpsgndp v3, f1, f1
        xsmaddadp v3, v2, f0
        xsadddp f0, v3, f2
        xsmaddadp f0, v3, v2

In D50004#1194086, @jsji wrote:

<snip>

diff of assembly before change and after change:

$ diff -Naur before.s after.s 
--- before.s    2018-08-09 13:52:24.785246846 -0400
+++ after.s     2018-08-09 13:59:38.815493708 -0400
@@ -9,7 +9,7 @@
 # %bb.0:                                # %entry
        lfd f0, 304(r1)
        lxsd v2, 312(r1)
-       xxlor v3, f1, f1
+      xscpsgndp v3, f1, f1
        xsmaddadp v3, v2, f0
        xsadddp f0, v3, f2
        xsmaddadp f0, v3, v2

This is exactly what I was referring to... The situation you describe is not analogous to the situation described in the ISA. According to the ISA, the sequence that will be optimized is:

xxlor XC, XT, XT
xxperm XT, XA, XB

So in this case, the only way we would get the optimized behaviour would be if the "pre-patch" code sequence was:

xxlor v3, f1, f1
xsmaddadp f1, v2, f0

And I'm fairly certain that without source forwarding of the copy, we can never produce such code (but of course, I could be wrong).

In D50004#1195184, @nemanjai wrote:
In D50004#1194086, @jsji wrote:

<snip>

This is exactly what I was referring to... The situation you describe is not analogous to the situation described in the ISA. According to the ISA, the sequence that will be optimized is:

xxlor XC, XT, XT
xxperm XT, XA, XB

So in this case, the only way we would get the optimized behaviour would be if the "pre-patch" code sequence was:

xxlor v3, f1, f1
xsmaddadp f1, v2, f0

And I'm fairly certain that without source forwarding of the copy, we can never produce such code (but of course, I could be wrong).

??? Can you please double check which ISA you are referring to?

The description in PowerISA_public.v3.0B https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv is:

As an example, to preserve the XT source register in the xxperm instruction, the following sequence will optimize performance.

xxlor XT,XC,XC /* Copy (XC) to XT
xxperm XT,XA,XB /* Permute, overwriting XT

The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance.

This should be exactly the same pattern as in my example: xxlor XT,XC,XC to Copy (XC) to XT, not xxlor XC,XT,XT as in your description.

In D50004#1195212, @jsji wrote:

??? Can you please double check which ISA you are referring to?

ISA 3.0

The description in PowerISA_public.v3.0B https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv is:

As an example, to preserve the XT source register in the xxperm instruction, the following sequence will optimize performance.

xxlor XT,XC,XC /* Copy (XC) to XT
xxperm XT,XA,XB /* Permute, overwriting XT

The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance.

This should be exactly the same pattern as in my example: xxlor XT,XC,XC to Copy (XC) to XT, not xxlor XC,XT,XT as in your description.

It would appear that the version of the document I have has a bug in it (I had ISA 3.0 - without the B). Yes, I agree that the description in the corrected ISA document matches both of the provided examples.
However, I don't think this should preclude this patch. Register/register copies are far more common than destructive operations so we should emit the best instruction for the copy. As a follow-up, we should detect copies that are inputs to destructive operations and emit the XXLOR for those.
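The follow-up suggested above could look roughly like the sketch below. It is a toy model over a hypothetical `Inst` record rather than LLVM's MachineInstr API, and the function name `preferXxlorBeforeDestructive` is invented for illustration. It assumes the pattern from the ISA text: a copy `XT, XC, XC` immediately followed by a destructive instruction that overwrites XT.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record; not LLVM's MachineInstr.
struct Inst {
  std::string opcode;
  std::string dst;
  std::string srcA;
  std::string srcB;
  bool destructive; // true if dst is also read as an input (e.g. xsmaddadp)
};

// Rewrites "xscpsgndp XT, XC, XC" to "xxlor XT, XC, XC" when the very next
// instruction is destructive and targets XT. Returns the number of rewrites.
std::size_t preferXxlorBeforeDestructive(std::vector<Inst> &insts) {
  std::size_t rewritten = 0;
  for (std::size_t i = 0; i + 1 < insts.size(); ++i) {
    Inst &copy = insts[i];
    const Inst &next = insts[i + 1];
    // A register copy repeats the same source in both operand slots.
    const bool isCopy = copy.opcode == "xscpsgndp" && copy.srcA == copy.srcB;
    if (isCopy && next.destructive && next.dst == copy.dst) {
      copy.opcode = "xxlor";
      ++rewritten;
    }
  }
  return rewritten;
}
```

Applied to the sequence from the assembly diff earlier in the thread (`xscpsgndp v3, f1, f1` followed by `xsmaddadp v3, v2, f0`), this sketch would turn the copy back into `xxlor v3, f1, f1` and leave other copies untouched.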

In D50004#1196993, @nemanjai wrote:

It would appear that the version of the document I have has a bug in it (I had ISA 3.0 - without the B). Yes, I agree that the description in the corrected ISA document matches both of the provided examples.
However, I don't think this should preclude this patch. Register/register copies are far more common than destructive operations so we should emit the best instruction for the copy. As a follow-up, we should detect copies that are inputs to destructive operations and emit the XXLOR for those.

Yes, agree. It is OK as long as we will follow up for destructive operations. Thanks!

LGTM. Since we're all in agreement that we want to make this change for Power9, I think this is fine to go in.

This revision is now accepted and ready to land. Aug 14 2018, 10:25 AM

Updated the diff, since a specific test on line 528 in vsx.ll now uses the v4 register instead of vs0.

In D50004#1204121, @amyk wrote:

Updated the diff, since a specific test on line 528 in vsx.ll now uses the v4 register instead of vs0.

Can we split the "An update to test cases to include the -ppc-vsr-nums-as-vr -ppc-asm-full-reg-names llc options" part into a separate patch and commit it first,
so that the intended xxlor -> xscpsgndp change is easier to see in all the test cases?

Closed by commit rL340643: [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar… (authored by stefanp). Aug 24 2018, 1:01 PM

This revision was automatically updated to reflect the committed changes.

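Revision Contents

Path                                                       Size
llvm/trunk/lib/Target/PowerPC/PPCInstrInfo.cpp             2 lines
llvm/trunk/test/CodeGen/PowerPC/builtins-ppc-p9-f128.ll    2 lines
llvm/trunk/test/CodeGen/PowerPC/f128-conv.ll               14 lines
llvm/trunk/test/CodeGen/PowerPC/f128-passByValue.ll        4 lines
llvm/trunk/test/CodeGen/PowerPC/p9_copy_fp.ll              48 lines
llvm/trunk/test/CodeGen/PowerPC/vsx-spill.ll               8 lines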

Diff 162452

llvm/trunk/lib/Target/PowerPC/PPCInstrInfo.cpp

[... 981 lines not shown; inside: else if (PPC::VSRCRegClass.contains(DestReg, SrcReg)) ...]
     // 2. xmovdp/xmovsp: This has higher latency (on the P7), 6 cycles, but
     // can go to either pipeline.
     // We'll always use xxlor here, because in practically all cases where
     // copies are generated, they are close enough to some use that the
     // lower-latency form is preferable.
     Opc = PPC::XXLOR;
   else if (PPC::VSFRCRegClass.contains(DestReg, SrcReg) ||
            PPC::VSSRCRegClass.contains(DestReg, SrcReg))
-    Opc = PPC::XXLORf;
+    Opc = (Subtarget.hasP9Vector()) ? PPC::XSCPSGNDP : PPC::XXLORf;
   else if (PPC::QFRCRegClass.contains(DestReg, SrcReg))
     Opc = PPC::QVFMR;
   else if (PPC::QSRCRegClass.contains(DestReg, SrcReg))
     Opc = PPC::QVFMRs;
   else if (PPC::QBRCRegClass.contains(DestReg, SrcReg))
     Opc = PPC::QVFMRb;
   else if (PPC::CRBITRCRegClass.contains(DestReg, SrcReg))
     Opc = PPC::CROR;
[... 2,674 lines not shown ...]

llvm/trunk/test/CodeGen/PowerPC/builtins-ppc-p9-f128.ll

[... 90 lines not shown ...]
 declare fp128 @llvm.ppc.divf128.round.to.odd(fp128, fp128)

 define double @testTruncOdd(fp128 %a) {
 entry:
   %0 = call double @llvm.ppc.truncf128.round.to.odd(fp128 %a)
   ret double %0
 ; CHECK-LABEL: testTruncOdd
 ; CHECK: xscvqpdpo v2, v2
-; CHECK: xxlor f1, v2, v2
+; CHECK: xscpsgndp f1, v2, v2
 ; CHECK: blr
 }

 declare double @llvm.ppc.truncf128.round.to.odd(fp128)

llvm/trunk/test/CodeGen/PowerPC/f128-conv.ll

[... 408 lines not shown ...]
 @f128global = global fp128 0xL300000000000000040089CA8F5C28F5C, align 16

 ; Function Attrs: norecurse nounwind readonly
 define double @qpConv2dp(fp128* nocapture readonly %a) {
 ; CHECK-LABEL: qpConv2dp:
 ; CHECK: # %bb.0: # %entry
 ; CHECK-NEXT: lxv v2, 0(r3)
 ; CHECK-NEXT: xscvqpdp v2, v2
-; CHECK-NEXT: xxlor f1, v2, v2
+; CHECK-NEXT: xscpsgndp f1, v2, v2
 ; CHECK-NEXT: blr
 entry:
   %0 = load fp128, fp128* %a, align 16
   %conv = fptrunc fp128 %0 to double
   ret double %conv
 }

 ; Function Attrs: norecurse nounwind
[... 128 lines not shown ...]
 }

 @f128Glob = common global fp128 0xL00000000000000000000000000000000, align 16

 ; Function Attrs: norecurse nounwind readnone
 define fp128 @dpConv2qp(double %a) {
 ; CHECK-LABEL: dpConv2qp:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext double %a to fp128
   ret fp128 %conv
 }

 ; Function Attrs: norecurse nounwind
[... 32 lines not shown; inside: entry: ...]
   store fp128 %conv, fp128* @f128Glob, align 16
   ret void
 }

 ; Function Attrs: norecurse nounwind
 define void @dpConv2qp_03(fp128* nocapture %res, i32 signext %idx, double %a) {
 ; CHECK-LABEL: dpConv2qp_03:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: sldi r4, r4, 4
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: stxvx v2, r3, r4
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext double %a to fp128
   %idxprom = sext i32 %idx to i64
   %arrayidx = getelementptr inbounds fp128, fp128* %res, i64 %idxprom
   store fp128 %conv, fp128* %arrayidx, align 16
   ret void
 }

 ; Function Attrs: norecurse nounwind
 define void @dpConv2qp_04(double %a, fp128* nocapture %res) {
 ; CHECK-LABEL: dpConv2qp_04:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: stxv v2, 0(r4)
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext double %a to fp128
   store fp128 %conv, fp128* %res, align 16
   ret void
 }

 ; Function Attrs: norecurse nounwind readnone
 define fp128 @spConv2qp(float %a) {
 ; CHECK-LABEL: spConv2qp:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext float %a to fp128
   ret fp128 %conv
 }

 ; Function Attrs: norecurse nounwind
[... 32 lines not shown; inside: entry: ...]
   store fp128 %conv, fp128* @f128Glob, align 16
   ret void
 }

 ; Function Attrs: norecurse nounwind
 define void @spConv2qp_03(fp128* nocapture %res, i32 signext %idx, float %a) {
 ; CHECK-LABEL: spConv2qp_03:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: sldi r4, r4, 4
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: stxvx v2, r3, r4
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext float %a to fp128
   %idxprom = sext i32 %idx to i64
   %arrayidx = getelementptr inbounds fp128, fp128* %res, i64 %idxprom
   store fp128 %conv, fp128* %arrayidx, align 16
   ret void
 }

 ; Function Attrs: norecurse nounwind
 define void @spConv2qp_04(float %a, fp128* nocapture %res) {
 ; CHECK-LABEL: spConv2qp_04:
 ; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: xxlor v2, f1, f1
+; CHECK-NEXT: xscpsgndp v2, f1, f1
 ; CHECK-NEXT: xscvdpqp v2, v2
 ; CHECK-NEXT: stxv v2, 0(r4)
 ; CHECK-NEXT: blr
 entry:
   %conv = fpext float %a to fp128
   store fp128 %conv, fp128* %res, align 16
   ret void
 }
[... 126 lines not shown ...]

llvm/trunk/test/CodeGen/PowerPC/f128-passByValue.ll

[... 148 lines not shown ...]
 }

 ; Function Attrs: norecurse nounwind
 define fp128 @mixParam_02(fp128 %p1, double %p2, i64* nocapture %p3,
 ; CHECK-LABEL: mixParam_02:
 ; CHECK: # %bb.0: # %entry
 ; CHECK-DAG: lwz r3, 96(r1)
 ; CHECK: add r4, r7, r9
-; CHECK-NEXT: xxlor v[[REG0:[0-9]+]], f1, f1
+; CHECK-NEXT: xscpsgndp v[[REG0:[0-9]+]], f1, f1
 ; CHECK-DAG: add r4, r4, r10
 ; CHECK: xscvdpqp v[[REG0]], v[[REG0]]
 ; CHECK-NEXT: add r3, r4, r3
 ; CHECK-NEXT: clrldi r3, r3, 32
 ; CHECK-NEXT: std r3, 0(r6)
 ; CHECK-NEXT: lxv v[[REG1:[0-9]+]], 0(r8)
 ; CHECK-NEXT: xsaddqp v2, v[[REG1]], v2
 ; CHECK-NEXT: xsaddqp v2, v2, v3
[... 15 lines not shown; inside: entry: ...]
   ret fp128 %add7
 }

 ; Function Attrs: norecurse nounwind
 define fastcc fp128 @mixParam_02f(fp128 %p1, double %p2, i64* nocapture %p3,
 ; CHECK-LABEL: mixParam_02f:
 ; CHECK: # %bb.0: # %entry
 ; CHECK-NEXT: add r4, r4, r6
-; CHECK-NEXT: xxlor v[[REG0:[0-9]+]], f1, f1
+; CHECK-NEXT: xscpsgndp v[[REG0:[0-9]+]], f1, f1
 ; CHECK-NEXT: add r4, r4, r7
 ; CHECK-NEXT: xscvdpqp v[[REG0]], v[[REG0]]
 ; CHECK-NEXT: add r4, r4, r8
 ; CHECK-NEXT: clrldi r4, r4, 32
 ; CHECK-NEXT: std r4, 0(r3)
 ; CHECK-NEXT: lxv v[[REG1:[0-9]+]], 0(r5)
 ; CHECK-NEXT: xsaddqp v2, v[[REG1]], v2
 ; CHECK-NEXT: xsaddqp v2, v2, v[[REG0]]
[... 71 lines not shown ...]

llvm/trunk/test/CodeGen/PowerPC/p9_copy_fp.ll

+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mattr=+vsx -ppc-vsr-nums-as-vr \
+; RUN: -mtriple=powerpc64le-unknown-linux-gnu -ppc-asm-full-reg-names < %s \
+; RUN: | FileCheck %s
+; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mattr=+vsx -ppc-vsr-nums-as-vr \
+; RUN: -mtriple=powerpc64-unknown-linux-gnu -ppc-asm-full-reg-names < %s \
+; RUN: | FileCheck -check-prefix=CHECK-BE %s
+
+; Function Attrs: norecurse nounwind readnone
+define double @cp_fp1(<2 x double> %v) {
+; CHECK-LABEL: cp_fp1:
+; CHECK: xscpsgndp f1, v2, v2
+; CHECK: blr
+
+; CHECK-BE-LABEL: cp_fp1:
+; CHECK-BE: xxswapd vs1, v2
+; CHECK-BE: blr
+entry:
+  %vecext = extractelement <2 x double> %v, i32 1
+  ret double %vecext
+}
+
+; Function Attrs: norecurse nounwind readnone
+define double @cp_fp2(<2 x double> %v) {
+; CHECK-LABEL: cp_fp2:
+; CHECK: xxswapd vs1, v2
+; CHECK: blr
+
+; CHECK-BE-LABEL: cp_fp2:
+; CHECK-BE: xscpsgndp f1, v2, v2
+; CHECK-BE: blr
+entry:
+  %vecext = extractelement <2 x double> %v, i32 0
+  ret double %vecext
+}
+
+; Function Attrs: norecurse nounwind readnone
+define <2 x double> @cp_fp3(double %v) {
+; CHECK-LABEL: cp_fp3:
+; CHECK: xxspltd v2, vs1, 0
+; CHECK: blr
+
+; CHECK-BE-LABEL: cp_fp3:
+; CHECK-BE: xscpsgndp v2, f1, f1
+; CHECK-BE: blr
+entry:
+  %vecins = insertelement <2 x double> undef, double %v, i32 0
+  ret <2 x double> %vecins
+}

llvm/trunk/test/CodeGen/PowerPC/vsx-spill.ll

[... 30 lines not shown ...]
 ; CHECK-FISL-NOT: ori
 ; CHECK-FISL: li r3, -152
 ; CHECK-FISL-NOT: lis
 ; CHECK-FISL-NOT: ori
 ; CHECK-FISL: stxsdx f1, r1, r3
 ; CHECK-FISL: blr

 ; CHECK-P9-REG: @foo1
-; CHECK-P9-REG: xxlor v2, f1, f1
-; CHECK-P9-REG: xxlor f1, v2, v2
+; CHECK-P9-REG: xscpsgndp v2, f1, f1
+; CHECK-P9-REG: xscpsgndp f1, v2, v2
 ; CHECK-P9-REG: blr

 ; CHECK-P9-FISL: @foo1
 ; CHECK-P9-FISL: stfd f31, -8(r1)
 ; CHECK-P9-FISL: blr

 return: ; preds = %entry
   ret double %a
[... 12 lines not shown ...]

 ; CHECK-FISL: @foo2
 ; CHECK-FISL: xsadddp f1, f1, f1
 ; CHECK-FISL: stxsdx f1, r1, r3
 ; CHECK-FISL: lxsdx f1, r1, r3
 ; CHECK-FISL: blr

 ; CHECK-P9-REG: @foo2
-; CHECK-P9-REG: {{xxlor|xsadddp}} v2, f1, f1
-; CHECK-P9-REG: {{xxlor|xsadddp}} f1, v2, v2
+; CHECK-P9-REG: {{xscpsgndp|xsadddp}} v2, f1, f1
+; CHECK-P9-REG: {{xscpsgndp|xsadddp}} f1, v2, v2
 ; CHECK-P9-REG: blr

 ; CHECK-P9-FISL: @foo2
 ; CHECK-P9-FISL: xsadddp f1, f1, f1
 ; CHECK-P9-FISL: stfd f1, -152(r1)
 ; CHECK-P9-FISL: lfd f1, -152(r1)
 ; CHECK-P9-FISL: blr

[... 29 lines not shown ...]