This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers for P9
ClosedPublic

Authored by amyk on Jul 30 2018, 12:37 PM.

Details

Summary

This patch will address using the xscpsgndp instruction to copy floating point scalar registers instead of the xxlor (specifically XXLORf) instruction that is currently used. Additionally, this patch of utilizing xscpsgndp will apply to P9, while pre-P9 will still use xxlor.

This patch includes:

  • The change in instruction opcode to utilize xscpsgndp instead of xxlor when copying the floating point scalar registers on P9
  • An update of test cases to reflect this behaviour (specifically for P9), while still using xxlor pre-P9
  • An update to test cases to include the -ppc-vsr-nums-as-vr -ppc-asm-full-reg-names llc options

Diff Detail

Event Timeline

amyk created this revision.Jul 30 2018, 12:37 PM
amyk edited the summary of this revision. (Show Details)Jul 31 2018, 3:25 AM

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

Can you clarify this please? Where is this information coming from? According to the UM, XXLOR takes up a whole superslice whereas XSCPSGNDP takes up a single slice so we can dispatch 2 of the former per cycle and 4 of the latter. And the "Pipe Busy Cycles" field for both is 1.

XSCPSGNDP has longer latency (6 cycles) than XXLOR (2 cycles) on POWER8 while it has higher throughput with the same latency on POWER9. So XXLOR is preferable for pre-P9.

Also, the two instructions have different behavior for a denormal input value in my understanding; XSCPSGNDP does normalization but XXLOR does not. Does this difference matter?

Yes, I agree that we should limit this to Power9. Does the comment about normalization only pertain to ISA 2.07? The text from ISA 3.0 is:

Bit 0 of VSR[XT] is set to the contents of bit 0 of VSR[XA].
Bits 1:63 of VSR[XT] are set to the contents of bits 1:63 of VSR[XB].
The contents of doubleword element 1 of VSR[XT] are undefined.

There is no mention of normalization.

Does the comment about normalization only pertain to ISA 2.07?

As I browse the document, neither ISA 2.07 nor 3.0 mention about normalization by xscpsgndp.
P8 UM says xscpsgndp, xvcpsgndp and fmr are normalizing instruction. P9 UM say nothing.
Since fmr is also a normalizing instruction, I feel it is acceptable to use xscpsgndp for coping register.

+1, even for Power9 XSCPSGNDP makes pipeline busy longer than XXLOR, XXLOR is still better.

Can you clarify this please? Where is this information coming from? According to the UM, XXLOR takes up a whole superslice whereas XSCPSGNDP takes up a single slice so we can dispatch 2 of the former per cycle and 4 of the latter. And the "Pipe Busy Cycles" field for both is 1.

My fault, sorry, I saw the data from the wrong column "max ops per cycle". You are right, both are 1 cycle for busy pipe, XSCPSGNDP should outperform on Power9 then.

amyk retitled this revision from [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers to [PowerPC] Emit xscpsgndp instead of xxlor when copying floating point scalar registers for P9.Aug 3 2018, 4:52 AM
amyk edited the summary of this revision. (Show Details)
amyk updated this revision to Diff 158979.Aug 3 2018, 5:00 AM

This revision primarily addresses enabling xscpsgndp for P9 only, as xxlor is still preferable to use for copying floating point scalars pre-P9.

jsji added a subscriber: jsji.EditedAug 3 2018, 9:35 AM

Does the comment about normalization only pertain to ISA 2.07?

As I browse the document, neither ISA 2.07 nor 3.0 mention about normalization by xscpsgndp.
P8 UM says xscpsgndp, xvcpsgndp and fmr are normalizing instruction. P9 UM say nothing.
Since fmr is also a normalizing instruction, I feel it is acceptable to use xscpsgndp for coping register.

P9 UM does clarify the difference:

4.3.6 Handling of Denormal Single-Precision Values in Double-Precision Format
Unlike previous generation processors, such as the POWER8 processor, the POWER9 processor is capable of handling denormal single-precision values as inputs for all subsequent instructions. Whereas, in some cases, the POWER8 processor takes a soft-patch interrupt to allow the interrupt handler to reformat the input operands to a double-precision format and then re-execute the instruction, the POWER9 processor simply executes normally regardless of how that number was produced.

jsji added a comment.EditedAug 3 2018, 9:40 AM

This revision primarily addresses enabling xscpsgndp for P9 only, as xxlor is still preferable to use for copying floating point scalars pre-P9.

No sure how much this will have impact, but maybe we need to consider still using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

<snip>

No sure how much this will have impact, but maybe we need to consider still using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

I don't think there are any conditions under which we will emit an xxlor that will be eligible for this. That may be a good candidate to peephole and/or fuse together.
Example:

vector double test(double a, vector double b, vector double c, vector double *s) {
  vector double n = (vector double)a;
  *s = n + c * b;
  return n;
}

Is about as close as you can get, but will produce the following on Power9:

xxspltd vs0, vs1, 0
xxlor vs1, vs0, vs0
xvmaddadp vs1, vs35, vs34
xxlor vs34, vs0, vs0
stxv vs1, 0(r9)

Ultimately, the target of the copy will always be used as an input to the destructive operation. If we want to exploit this optimization in the HW, we'd have to forward the source of the copy (and eliminate the second copy in this case as well). But if we're consciously transforming the code to exploit this, the instruction we use for the COPY is immaterial (we can always transform it to XXLOR at the time).

jsji added a comment.EditedAug 9 2018, 11:00 AM

<snip>

No sure how much this will have impact, but maybe we need to consider still using xxlor for destructive instructions?

eg:
In Power ISA 3.0 B, 2.1.5 Destructive Operation Operand Preservation
"The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance."

I don't think there are any conditions under which we will emit an xxlor that will be eligible for this. That may be a good candidate to peephole and/or fuse together.
Example:

vector double test(double a, vector double b, vector double c, vector double *s) {
  vector double n = (vector double)a;
  *s = n + c * b;
  return n;
}

Is about as close as you can get, but will produce the following on Power9:

xxspltd vs0, vs1, 0
xxlor vs1, vs0, vs0
xvmaddadp vs1, vs35, vs34
xxlor vs34, vs0, vs0
stxv vs1, 0(r9)

Ultimately, the target of the copy will always be used as an input to the destructive operation. If we want to exploit this optimization in the HW, we'd have to forward the source of the copy (and eliminate the second copy in this case as well). But if we're consciously transforming the code to exploit this, the instruction we use for the COPY is immaterial (we can always transform it to XXLOR at the time).

FYI.

An ugly example to show that the there are situations that this change will have impact to destructive operations.

$ cat t1.c

double test(double n0, double n1, double n2, double n3, double n4, double n5, double n6, double n7, double n8, double n9,
double n10, double n11, double n12, double n13, double n14, double n15, double n16, double n17, double n18, double n19,
double n20, double n21, double n22, double n23, double n24, double n25, double n26, double n27, double n28, double n29,
double n30, double n31, double n32, double n33,
 double c, double b, double *s) {
  *s = n0 + c * b;
  *s += n1 + *s * b;
  *s += n2 + *s * b;
  *s += n3 + *s * b;
  *s += n4 + *s * b;
  *s += n5+ *s * b;
  *s += n6+ *s * b;
  *s += n7+ *s * b;
  *s += n8+ *s * b;
  *s += n9+ *s * b;
  *s += n10 + *s * b;
  *s += n11 + *s * b;
  *s += n12 + *s * b;
  *s += n13 + *s * b;
  *s += n14 + *s * b;
  *s += n15+ *s * b;
  *s += n16+ *s * b;
  *s += n17+ *s * b;
  *s += n18+ *s * b;
  *s += n19+ *s * b;
  *s += n20 + *s * b;
  *s += n21 + *s * b;
  *s += n22 + *s * b;
  *s += n23 + *s * b;
  *s += n24 + *s * b;
  *s += n25+ *s * b;
  *s += n26+ *s * b;
  *s += n27+ *s * b;
  *s += n28+ *s * b;
  *s += n29+ *s * b;
  *s += n30 + *s * b;
  *s += n31 + *s * b;
  *s += n32 + *s * b;
  *s += n33 + *s * b;
  return n0;
}

clang -S -mcpu=pwr9 -O2 -ffast-math t1.c -mllvm -ppc-vsr-nums-as-vr -mllvm -ppc-asm-full-reg-names -mllvm -enable-post-misched=false

diff of assembly before change and after change:

$ diff -Naur before.s after.s 
--- before.s    2018-08-09 13:52:24.785246846 -0400
+++ after.s     2018-08-09 13:59:38.815493708 -0400
@@ -9,7 +9,7 @@
 # %bb.0:                                # %entry
        lfd f0, 304(r1)
        lxsd v2, 312(r1)
-       xxlor v3, f1, f1
+      xscpsgndp v3, f1, f1
        xsmaddadp v3, v2, f0
        xsadddp f0, v3, f2
        xsmaddadp f0, v3, v2

<snip>

diff of assembly before change and after change:

$ diff -Naur before.s after.s 
--- before.s    2018-08-09 13:52:24.785246846 -0400
+++ after.s     2018-08-09 13:59:38.815493708 -0400
@@ -9,7 +9,7 @@
 # %bb.0:                                # %entry
        lfd f0, 304(r1)
        lxsd v2, 312(r1)
-       xxlor v3, f1, f1
+      xscpsgndp v3, f1, f1
        xsmaddadp v3, v2, f0
        xsadddp f0, v3, f2
        xsmaddadp f0, v3, v2

This is exactly what I was referring to... The situation you describe is not analogous to the situation described in the ISA. According to the ISA, the sequence that will be optimized is:

xxlor XC, XT, XT
xxperm XT, XA, XB

So in this case, the only way we would get the optimized behaviour would be if the "pre-patch" code sequence was:

xxlor v3, f1, f1
xsmaddadp f1, v2, f0

And I'm fairly certain that without source forwarding of the copy, we can never produce such code (but of course, I could be wrong).

jsji added a comment.Aug 10 2018, 6:39 AM

<snip>

This is exactly what I was referring to... The situation you describe is not analogous to the situation described in the ISA. According to the ISA, the sequence that will be optimized is:

xxlor XC, XT, XT
xxperm XT, XA, XB

So in this case, the only way we would get the optimized behaviour would be if the "pre-patch" code sequence was:

xxlor v3, f1, f1
xsmaddadp f1, v2, f0

And I'm fairly certain that without source forwarding of the copy, we can never produce such code (but of course, I could be wrong).

??? Can you please double check what ISA are you referring to?

The description in PowerISA_public.v3.0B https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv is:

As an example, to preserve the XT source register in the xxperm instruction, the following sequence will optimize performance.

xxlor XT,XC,XC /* Copy (XC) to XT
xxperm XT,XA,XB /* Permute, overwriting XT

The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance.

This should be exact the same pattern as in my example: xxlor XT,XC,XC to Copy (XC) to XT, not xxlor XC,XT,XT in your description.

??? Can you please double check what ISA are you referring to?

ISA 3.0

The description in PowerISA_public.v3.0B https://ibm.ent.box.com/s/1hzcwkwf8rbju5h9iyf44wm94amnlcrv is:

As an example, to preserve the XT source register in the xxperm instruction, the following sequence will optimize performance.

xxlor XT,XC,XC /* Copy (XC) to XT
xxperm XT,XA,XB /* Permute, overwriting XT

The set of instructions listed below, when immediately preceded by the xxlor XT,XC,XC instruction in a sequence similar to the above example, will provide optimal performance.

This should be exact the same pattern as in my example: xxlor XT,XC,XC to Copy (XC) to XT, not xxlor XC,XT,XT in your description.

It would appear that the version of the document I have has a bug in it (I had ISA 3.0 - without the B). Yes, I agree that the description in the corrected ISA document matches both of the provided examples.
However, I don't think this should preclude this patch. Register/register copies are far more common than destructive operations so we should emit the best instruction for the copy. As a follow-up, we should detect copies that are inputs to destructive operations and emit the XXLOR for those.

jsji added a comment.Aug 13 2018, 7:00 AM

It would appear that the version of the document I have has a bug in it (I had ISA 3.0 - without the B). Yes, I agree that the description in the corrected ISA document matches both of the provided examples.
However, I don't think this should preclude this patch. Register/register copies are far more common than destructive operations so we should emit the best instruction for the copy. As a follow-up, we should detect copies that are inputs to destructive operations and emit the XXLOR for those.

Yes, agree. It is OK as long as we will follow up for destructive operations. Thanks!

nemanjai accepted this revision.Aug 14 2018, 10:25 AM

LGTM. Since we're all in agreement that we want to make this change for Power9, I think this is fine to go in.

This revision is now accepted and ready to land.Aug 14 2018, 10:25 AM
amyk updated this revision to Diff 161242.Aug 17 2018, 7:33 AM

Updated the diff to since a specific test on line 528 in vsx.ll now using the v4 register instead of vs0.

jsji added a comment.Aug 17 2018, 7:47 AM

Updated the diff to since a specific test on line 528 in vsx.ll now using the v4 register instead of vs0.

Can we split the part of "An update to test cases to include the -ppc-vsr-nums-as-vr -ppc-asm-full-reg-names llc options" into a separate patch and commit it first,
so that it would be clearer to see what is the intended change from xxlor-> xscpsgndp in all the testcases?

This revision was automatically updated to reflect the committed changes.