Page MenuHomePhabricator

PowerPC: Optimize SPE double parameter calling setup
ClosedPublic

Authored by jhibbits on Nov 15 2018, 8:38 AM.

Details

Summary

SPE passes doubles the same as soft-float, in register pairs as i32
types. This is all handled by the target-independent layer. However,
this is not optimal when splitting or reforming the doubles, as it
pushes to the stack and loads from, on either side.

For instance, to pass a double argument to a function, assuming the
double value is in r5, the sequence currently looks like this:

evstdd      5, X(1)
lwz         3, X(1)
lwz         4, X+4(1)

Likewise, to form a double into r5 from args in r3 and r4:

stw         3, X(1)
stw         4, X+4(1)
evldd       5, X(1)

This optimizes the fence to use SPE instructions. Now, to pass a double
to a function:

mr          4, 5
evmergehi   3, 5, 5

And to form a double into r5 from args in r3 and r4:

evmergelo   5, 3, 4

This is comparable to the way that gcc generates the double splits.

This also fixes expanding of builtins to libcalls, where the LowerCallTo() code path was generating intermediate illegal type nodes.

Diff Detail

Repository
rL LLVM

Event Timeline

jhibbits created this revision.Nov 15 2018, 8:38 AM

I have applied this patch to the llvm-toolchain-7 package in Debian and did not see any regressions on x86_64 or 32-Bit PowerPC. Additionally, I have included the patches from https://reviews.llvm.org/D49754 and https://reviews.llvm.org/D54409 saw no regressions on x86_64 and 32-bit PowerPC.

All three patches will be part of the next upload of the llvm-toolchain-7 package in Debian unstable which will be version 1:7.0.1~+rc2-9.

nemanjai added inline comments.Dec 29 2018, 1:33 PM
lib/Target/PowerPC/PPCISelLowering.cpp
393 ↗(On Diff #174220)

No need for braces when there is a single statement in the if/else.

7833 ↗(On Diff #174220)

The early exit should be first.

7837 ↗(On Diff #174220)

The indentation is off. Maybe run clang-format on this function. I don't know which editor you use but if you use Vim, you can run :7827,7845 ! clang-format if you have clang-format in your $PATH.
Also, it seems like you might want to check the input type as well - maybe with an assert if it can't be anything other than f64.

7840 ↗(On Diff #174220)

A more descriptive assert message is probably in order.
Also, perhaps it would be clearer if this was rewritten to:

  • Assert that the constant operand value is less than 2
  • Use a ternary operator to select the opcode (or just have one opcode - see above)
7842 ↗(On Diff #174220)

Line too long?

lib/Target/PowerPC/PPCISelLowering.h
203 ↗(On Diff #174220)

Why not just have EXTRACT_SPE_OP and have it take a constant operand that determines Hi/Lo? Also, for both build and extract, it would be good to add a comment that these correspond almost exactly to BUILD_PAIR and EXTRACT_VECTOR_ELT nodes except that the input types are floating point since i64 isn't a legal type for the target.

jhibbits marked 6 inline comments as done.Dec 29 2018, 8:24 PM
jhibbits added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
7840 ↗(On Diff #174220)

This assert was during some debugging, and I forgot to remove it. However, it does make sense to have an assert along the lines of your suggestion. I"ll make such a change.

7842 ↗(On Diff #174220)

Yeah, just a hair (one character). Reformatting with clang-format will fix that.

lib/Target/PowerPC/PPCISelLowering.h
203 ↗(On Diff #174220)

I'm not sure how to pass a constant through to the tablegen'd layer. These two pseudo-ops are just light wrappers to EVMERGEHI and MR. If there is a way to pass a constant and do the switch down in that layer, then that's acceptable as well.

nemanjai added inline comments.Dec 30 2018, 5:12 AM
lib/Target/PowerPC/PPCISelLowering.h
203 ↗(On Diff #174220)

Sure, there are existing examples. Vector conversion custom nodes are probably quite similar to what you need:

PPCISD::SINT_VEC_TO_FP
PPCISD::UINT_VEC_TO_FP

But there will be others.

jhibbits updated this revision to Diff 181708.Jan 14 2019, 8:37 PM
jhibbits edited the summary of this revision. (Show Details)

Fix expanding builtins to libcalls. Remove the need for intermediate illegal types in expanding and pairing the arguments and return values.

Unfortunately arcanist on my machine seems to be broken working with this repository, so I had to upload a diff manually.

jhibbits updated this revision to Diff 182380.Jan 17 2019, 12:59 PM

Fix argument indices for indexing through OutVals[]. It should be the argument index, not the physical register index.

Hi Justin, I'm watching your work and used your patches to bring SPE into my CLANG for OS-9.
The OutVals[] issue is what I found yesterday as well by debugging the CLANG part.
There is a 2. location of this in PPCTargetLowering::LowerReturn()

 // Copy the result values into the output registers.
for (unsigned i = 0, realI = 0; i != RVLocs.size(); ++i, ++realI) {
  CCValAssign &VA = RVLocs[i];
  assert(VA.isRegLoc() && "Can only return in registers!");

  SDValue Arg = OutVals[realI];

Regards, Kei

Hi Kei, thanks for that. I've updated my code, and will post an updated diff tomorrow.

jhibbits updated this revision to Diff 182715.Jan 20 2019, 1:48 PM

One more argument index fixup.

I have a question:
When compiling

double a;
void func(double x) {
a = x;
}

It is generating

lis 5, a@ha
evstdd 3, a@l(5)

But as the evstdd and evldd are having only 8bit (5bit real) UIMM offset, this code is not working, as the offset a@lo is not known to be 8bit only.
Is this issue already addressed in a patch and I simply haven't seen this, or is this still a missing part?

I would expect

lis 5, a@ha
li 4, a@l
evstddx 3, 4, 5

It seems that this is checked/generated by the PPCISelLowering.cpp SelectAddressRegReg() and/or SelectAddressRegImm()
Actually I'm trying to find out I have enough information in the SDValue N to identify this as a SPE load/store.
Do I missed a patch for this?
Thanks, Kei

Hi Kei, yes you need the patch in D54409, which fixes the offset handling. It should fix your case as well, but I didn't test foreign addresses.

The Patch D54409 is only handling the variables on the stack named in the code as "framedata". I'm going on to find out, how to manage this for global variables. SelectAddressRegReg() and SelectAddressRegImm() are doing this, but there is no information about the Target data. Maybe it needs to be decided somewhere else.

Hmm, I have not yet tried to explore this, but I get a feeling a regression appeared somewhere during the patch iterations. Either this or D54409.
At this point I am consistently getting weird generated instructions for __floatundidf from compiler-rt (compiling with freebsd & -O3), yet I have a correct one in my files.
What makes it strange is that the logs show that the correct and first incorrect examples were generated by the same compiler (binary file) with the same flags, yet I no longer can get the correct one.

Does it reproduce for anyone?

Supposedly correct one:

.set back_chain, -0x20
.set var_10, -0x10
.set var_8, -8
94 21 FF E0                       stwu      r1, back_chain(r1)
3C A0 45 30                       lis       r5, 0x4530
90 61 00 1C                       stw       r3, 0x20+var_8+4(r1)
90 A1 00 18                       stw       r5, 0x20+var_8(r1)
3C A0 43 30                       lis       r5, 0x4330
10 61 1B 01                       evldd     r3, 0x20+var_8(r1)
90 81 00 14                       stw       r4, 0x20+var_10+4(r1)
3C 80 80 02                       lis       r4, -0x7FFE
10 84 DB 01                       evldd     r4, 0xD8(r4) ; note this one
90 A1 00 10                       stw       r5, 0x20+var_10(r1)
10 A1 13 01                       evldd     r5, 0x20+var_10(r1)
10 63 22 E0                       efdadd    r3, r3, r4
10 83 2A E0                       efdadd    r4, r3, r5
10 64 22 2C                       evmergehi r3, r4, r4
38 21 00 20                       addi      r1, r1, 0x20
4E 80 00 20                       blr

What I get now:

94 21 FF E0                       stwu      r1, back_chain(r1)
3C A0 45 30                       lis       r5, 0x4530
90 61 00 1C                       stw       r3, 0x20+var_8+4(r1)
90 A1 00 18                       stw       r5, 0x20+var_8(r1)
3C A0 43 30                       lis       r5, 0x4330
10 61 1B 01                       evldd     r3, 0x20+var_8(r1)
90 81 00 14                       stw       r4, 0x20+var_10+4(r1)
3C 80 80 02                       lis       r4, -0x7FFE
10 8C BB 01                       evldd     r4, 0xB8(r12) ; note this one
90 A1 00 10                       stw       r5, 0x20+var_10(r1)
10 A1 13 01                       evldd     r5, 0x20+var_10(r1)
10 63 22 E0                       efdadd    r3, r3, r4
10 83 2A E0                       efdadd    r4, r3, r5
10 64 22 2C                       evmergehi r3, r4, r4
38 21 00 20                       addi      r1, r1, 0x20
4E 80 00 20                       blr

or

94 21 FF E0                       stwu      r1, back_chain(r1)
3C A0 45 30                       lis       r5, 0x4530
90 61 00 1C                       stw       r3, 0x20+var_8+4(r1)
90 A1 00 18                       stw       r5, 0x20+var_8(r1)
3C A0 43 30                       lis       r5, 0x4330
10 61 1B 01                       evldd     r3, 0x20+var_8(r1)
90 81 00 14                       stw       r4, 0x20+var_10+4(r1)
3C 80 80 01                       lis       r4, -0x7FFF
10 9F 1B 01                       evldd     r4, 0x18(r31) ; note this one
90 A1 00 10                       stw       r5, 0x20+var_10(r1)
10 A1 13 01                       evldd     r5, 0x20+var_10(r1)
10 63 22 E0                       efdadd    r3, r3, r4
10 83 2A E0                       efdadd    r4, r3, r5
10 64 22 2C                       evmergehi r3, r4, r4
38 21 00 20                       addi      r1, r1, 0x20
4E 80 00 20                       blr

Reference source:

double floatundidf(unsigned long long a)
{
    static const double twop52 = 4503599627370496.0; // 0x1.0p52
    static const double twop84 = 19342813113834066795298816.0; // 0x1.0p84
    static const double twop84_plus_twop52 = 19342813118337666422669312.0; // 0x1.00000001p84

    union { uint64_t x; double d; } high = { .d = twop84 };
    union { uint64_t x; double d; } low = { .d = twop52 };

    high.x |= a >> 32;
    low.x |= a & UINT64_C(0x00000000ffffffff);

    const double result = (high.d - twop84_plus_twop52) + low.d;
    return result;
}

As promised I have modified the SelectAddressRegReg() in PPCISelLowering.cpp to create correct evldd(x) and evstdd(x) instructions when accessing global variables.

bool PPCTargetLowering::SelectAddressRegReg(SDValue N, SDValue &Base,

                                          SDValue &Index,
                                          SelectionDAG &DAG) const {
int16_t imm = 0;
if (N.getOpcode() == ISD::ADD) {
  if (hasSPE()) {
    // Is there any SPE load/store (f64) which can't handle 16bit offset?
    for (SDNode::use_iterator UI = N->use_begin(), E = N->use_end();
        UI != E; ++UI) {
      if (UI->getOpcode() == ISD::STORE) {
        // Store has the type Operand[1]
        if ((UI->getNumOperands() >= 2) 
          && (UI->getOperand(1).getSimpleValueType() == MVT::f64)) {
            // This is a f64 store with SPE
            // The instruction evstdd can only handle 8bit offset
            Base = N.getOperand(0);
            Index = N.getOperand(1);
            return true;
        }
      } else
      if (UI->getOpcode() == ISD::LOAD) {
        // Load has the type in Values[0]
        if ((UI->getNumValues() >= 1) 
          && (UI->getSimpleValueType(0) == MVT::f64)) {
            // This is a f64 load with SPE
            // The instruction evldd can only handle 8bit offset
            Base = N.getOperand(0);
            Index = N.getOperand(1);
            return true;
        }
      }
    }
  }
  if (isIntS16Immediate(N.getOperand(1), imm))
    return false;    // r+i
  if (N.getOperand(1).getOpcode() == PPCISD::Lo)
    return false;    // r+i

....
The modification starts with if(hasSPE()) { and ends with the fitting }

What have I done: The SelectAddressRegReg() function is the central decider for "offset+r4" or "r4+r5", but it only knows about 16bit offsets. evldd and evstdd are using 8bit (5bit usable) offsets. Therefore they can't be used when accessing global variables. The patch in here is now looking in the useList if it is a SPE with Load or Store for f64 data. Then it tells, that it must be Register+Register addressing, which is then automatically chaning to evlddx / evstddx.
I have tested this with OS-9 running on a P2020 (e500v2).

Would you like to put my patch into your patch D54583?
Kei

Hi Kei, thanks! I'll gladly add it to one of the patches. It might fit better with D54409, since the purpose of that is to fix the evldd handling in general.

Hi @vit9696, it looks like Kei's patch fixes the issue you're seeing as well.

I placed the info in this thread, as D54583 already has PPCISelLowering.cpp in it. But in real it better fits into D54409.

Hi @vit9696, it looks like Kei's patch fixes the issue you're seeing as well.

Yes, I am currently testing the code with the latest changes, and so far it appears to be resolved. Thanks.

nemanjai requested changes to this revision.Feb 19 2019, 4:13 PM

Also, as we discussed on IRC, please provide full context to make it easier to review.
I am actually OK with this revision as long as the comments are addressed, but I'm just requesting another revision to ensure all the comments have been addressed.

lib/Target/PowerPC/PPCISelLowering.cpp
3164 ↗(On Diff #182715)

Since this isn't local to a very small loop, I think a more descriptive name is in order. Perhaps RegIdx?

3165 ↗(On Diff #182715)

I think it is more obvious to use sizeof(MCPhysReg) rather than sizeof(HiRegList[0]).

3196 ↗(On Diff #182715)

I can't say I am overly familiar with this code, but we don't need to call State.AllocateReg() for the low register?

3550 ↗(On Diff #182715)

I don't think it's useful to duplicate a comment down both paths in a condition.

3555 ↗(On Diff #182715)

Perhaps something like

assert(i + 1 < e &&
       "No second half of double precision argument");
5520 ↗(On Diff #182715)

i, realI, j are not adequate given this rather complex iteration space... Please name them for what they are meant to refer to.

6755 ↗(On Diff #182715)

Can we be consistent about how we specify the index to the extract between here and above? I am OK with either approach as long as we use the same.

lib/Target/PowerPC/PPCInstrInfo.td
237 ↗(On Diff #182715)

Seems like we don't need to separately state that the types are the same and that they're the same size?

This revision now requires changes to proceed.Feb 19 2019, 4:13 PM
jhibbits updated this revision to Diff 188756.Feb 28 2019, 10:32 AM
jhibbits marked 4 inline comments as done.

Address feedback. Provide full context for diffs.

I found an issue with the SPE compare operations. The result of a efdcmpeq , efdcmpgt and efdcmplt is every time the GT-Bit in the Condition Register. This is adressed in one place of the PPCISelDAGToDAG.cpp, but not addressed for a second case of the code generation.
The diff of PPCISelDAGToDAG.cpp is:

--- PPCISelDAGToDAG.cpp 2019-03-12 15:35:45.000000000 +0100
+++ "\\ellcc\\PPCISelDAGToDAG.cpp"      2019-03-27 10:52:21.088326000 +0100
@@ -5039,6 +5038,32 @@ void PPCDAGToDAGISel::Select(SDNode *N)
       PCC |= getBranchHint(PCC, FuncInfo, N->getOperand(4));

     SDValue CondCode = SelectCC(N->getOperand(2), N->getOperand(3), CC, dl);
+
+    if (PPCSubTarget->hasSPE() && N->getOperand(2).getValueType().isFloatingPoint()) {
+      // For SPE instructions, the result is in GT bit of the CR
+      switch(CC) {
+        case ISD::SETOEQ:
+        case ISD::SETEQ:
+        case ISD::SETOLT:
+        case ISD::SETLT:
+        case ISD::SETOGT:
+        case ISD::SETGT:
+                PCC = PPC::PRED_GT;
+                break;
+        case ISD::SETUNE:
+        case ISD::SETNE:
+        case ISD::SETULE:
+        case ISD::SETLE:
+        case ISD::SETUGE:
+        case ISD::SETGE:
+                PCC = PPC::PRED_LE;
+                break;
+        default:
+                break;
+      }
+    }
+
+
     SDValue Ops[] = { getI32Imm(PCC, dl), CondCode,
                         N->getOperand(4), N->getOperand(0) };
     CurDAG->SelectNodeTo(N, PPC::BCC, MVT::Other, Ops);

The following testprogram is checking all methods. And the LIBC++ tests are now correctly compiled and running (about 1000 of 5800 failed before).

#include <stdio.h>

int teq(double a, double b)
{
    printf("%lf == %lf\n",a,b);
    if (a == b)
    {
      printf("equal\n");
      return 1;
    }
    printf("!equal\n");
    return 0;
}

int tne(double a, double b)
{
    printf("%lf != %lf\n",a,b);
    if (a != b)
    {
      printf("notequal\n");
      return 1;
    }
    printf("!notequal\n");
    return 0;
}
int tgt(double a, double b)
{
    printf("%lf > %lf\n",a,b);
    if (a > b)
    {
      printf("greater than\n");
      return 1;
    }
    printf("!greater than\n");
    return 0;
}
int tge(double a, double b)
{
    printf("%lf >= %lf\n",a,b);
    if (a >= b)
    {
      printf("greater equal\n");
      return 1;
    }
    printf("!greater equal\n");
    return 0;
}
int tlt(double a, double b)
{
    printf("%lf < %lf\n",a,b);
    if (a < b)
    {
      printf("less than\n");
      return 1;
    }
    printf("!less than\n");
    return 0;
}

int tle(double a, double b)
{
    printf("%lf <= %lf\n",a,b);
    if (a <= b)
    {
      printf("less equal\n");
      return 1;
    }
    printf("!less equal\n");
    return 0;
}

int main()
{
    teq(5.5,5.5);
    teq(5.5,5.6);
        
    tne(5.5,5.6);
    tne(5.5,5.5);
    
    tgt(5.5,5.6);
    tgt(5.5,5.5);
    tgt(5.5,5.4);
    
    tge(5.5,5.6);
    tge(5.5,5.5);
    tge(5.5,5.4);
    
    tlt(5.5,5.6);
    tlt(5.5,5.5);
    tlt(5.5,5.4);
    
    tle(5.5,5.6);
    tle(5.5,5.5);
    tle(5.5,5.4);
        
    return 0;
}

The result is:

$ eq
5.500000 == 5.500000
equal
5.500000 == 5.600000
!equal
5.500000 != 5.600000
notequal
5.500000 != 5.500000
!notequal
5.500000 > 5.600000
!greater than
5.500000 > 5.500000
!greater than
5.500000 > 5.400000
greater than
5.500000 >= 5.600000
!greater equal
5.500000 >= 5.500000
greater equal
5.500000 >= 5.400000
greater equal
5.500000 < 5.600000
less than
5.500000 < 5.500000
!less than
5.500000 < 5.400000
!less than
5.500000 <= 5.600000
less equal
5.500000 <= 5.500000
less equal
5.500000 <= 5.400000
!less equal

I'm not sure if this fix is good for D54583 or if you like to create a new one.

Best regards, Kei

@kthomsen can you create a new revision just for that diff?

lib/Target/PowerPC/PPCISelLowering.cpp
3165 ↗(On Diff #182715)

This is often spelled out in a macro as nitems(), or sizeofArray() (or similar).

@jhibbits I don't know how to create a new revision here. My idea is to handle this fix via you, as you are already known for the SPE modifications.

@jhibbits I don't know how to create a new revision here. My idea is to handle this fix via you, as you are already known for the SPE modifications.

Fair enough, I'll create a new revision tonight for this change.

LGTM. The remaining comments are stylistic nits that can be addressed on the commit and do not require another round of review. Thank you for your patience with this review.

lib/Target/PowerPC/PPCISelLowering.cpp
3215 ↗(On Diff #188756)

I don't really understand why this is handled differently than the parameters in terms of how the allocation is done. It is perhaps a good indicator that a comment explaining the difference would be useful.

3578 ↗(On Diff #188756)

I think it would be good to keep using the Hi/Lo naming here as well and make it very obvious that you're adding two live-in registers and adding copies from them into virtual regs. Something along the lines of:

if (VA.getLocVT() == MVT::f64 && Subtarget.hasSPE()) {
  assert(i + 1 < e && "No second half of double precision argument");
  unsigned RegLo = MF.addLiveIn(VA.getLocReg(), RC);
  unsigned RegHi = MF.addLiveIn(ArgLocs[++i].getLocReg(), RC);
  SDValue ArgValueLo = DAG.getCopyFromReg(Chain, dl, RegLo, MVT::i32);
  SDValue ArgValueHi = DAG.getCopyFromReg(Chain, dl, RegHi, MVT::i32);

  if (!Subtarget.isLittleEndian())
    std::swap (ArgValueLo, ArgValueHi);
  ArgValue = DAG.getNode(PPCISD::BUILD_SPE64, dl, MVT::f64, ArgValueLo, 
                         ArgValueHi);
} else {
  // existing code ...
}
5545 ↗(On Diff #188756)

I think it would be nice to add a comment explaining what the induction variables track (correct the below if it's wrong):

i - Tracks the index into the list of registers allocated for the call
RealArgIdx - Tracks the index into the list of actual function arguments
j - Tracks the index into the list of byval arguments
5600 ↗(On Diff #188756)

Minor nit: this doesn't really match the naming convention. Perhaps IsLE?

6756 ↗(On Diff #188756)

Similar comment for this code as above regarding induction variables and the naming of isLittleEndian.

lib/Target/PowerPC/PPCISelLowering.h
1114 ↗(On Diff #188756)

This does not appear to be used. Perhaps an artifact remaining from a previous revision?

lib/Target/PowerPC/PPCInstrInfo.td
236 ↗(On Diff #188756)

I think all the types have to be exact - i32 inputs, f64 output. Might as well make that explicit: [SDTCisVT<0, f64>, SDTCisVT<1, i32>, SDTCisVT<2, i32>]

Similarly below.

nemanjai accepted this revision.Mar 31 2019, 3:09 AM

Forgot to select Accept Revision from the pulldown.

This revision is now accepted and ready to land.Mar 31 2019, 3:09 AM

@jhibbits @kthomsen Sorry for the delay, I checked the change in PPCISelDAGToDAG.cpp, and it indeed fixes the issue. So far I ran into no other bugs and have no objection of this getting merged.

jhibbits marked 7 inline comments as done.Apr 2 2019, 1:12 PM
jhibbits added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
3215 ↗(On Diff #188756)

Yeah, they're identical, but on second look they shouldn't be. Return values can only be in R3 and R4, so they should be identical in every way except the register list.

lib/Target/PowerPC/PPCISelLowering.h
1114 ↗(On Diff #188756)

Yes, it's an artifact.

This revision was automatically updated to reflect the committed changes.
jhibbits marked 2 inline comments as done.
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2019, 8:15 PM