Page MenuHomePhabricator

[PowerPC] Lump the constants to save one addis for each constant access
Needs ReviewPublic

Authored by steven.zhang on Nov 9 2020, 1:00 AM.

Details

Reviewers
nemanjai
MaskRay
stefanp
jsji
masoud.ataei
Group Reviewers
Restricted Project
Summary

For now, we are placing the constant into TOC and whenever it is accessed, we need addis/addi + load. See:

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

And this is what we have now:

        addis 2, 12, .TOC.-.Lfunc_gep0@ha
        addi 2, 2, .TOC.-.Lfunc_gep0@l
.Lfunc_lep0:
        .localentry     X, .Lfunc_lep0-.Lfunc_gep0
# %bb.0:                                # %entry
        addis 3, 2, .LCPI0_0@toc@ha
        lfd 0, .LCPI0_0@toc@l(3)             #<-- addi is folding into lfd
        addis 3, 2, .LCPI0_1@toc@ha
        xsmuldp 0, 1, 0
        lfd 1, .LCPI0_1@toc@l(3)
        addis 3, 2, .LCPI0_2@toc@ha
        xsadddp 0, 0, 1
        lfd 1, .LCPI0_2@toc@l(3)
        addis 3, 2, .LCPI0_3@toc@ha
        xsmuldp 0, 0, 1
        lfd 1, .LCPI0_3@toc@l(3)
        xsadddp 1, 0, 1
        blr

It can be optimized as grouping all the constants together into RO data section, so that their relative positions are fixed. Then, create a symbol in TOC which point to that data section. The benefit for this optimization is to reduce the GOT size and improve the performance as the addis is saved. It works like this:

        .section        .data.rel.ro,"aw",@progbits
        .p2align        3                               # -- Begin function X
.LCPI0_0:
        .quad   0x402cc28f5c28f5c3              # double 14.380000000000001
        .quad   0x4002b851eb851eb8              # double 2.3399999999999999
        .quad   0x40120c49ba5e353f              # double 4.5119999999999996
        .quad   0x3ff3ae147ae147ae              # double 1.23
.Lfunc_gep0:
        addis 2, 12, .TOC.-.Lfunc_gep0@ha
        addi 2, 2, .TOC.-.Lfunc_gep0@l
.Lfunc_lep0:
        .localentry     X, .Lfunc_lep0-.Lfunc_gep0
# %bb.0:                                # %entry
        addis 3, 2, .LC0@toc@ha
        ld 3, .LC0@toc@l(3)
        lfd 0, 24(3)
        xsmuldp 0, 1, 0
        lfd 1, 16(3)
        xsadddp 0, 0, 1
        lfd 1, 8(3)
        xsmuldp 0, 0, 1
        lfdx 1, 0, 3
        xsadddp 1, 0, 1
        blr

.LC0:
        .tc .LCPI0_0[TC],.LCPI0_0

This optimization has been discussed before. See PowerPC/README.txt for more information.

Lump the constant pool for each function into ONE pic object, and reference
pieces of it as offsets from the start.  For functions like this (contrived
to have lots of constants obviously):

double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

We generate:

_X:
        lis r2, ha16(.CPI_X_0)
        lfd f0, lo16(.CPI_X_0)(r2)
        lis r2, ha16(.CPI_X_1)
        lfd f2, lo16(.CPI_X_1)(r2)
        fmadd f0, f1, f0, f2
        lis r2, ha16(.CPI_X_2)
        lfd f1, lo16(.CPI_X_2)(r2)
        lis r2, ha16(.CPI_X_3)
        lfd f2, lo16(.CPI_X_3)(r2)
        fmadd f1, f0, f1, f2
        blr

It would be better to materialize .CPI_X into a register, then use immediates
off of the register to avoid the lis's.  This is even more important in PIC
mode.

Note that this (and the static variable version) is discussed here for GCC:
http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

Here's another example (the sgn function):
double testf(double a) {
       return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
}

it produces a BB like this:
LBB1_1: ; cond_true
        lis r2, ha16(LCPI1_0)
        lfs f0, lo16(LCPI1_0)(r2)
        lis r2, ha16(LCPI1_1)
        lis r3, ha16(LCPI1_2)
        lfs f2, lo16(LCPI1_2)(r3)
        lfs f3, lo16(LCPI1_1)(r2)
        fsub f0, f0, f1
        fsel f1, f0, f2, f3
        blr

Some limitation:

  • If there is only one constant, we will have one extra load with this patch. But the load could be optimized by linker if it merges the TOC. It is not easy inside compiler to handle it as ISEL is done basing on per BB, and we don't know if there are other constants until other BBs are selected. Any thoughts ?
  • Lump the constant with the same type. Technical speaking, all the constants could be lumped together as far as the alignment is handle carefully.

Diff Detail

Event Timeline

steven.zhang created this revision.Nov 9 2020, 1:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 9 2020, 1:00 AM

Rebase test.

Gentle ping ...

qiucf added a subscriber: qiucf.Dec 16 2020, 10:52 PM

If there is only one constant, we will have one extra load with this patch. But the load could be optimized by linker if it merges the TOC. It is not easy inside compiler to handle it as ISEL is done basing on per BB, and we don't know if there are other constants until other BBs are selected. Any thoughts ?

As discussed, FunctionLoweringInfo is 'global' to SelectionDAG. Can we take advantage of it to record some global information? FunctionLoweringInfo::set has already scanned over each BB, each instruction.

Also, test changes in this patch are really huge. Are they all related?

llvm/test/CodeGen/PowerPC/constant-pool.ll
26

Regression?

steven.zhang planned changes to this revision.Dec 16 2020, 11:28 PM

That means we have to make it as target independent. I will take a look to see if there is nice way to make it as target independent.

Rebase the patch.

If there is only one constant, we will have one extra load with this patch. But the load could be optimized by linker if it merges the TOC. It is not easy inside compiler to handle it as ISEL is done basing on per BB, and we don't know if there are other constants until other BBs are selected. Any thoughts ?

As discussed, FunctionLoweringInfo is 'global' to SelectionDAG. Can we take advantage of it to record some global information? FunctionLoweringInfo::set has already scanned over each BB, each instruction.

The FunctionLoweringInfo won't help as we will do the transformation during the DAGCombine and it could remove/add new constants that cause the out of sync.

Also, test changes in this patch are really huge. Are they all related?

Yes, they are all relative as we are changing the basic of access the constant pool.

steven.zhang added inline comments.Dec 21 2020, 2:32 AM
llvm/test/CodeGen/PowerPC/constant-pool.ll
26

Yes, we will produce an extra load if there is single constant. But the linker will optimize away this extra load.

nemanjai requested changes to this revision.Dec 21 2020, 8:33 AM

Rebase the patch.

Huh? Something wrong with this rebase? Not only is there an issue with getSize() I pointed out, but there are also uses of non-existent functions:

PPCISelLowering.h:740:61: error: only virtual member functions can be marked 'override'
                            unsigned TargetFlags = 0) const override;
                                                            ^~~~~~~~
PPCISelLowering.cpp:2997:12: error: no member named 'getConstantPool' in 'llvm::TargetLowering'; did you mean simply 'getConstantPool'?
    return TargetLowering::getConstantPool(C, DAG, NewAlign, VT, Alignment,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           getConstantPool
PPCISelLowering.cpp:2987:28: note: 'getConstantPool' declared here
SDValue PPCTargetLowering::getConstantPool(const Constant *C, SelectionDAG &DAG,
                           ^
PPCISelLowering.cpp:3018:12: error: no member named 'getConstantPool' in 'llvm::TargetLowering'; did you mean simply 'getConstantPool'?
    return TargetLowering::getConstantPool(C, DAG, NewAlign, VT, Alignment,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           getConstantPool
PPCISelLowering.cpp:2987:28: note: 'getConstantPool' declared here
SDValue PPCTargetLowering::getConstantPool(const Constant *C, SelectionDAG &DAG,
                           ^
3 errors generated.

Also, I don't see adequate testing here (it may be hidden among all the test case changes and I missed it). There should be a test case with multiple constants loaded from the constant pool for all the types as well as for mixed types. AFAICT most of the test changes are single constants getting their own symbol in the TOC - which neither adequately tests this feature nor provides improvements of any kind (as pointed out in the description).

llvm/lib/Target/PowerPC/PPCConstantPoolValue.h
29

If the returned vector is const, why do we copy it?

33

Is this an issue with the particular revision you developed this on? The base class doesn't have this as a virtual member function so compilation fails because of override.

This revision now requires changes to proceed.Dec 21 2020, 8:33 AM

Rebase the patch.

Huh? Something wrong with this rebase? Not only is there an issue with getSize() I pointed out, but there are also uses of non-existent functions:

PPCISelLowering.h:740:61: error: only virtual member functions can be marked 'override'
                            unsigned TargetFlags = 0) const override;
                                                            ^~~~~~~~
PPCISelLowering.cpp:2997:12: error: no member named 'getConstantPool' in 'llvm::TargetLowering'; did you mean simply 'getConstantPool'?
    return TargetLowering::getConstantPool(C, DAG, NewAlign, VT, Alignment,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           getConstantPool
PPCISelLowering.cpp:2987:28: note: 'getConstantPool' declared here
SDValue PPCTargetLowering::getConstantPool(const Constant *C, SelectionDAG &DAG,
                           ^
PPCISelLowering.cpp:3018:12: error: no member named 'getConstantPool' in 'llvm::TargetLowering'; did you mean simply 'getConstantPool'?
    return TargetLowering::getConstantPool(C, DAG, NewAlign, VT, Alignment,
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           getConstantPool
PPCISelLowering.cpp:2987:28: note: 'getConstantPool' declared here
SDValue PPCTargetLowering::getConstantPool(const Constant *C, SelectionDAG &DAG,
                           ^
3 errors generated.

Also, I don't see adequate testing here (it may be hidden among all the test case changes and I missed it). There should be a test case with multiple constants loaded from the constant pool for all the types as well as for mixed types. AFAICT most of the test changes are single constants getting their own symbol in the TOC - which neither adequately tests this feature nor provides improvements of any kind (as pointed out in the description).

There are two parent revisions that this patch depends on. You need to apply them first. I will add a new test to summarize all the pattern we can handle.

The test you are looking for is llvm/test/CodeGen/PowerPC/constant-pool.ll in fact.

There are two parent revisions that this patch depends on. You need to apply them first. I will add a new test to summarize all the pattern we can handle.

Which ones? I don't see them listed in "Parent Revisions"

The test you are looking for is llvm/test/CodeGen/PowerPC/constant-pool.ll in fact.

Not quite. There seems to only be float/double/double double/vector and only for P9/P10. What happens with:

  1. enough constants that the load is widened
  2. what happens with -mcpu=pwr8 to see the impact of this when there are no D-Form loads for vectors

Also, the test case with multiple constants should show the constant pool (including alignment).
A simple example exhibiting this behaviour would be something like

void test(double *ArrD, unsigned short *ArrS, float *ArrF) {
// Ensure that these are still widened to 8b + 4b + 2b
  ArrS[0] = 12;
  ArrS[1] = 44;
  ArrS[2] = 8;
  ArrS[3] = 98;
  ArrS[4] = 271;
  ArrS[5] = 888;
  ArrS[6] = 99;

// These are not vectorized, check 4b alignment
  ArrF[0] += 999.88f;
  ArrF[1] += 861.15f;

// These are vectorized, check 16b alignment
  ArrD[0] = 342.2312;
  ArrD[1] = 664435.988;
  ArrD[3] = 12.222;
  ArrD[4] = 12.222;

// Check 8b alignment
  ArrD[5] += 2377.797889;
}

There are two parent revisions that this patch depends on. You need to apply them first. I will add a new test to summarize all the pattern we can handle.

Which ones? I don't see them listed in "Parent Revisions"

The parent revision is D91050 and D89108.

llvm/lib/Target/PowerPC/PPCConstantPoolValue.h
29

Good catch.

Address comments and add tests in llvm/test/CodeGen/PowerPC/constant-pool.ll to show the padding added for alignment. Also, add the RUN line for Power8 as requested.

The test you are looking for is llvm/test/CodeGen/PowerPC/constant-pool.ll in fact.

Not quite. There seems to only be float/double/double double/vector and only for P9/P10. What happens with:

  1. enough constants that the load is widened

IIUC, we won't have any impact on the optimization of DAG as what we changed is on the step that legalize the constant pool, which is done after DAGCombine. Anyway, I will add this test point with your test.

  1. what happens with -mcpu=pwr8 to see the impact of this when there are no D-Form loads for vectors

It is done.

Also, the test case with multiple constants should show the constant pool (including alignment).

Good suggestion. It is done.

A simple example exhibiting this behaviour would be something like

void test(double *ArrD, unsigned short *ArrS, float *ArrF) {
// Ensure that these are still widened to 8b + 4b + 2b
  ArrS[0] = 12;
  ArrS[1] = 44;
  ArrS[2] = 8;
  ArrS[3] = 98;
  ArrS[4] = 271;
  ArrS[5] = 888;
  ArrS[6] = 99;

// These are not vectorized, check 4b alignment
  ArrF[0] += 999.88f;
  ArrF[1] += 861.15f;

// These are vectorized, check 16b alignment
  ArrD[0] = 342.2312;
  ArrD[1] = 664435.988;
  ArrD[3] = 12.222;
  ArrD[4] = 12.222;

// Check 8b alignment
  ArrD[5] += 2377.797889;
}

Thank for for the test. I have added into constant-pool.ll

steven.zhang planned changes to this revision.Dec 24 2020, 2:06 AM

The dependent patch changed. Need rebase.

Rebase the patch again.