This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use alias analysis in the load/store optimization pass.
ClosedPublic

Authored by mcrosier on Mar 13 2017, 11:00 AM.

Details

Summary

This allows the optimization to rearrange loads and stores more aggressively to form more load/store pairs. Saw a number of fairly significant code size reductions. Performance testing in flight.
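
To illustrate why alias information lets more pairs form, here is a toy model. It is not the code in the AArch64 load/store optimizer, and every name in it (`MemOp`, `can_pair`, the two `may_alias_*` helpers) is hypothetical. It assumes that accesses to differently named objects never overlap, which stands in for what a real alias analysis would have to prove.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str       # "ldr" or "str"
    obj: str        # hypothetical name of the underlying memory object
    offset: int     # byte offset into that object
    size: int = 8   # access width in bytes


def may_alias_conservative(a, b):
    # No alias information: assume any two accesses might overlap.
    return True


def may_alias_precise(a, b):
    # Toy stand-in for alias analysis (assumption: distinct objects never overlap).
    if a.obj != b.obj:
        return False
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


def can_pair(ops, i, j, may_alias):
    """Can ops[j] be hoisted up to ops[i] and merged into one ldp/stp?"""
    first, second = ops[i], ops[j]
    if (first.kind, first.obj, first.size) != (second.kind, second.obj, second.size):
        return False
    if abs(first.offset - second.offset) != first.size:
        return False  # the two accesses are not adjacent
    for mid in ops[i + 1:j]:
        # Two loads never conflict; anything involving a store might.
        involves_store = mid.kind == "str" or second.kind == "str"
        if involves_store and may_alias(mid, second):
            return False
    return True


ops = [MemOp("ldr", "A", 0), MemOp("str", "B", 0), MemOp("ldr", "A", 8)]
print(can_pair(ops, 0, 2, may_alias_conservative))  # False
print(can_pair(ops, 0, 2, may_alias_precise))       # True
```

In this sketch the intervening store to `B` blocks the pair only when the alias query has to answer conservatively.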

Chad

Diff Detail

Repository
rL LLVM

Event Timeline

mcrosier created this revision. Mar 13 2017, 11:00 AM
gberry edited edge metadata. Mar 13 2017, 11:25 AM

I would be interested to see load/store pair stats differences and compile time impact of this change.

I'll work on getting the compile-time difference, Geoff.

Here are the static opcode diffs for SPEC2000 for those benchmarks with more than 5 static instructions removed:

> ./spec2000/eon/diffs/eon_base.arm_linux.diff <

Opcode static count diff summary:

   -30  ldr x [x #]
   -14  str x [x #]
    -4  str w [x #]
    -2  ldr d [x #]
    -2  ldr q [x #]
    -2  stp w w [x #]
    -2  str q [x #]
     1  ldp d d [x #]
     1  ldp q q [x #]
     1  stp q q [x #]
     7  ldp x x [x #]
     9  stp x x [x #]
    13  mov x x
-------------------------
    32  added (excluding nops)
    56  removed (excluding nops)
   -24  net (excluding nops)
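
The three footer lines of each summary follow from the per-opcode rows: positive counts are summed as "added", negative counts as "removed", and "net" is the difference. A minimal sketch that recomputes the eon footer from its rows (the `summarize` helper is hypothetical and not part of the tooling that produced these diffs):

```python
def summarize(rows):
    """Return (added, removed, net) for a block of '<count>  <opcode>' rows."""
    added = removed = 0
    for line in rows.strip().splitlines():
        count = int(line.split()[0])
        if count > 0:
            added += count
        else:
            removed -= count  # count is negative, so this accumulates its magnitude
    return added, removed, added - removed


EON_ROWS = """
   -30  ldr x [x #]
   -14  str x [x #]
    -4  str w [x #]
    -2  ldr d [x #]
    -2  ldr q [x #]
    -2  stp w w [x #]
    -2  str q [x #]
     1  ldp d d [x #]
     1  ldp q q [x #]
     1  stp q q [x #]
     7  ldp x x [x #]
     9  stp x x [x #]
    13  mov x x
"""

print(summarize(EON_ROWS))  # (32, 56, -24)
```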

> ./spec2000/twolf/diffs/twolf_base.arm_linux.diff <

Opcode static count diff summary:

    -8  str w [x #]
    -7  ldr w [x #]
    -3  ldrsw x [x #]
     1  sxtw x w
     1  ldpsw x x [x #]
     4  ldp w w [x #]
     4  stp w w [x #]
-------------------------
    10  added (excluding nops)
    18  removed (excluding nops)
    -8  net (excluding nops)

> ./spec2000/gcc/diffs/cc1_base.arm_linux.diff <

Opcode static count diff summary:

   -38  ldr x [x #]
   -31  str x [x #]
   -12  str w [x #]
    -8  ldr w [x #]
     4  ldp w w [x #]
     5  stp w w [x #]
    16  stp x x [x #]
    19  ldp x x [x #]
-------------------------
    44  added (excluding nops)
    89  removed (excluding nops)
   -45  net (excluding nops)

> ./spec2000/perlbmk/diffs/perlbmk_base.arm_linux.diff <

Opcode static count diff summary:

   -32  ldr x [x #]
   -16  str x [x #]
    -8  ldr w [x #]
    -2  str w [x #]
     1  stp w w [x #]
     4  ldp w w [x #]
     8  stp x x [x #]
    16  ldp x x [x #]
-------------------------
    29  added (excluding nops)
    58  removed (excluding nops)
   -29  net (excluding nops)

> ./spec2000/crafty/diffs/crafty_base.arm_linux.diff <

Opcode static count diff summary:

   -22  str x [x #]
    -8  str w [x #]
    -6  ldr x [x #]
     3  ldp x x [x #]
     4  stp w w [x #]
    11  stp x x [x #]
-------------------------
    18  added (excluding nops)
    36  removed (excluding nops)
   -18  net (excluding nops)

> ./spec2000/mesa/diffs/mesa_base.arm_linux.diff <

Opcode static count diff summary:

   -66  ldr s [x #]
   -36  str q [x #]
   -24  str s [x #]
   -14  ldr w [x #]
   -12  str q [x #] #
    -2  str w [x #]
    -2  ldr q [x #]
     1  ldp q q [x #]
     1  stp w w [x #]
     7  ldp w w [x #]
    12  stp s s [x #]
    12  add x x #
    24  stp q q [x #]
    33  ldp s s [x #]
-------------------------
    90  added (excluding nops)
   156  removed (excluding nops)
   -66  net (excluding nops)

> ./spec2000/vortex/diffs/vortex_base.arm_linux.diff <

Opcode static count diff summary:

  -168  str x [x #]
   -38  str w [x #]
   -16  ldr w [x #]
    -4  ldr x [x #]
     2  ldp x x [x #]
     8  ldp w w [x #]
    15  stp w w [x #]
    86  stp x x [x #]
-------------------------
   111  added (excluding nops)
   226  removed (excluding nops)
  -115  net (excluding nops)

Here are the static opcode diffs for SPEC2006 for those benchmarks with more than 5 static instructions removed:

> ./spec2006/h264ref/diffs/h264ref_base.arm_linux.diff <

Opcode static count diff summary:

   -95  str w [x #]
   -62  str x [x #]
   -24  ldr w [x #]
    -6  ldr x [x #]
    -2  ldr q [x #]
    -2  str q [x #]
    -1  scvtf s w
    -1  fdiv s s s
    -1  adrp x  
    -1  fcvtzs w s
    -1  add x x #
     1  ldp q q [x #]
     1  stp q q [x #]
     3  ldp x x [x #]
    12  ldp w w [x #]
    33  stp x x [x #]
    43  stp w w [x #]
-------------------------
    93  added (excluding nops)
   196  removed (excluding nops)
  -103  net (excluding nops)

> ./spec2006/povray/diffs/povray_base.arm_linux.diff <

Opcode static count diff summary:

   -65  str x [x #]
   -42  ldr x [x #]
    -8  ldr s [x #]
    -6  ldr d [x #]
    -6  str s [x #]
    -6  str w [x #]
    -5  ldr w [x #]
    -4  ldr q [x #]
    -4  str q [x #]
    -4  str d [x #]
    -1  ldrsw x [x #]
     1  sxtw x w
     2  stp q q [x #]
     2  ldp q q [x #]
     2  stp w w [x #]
     2  stp d d [x #]
     3  ldp d d [x #]
     3  stp s s [x #]
     3  ldp w w [x #]
     4  ldp s s [x #]
    21  ldp x x [x #]
    33  stp x x [x #]
-------------------------
    76  added (excluding nops)
   151  removed (excluding nops)
   -75  net (excluding nops)

> ./spec2006/gcc/diffs/gcc_base.arm_linux.diff <

Opcode static count diff summary:

   -62  str x [x #]
   -20  ldr x [x #]
   -14  str w [x #]
    -2  ldr w [x #]
     1  ldp w w [x #]
     7  stp w w [x #]
    10  ldp x x [x #]
    31  stp x x [x #]
-------------------------
    49  added (excluding nops)
    98  removed (excluding nops)
   -49  net (excluding nops)

> ./spec2006/perlbench/diffs/perlbench_base.arm_linux.diff <

Opcode static count diff summary:

   -50  ldr x [x #]
   -19  str x [x #]
   -10  ldr w [x #]
    -8  str w [x #]
     3  stp w w [x #]
     5  ldp w w [x #]
    10  stp x x [x #]
    25  ldp x x [x #]
-------------------------
    43  added (excluding nops)
    87  removed (excluding nops)
   -44  net (excluding nops)

> ./spec2006/dealII/diffs/dealII_base.arm_linux.diff <

Opcode static count diff summary:

  -362  str x [x #]
   -64  str w [x #]
   -42  ldr x [x #]
   -12  ldr q [x #]
    -8  str q [x #]
    -2  ldr w [x #]
    -2  str d [x #]
    -1  sub x x #
     1  stp x x [x #] #
     1  ldp w w [x #]
     1  stp d d [x #]
     4  stp q q [x #]
     6  ldp q q [x #]
    21  ldp x x [x #]
    32  stp w w [x #]
   180  stp x x [x #]
-------------------------
   246  added (excluding nops)
   493  removed (excluding nops)
  -247  net (excluding nops)

> ./spec2006/xalancbmk/diffs/Xalan_base.arm_linux.diff <

Opcode static count diff summary:

   -86  str x [x #]
   -23  ldr x [x #]
    -8  str q [x #]
    -4  ldr q [x #]
     1  mov x x
     2  ldp q q [x #]
     4  stp q q [x #]
    11  ldp x x [x #]
    43  stp x x [x #]
-------------------------
    61  added (excluding nops)
   121  removed (excluding nops)
   -60  net (excluding nops)

> ./spec2006/gobmk/diffs/gobmk_base.arm_linux.diff <

Opcode static count diff summary:

   -36  str w [x #]
    -4  ldr w [x #]
    -2  ldr q [x #]
     1  ldp q q [x #]
     2  ldp w w [x #]
    18  stp w w [x #]
-------------------------
    21  added (excluding nops)
    42  removed (excluding nops)
   -21  net (excluding nops)

Here are the relative stats for SPEC2000/SPEC2006 combined, using LLVM statistics:

Message                                                                                    Diff  %age
--------------------------------------------------------------------------------------  -------  -------
aarch64-ldst-opt - Number of load/store pair instructions generated                        +969  1.63%
aarch64-ldst-opt - Number of loads from stores promoted                                    +232  362.50%
aarch64-ldst-opt - Number of narrow zero stores promoted                                    +20  2.41%
aarch64-ldst-opt - Number of post-index updates folded                                       -3  -0.12%
asm-printer - Number of machine instrs printed                                             -974  -0.03%
assembler - Number of emitted object file bytes                                           -3616  -0.01%
assembler - Number of evaluated fixups                                                       +8  0.00%
assembler - Number of fragment layouts                                                       +0  0.00%
basicaa - Number of times a GEP is decomposed                                           +125690  0.28%
basicaa - Number of times the limit to decompose GEPs is reached                            +43  0.04%
bdce - Number of instructions removed (unused)                                               +0  0.00%
bdce - Number of instructions trivialized (dead bits)                                        +0  0.00%
bitcode-reader - Number of MDStrings loaded                                                  +0  0.00%
branch-relaxation - Number of conditional branches relaxed                                   +0  0.00%
branchfolding - Number of block tails merged                                                 -6  -0.01%
mccodeemitter - Number of MC fixups created.                                                 +8  0.00%
mccodeemitter - Number of MC instructions emitted.                                         -974  -0.03%
mcexpr - Number of MCExpr evaluations                                                       +16  0.00%
memory-builtins - Number of arguments with unsolved size and offset                         +32  0.04%
memory-builtins - Number of load instructions with unsolved size and offset                 +84  0.16%
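
The table lists only differences. Assuming the "%age" column is the difference relative to the count before the change (an assumption; the thread does not define it), the absolute counts can be estimated. The `implied_counts` helper below is hypothetical, and the results are approximate because the percentages are rounded to two decimals.

```python
def implied_counts(diff, pct):
    """Estimate (before, after) counts from a diff and its percentage change."""
    before = round(diff / (pct / 100.0))
    return before, before + diff


# "Number of loads from stores promoted": +232 at 362.50%
print(implied_counts(232, 362.50))  # (64, 296)

# "Number of load/store pair instructions generated": +969 at 1.63%
# gives a baseline of roughly 59,000 pairs, with the exact value hidden by rounding.
print(implied_counts(969, 1.63))
```

Read this way, the large percentage on promoted loads reflects a small baseline rather than a large absolute change.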

MatzeB edited edge metadata. Mar 13 2017, 1:31 PM

Looks good to me. Could you do the sanity checking that we do the right thing in CodeGen and compute the AliasAnalysis information only once for all of CodeGen and not repeat it for different passes? (i.e. -debug-pass=Executions should only show them computed once for all CodeGen passes). Feel free to delegate this task to https://reviews.llvm.org/D30839 if you want ;-)

Sure, Matthias. After this change the Function AA result is freed after the AArch64 load/store optimization pass, rather than after the Machine LICM pass (i.e., the change only extends the lifetime of the AA info and does not require it to be recomputed).

Performance results for SPEC2000/2006 are neutral, so this is mostly just a code size reduction optimization.

MatzeB accepted this revision. Mar 14 2017, 10:54 AM

That's fine; I just wanted to make sure we do not compute it twice because something in codegen fails to preserve.

The code change itself is obvious. LGTM.

This revision is now accepted and ready to land. Mar 14 2017, 10:54 AM

Compile-time regression tests on the llvm-test-suite and SPEC200X resulted in a net 1.288% improvement in compile time. I suspect that's really just noise, but the main takeaway is that no regressions were identified. Will commit soon.

This revision was automatically updated to reflect the committed changes.