This is an archive of the discontinued LLVM Phabricator instance.

[RAGreedy] Enable -consider-local-interval-cost for AArch64
ClosedPublic

Authored by sanwou01 on Oct 25 2019, 10:45 AM.

Details

Summary

The greedy register allocator occasionally decides to insert a large number of
unnecessary copies, see below for an example. The -consider-local-interval-cost
option (which X86 already enables by default) fixes this. We enable this option
for AArch64 only after receiving feedback that this change is not beneficial for
PowerPC.

We evaluated the impact of this change on compile time, code size and
performance benchmarks.

This option has a small impact on compile time, measured on CTMark: a 0.1%
geomean regression at -O1 and -O2, and 0.2% geomean at -O3, with at most 0.5%
on individual benchmarks.

The effect on both code size and performance for the LLVM test suite on AArch64
is nil on the geomean, with individual outliers (ignoring short exec_times)
ranging between:

               best     worst
size..text     -3.3%    +0.0%
exec_time      -5.8%    +2.3%

On SPEC CPU® 2017 (compiled for AArch64) there is a minor reduction (-0.2% at
most) in code size on some benchmarks, with a tiny movement (-0.01%) on the
geomean. Neither intrate nor fprate show any change in performance.

This patch makes the following changes.

  • For the AArch64 target, enableAdvancedRASplitCost() now returns true.
  • Ensures that -consider-local-interval-cost=false can disable the new behaviour if necessary.

This matrix multiply example:

   $ cat test.c
   long A[8][8];
   long B[8][8];
   long C[8][8];

   void run_test() {
     for (int k = 0; k < 8; k++) {
       for (int i = 0; i < 8; i++) {
         for (int j = 0; j < 8; j++) {
           C[i][j] += A[i][k] * B[k][j];
         }
       }
     }
   }

results in the following generated code on AArch64:

$ clang --target=aarch64-arm-none-eabi -O3 -S test.c -o -
[...]
                                      // %for.cond1.preheader
                                      // =>This Inner Loop Header: Depth=1
      add     x14, x11, x9
      str     q0, [sp, #16]           // 16-byte Folded Spill
      ldr     q0, [x14]
      mov     v2.16b, v15.16b
      mov     v15.16b, v14.16b
      mov     v14.16b, v13.16b
      mov     v13.16b, v12.16b
      mov     v12.16b, v11.16b
      mov     v11.16b, v10.16b
      mov     v10.16b, v9.16b
      mov     v9.16b, v8.16b
      mov     v8.16b, v31.16b
      mov     v31.16b, v30.16b
      mov     v30.16b, v29.16b
      mov     v29.16b, v28.16b
      mov     v28.16b, v27.16b
      mov     v27.16b, v26.16b
      mov     v26.16b, v25.16b
      mov     v25.16b, v24.16b
      mov     v24.16b, v23.16b
      mov     v23.16b, v22.16b
      mov     v22.16b, v21.16b
      mov     v21.16b, v20.16b
      mov     v20.16b, v19.16b
      mov     v19.16b, v18.16b
      mov     v18.16b, v17.16b
      mov     v17.16b, v16.16b
      mov     v16.16b, v7.16b
      mov     v7.16b, v6.16b
      mov     v6.16b, v5.16b
      mov     v5.16b, v4.16b
      mov     v4.16b, v3.16b
      mov     v3.16b, v1.16b
      mov     x12, v0.d[1]
      fmov    x15, d0
      ldp     q1, q0, [x14, #16]
      ldur    x1, [x10, #-256]
      ldur    x2, [x10, #-192]
      add     x9, x9, #64             // =64
      mov     x13, v1.d[1]
      fmov    x16, d1
      ldr     q1, [x14, #48]
      mul     x3, x15, x1
      mov     x14, v0.d[1]
      fmov    x17, d0
      mov     x18, v1.d[1]
      fmov    x0, d1
      mov     v1.16b, v3.16b
      mov     v3.16b, v4.16b
      mov     v4.16b, v5.16b
      mov     v5.16b, v6.16b
      mov     v6.16b, v7.16b
      mov     v7.16b, v16.16b
      mov     v16.16b, v17.16b
      mov     v17.16b, v18.16b
      mov     v18.16b, v19.16b
      mov     v19.16b, v20.16b
      mov     v20.16b, v21.16b
      mov     v21.16b, v22.16b
      mov     v22.16b, v23.16b
      mov     v23.16b, v24.16b
      mov     v24.16b, v25.16b
      mov     v25.16b, v26.16b
      mov     v26.16b, v27.16b
      mov     v27.16b, v28.16b
      mov     v28.16b, v29.16b
      mov     v29.16b, v30.16b
      mov     v30.16b, v31.16b
      mov     v31.16b, v8.16b
      mov     v8.16b, v9.16b
      mov     v9.16b, v10.16b
      mov     v10.16b, v11.16b
      mov     v11.16b, v12.16b
      mov     v12.16b, v13.16b
      mov     v13.16b, v14.16b
      mov     v14.16b, v15.16b
      mov     v15.16b, v2.16b
      ldr     q2, [sp]                // 16-byte Folded Reload
      fmov    d0, x3
      mul     x3, x12, x1
[...]

With -consider-local-interval-cost the same section of code results in the
following:

$ clang --target=aarch64-arm-none-eabi -mllvm -consider-local-interval-cost -O3 -S test.c -o -
[...]
.LBB0_1:                              // %for.cond1.preheader
                                      // =>This Inner Loop Header: Depth=1
      add     x14, x11, x9
      ldp     q0, q1, [x14]
      ldur    x1, [x10, #-256]
      ldur    x2, [x10, #-192]
      add     x9, x9, #64             // =64
      mov     x12, v0.d[1]
      fmov    x15, d0
      mov     x13, v1.d[1]
      fmov    x16, d1
      ldp     q0, q1, [x14, #32]
      mul     x3, x15, x1
      cmp     x9, #512                // =512
      mov     x14, v0.d[1]
      fmov    x17, d0
      fmov    d0, x3
      mul     x3, x12, x1
[...]

Event Timeline

sanwou01 created this revision.Oct 25 2019, 10:45 AM
sanwou01 updated this revision to Diff 226461.Oct 25 2019, 10:56 AM

Commit message formatting fixes

sanwou01 edited the summary of this revision. (Show Details)Oct 25 2019, 10:57 AM

Can you add tests for other targets as well? eg: PowerPC, RISC-V?
And maybe pre-commit the testcases, so that it is easier to see the diff.

@ZhangKang Can you help to do size and performance evaluation on SPEC2017 on PowerPC. Thanks.

+1 here. It fixes an issue that has already been found on X86, and we've now found a case of it on AArch64. It might well come up on other architectures; it's just impossible to tell where. Turning this on for all architectures sounds like the most sensible way forward. If X86 is accepting the compile-time changes (which don't seem too big), I believe they will be acceptable elsewhere too.

The fact that no other tests have changed is a good sign. I do think this can cause a bit of performance noise from the register allocation changing. In the testing I ran it was just positives and negatives in different places.

Because this is target independent though, it's probably worth sending a message over to llvm-dev explaining the problem and that we think it's best to turn this on for all backends. Give anyone interested a chance to raise concerns.

lkail added a subscriber: lkail.Oct 27 2019, 7:03 PM

Can you add tests for other targets as well? eg: PowerPC, RISC-V?
And maybe pre-commit the testcases, so that it is easier to see the diff.

@ZhangKang Can you help to do size and performance evaluation on SPEC2017 on PowerPC. Thanks.

Ok, I will run the patch on SPEC2017 for PowerPC soon.

sanwou01 updated this revision to Diff 226654.Oct 28 2019, 6:35 AM

Rebased on top of test case precommit.

sanwou01 updated this revision to Diff 226657.Oct 28 2019, 6:44 AM

Try again

@jsji I have now committed the test case with -consider-local-interval-cost disabled, so the impact should be a bit more obvious now.

As Dave points out, it is practically impossible to predict where this behaviour would come up in other architectures, so finding test cases for it is not straightforward. As much as I'd like to add tests for the other platforms, I don't have any, nor a good way to find them!

@ZhangKang I'm looking forward to seeing your findings on PowerPC, thanks for running those!

This comment was removed by ZhangKang.

Hmm. That's interesting. I'm guessing the flags you use between base and peak are quite different? The only consistent result was exchange2_r, which was getting better! How noisy are these results? What kind of confidence interval do you have?

It may be simpler to make this AArch64 only instead, same as the X86 backend previously had. We know that there it fixes the issue with repeatedly spilling registers, and the performance is otherwise not largely affected.

jsji added a comment.Oct 31 2019, 7:17 AM

For the base test, this patch does not perform well on PPC.

Thanks @ZhangKang for the testing results.

I'm guessing the flags you use between base and peak are quite different?

BASE and PEAK options are similar to those used in SPEC publications, so yes, they are quite different.

How noisy are these results? What kind of a confidence interval do you have?

@ZhangKang, please double-check the results.
(I believe the results should be quite reliable if you are using our standard perf testing harness, which runs each benchmark 3 times on the quiet performance machine.)

It may be simpler to make this AArch64 only instead,

I am not sure about other targets like RISC-V, but if the above result on PowerPC is reliable,
then yes, I would prefer we don't enable it by default for PowerPC for now.
We need to look into the details of those big degradations first and fix the problems, if any; then we can consider enabling it by default.

Hmm. That's interesting. I'm guessing the flags you use between base and peak are quite different? The only consistent result was exchange2_r, which was getting better! How noisy are these results? What kind of confidence interval do you have?

It may be simpler to make this AArch64 only instead, same as the X86 backend previously had. We know that there it fixes the issue with repeatedly spilling registers, and the performance is otherwise not largely affected.

@jsji @dmgreen, I have checked the results on PPC. My old test results used the wrong driver for some cases.
For the SPEC base test, 3 cases regressed by more than 1%, the largest by -4.55%. Only one case improved by more than 1% (2.27%).
For the SPEC peak test, 3 cases regressed by more than 1%, the largest by -3.48%. No case improved by more than 1%.

This patch has a negative impact on the SPEC fp tests: the geomean of the fp base test regressed by 0.69%, and the geomean of the fp peak test regressed by 0.43%.

So I would prefer we don't enable it by default for PowerPC.

sanwou01 updated this revision to Diff 228240.Nov 7 2019, 7:08 AM
sanwou01 retitled this revision from [RAGreedy] Enable -consider-local-interval-cost by default to [RAGreedy] Enable -consider-local-interval-cost for AArch64.
sanwou01 edited the summary of this revision. (Show Details)
sanwou01 removed subscribers: lkail, ppc-slack, wuzish and 2 others.

Enable -consider-local-interval-cost for AArch64 only instead of all targets.

sanwou01 added subscribers: wuzish, jsji, ZhangKang.
dmgreen accepted this revision.Nov 7 2019, 8:40 AM

It's a shame about the PPC changes. It's surprising that a change like this would cause those regressions on something as large as SPEC.

Doing this just for AArch64 sounds good though. We have a lot of registers, so this gets very expensive when we hit it, and the perf changes there don't look large at all.

LGTM.

This revision is now accepted and ready to land.Nov 7 2019, 8:40 AM
jsji added a comment.Nov 7 2019, 10:16 AM

It's a shame about the PPC changes. It's surprising that a change like this would cause those regressions on something as large as SPEC.

Yeah, we should have a look later to understand why this is causing degradations on PowerPC. @ZhangKang

This revision was automatically updated to reflect the committed changes.