
[AArch64] Favor post-increments
Needs Review · Public

Authored by SjoerdMeijer on Oct 19 2020, 6:06 AM.

Details

Summary

This implements the target hook shouldFavorPostInc for AArch64. The hook is queried by LoopStrengthReduce, which can then rewrite loops into a form better suited for generating post-increments.
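
For illustration, a minimal sketch of what the override could look like, assuming the usual TargetTransformInfo override pattern (the actual patch may differ, for instance by gating this on the subtarget):

    // Hypothetical sketch, not the actual patch.
    // llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
    class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
      // ... existing members ...
    public:
      // Queried by LoopStrengthReduce; returning true asks LSR to prefer
      // loop forms the backend can lower to post-incremented addressing.
      bool shouldFavorPostInc() const { return true; }
    };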

Motivating examples are similar to test/CodeGen/AArch64/arm64-scaled_iv.ll. I have locally changed that test from an opt test to an llc test, and the diff here is against this local change. Using that example, it is more efficient to transform this:

LBB0_1:
    ldr     d0, [x1, x8]
    ldr     d1, [x10, x8]
    fmul    d0, d0, d1
    str     d0, [x9, x8]
    add     x8, x8, #8
    cmp     w8, #152
    b.ne    LBB0_1

into this:

LBB0_1:
        ldr     d0, [x1], #8
        ldr     d1, [x9], #8
        subs    w10, w10, #1
        fmul    d0, d0, d1
        str     d0, [x8], #8
        b.ne    LBB0_1
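
For reference, the motivating loop corresponds roughly to the C-style sketch below (the real test is LLVM IR in arm64-scaled_iv.ll; the function name and the trip count of 19, i.e. 152 / 8 bytes, are only illustrative):

    // Illustrative source, not the actual test.
    void mulDouble(double *a, double *b, double *c) {
      for (int i = 0; i < 19; ++i)
        a[i] = b[i] * c[i]; // each access becomes an ldr/str ..., [xN], #8
    }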

This improves one benchmark by 1.2%, and showed no changes in another benchmark that I did not expect to be impacted (included as a sanity check).

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Oct 19 2020, 6:06 AM
SjoerdMeijer requested review of this revision. Oct 19 2020, 6:06 AM
samparker added a comment (edited). Oct 19 2020, 7:14 AM

How about results from the LLVM test suite? Presumably this should benefit both size and performance?

So far as we have used it, this option means "aggressively optimize for post-increments". That is useful for MVE, where not using post-increments in loops can have a significant effect on performance. For AArch64 I'm less sure, but it might make sense. The option tends to make loops use more registers, which might be OK on an architecture with a lot of registers.

This improves one benchmark by 1.2%, and showed no changes in another benchmark that I did not expect to be impacted (included as a sanity check).

Does this mean you've run two benchmarks? What does SPEC or the LLVM test suite say?

Cheers guys, that's fair; I will give the LLVM test suite a try too.

SjoerdMeijer added a comment (edited). Oct 20 2020, 12:41 AM

I've run CTMark and filtered out the tests that run very briefly using --filter-short, and I see the same trend, i.e. no regressions and some okay improvements:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        before after  diff 
 test-suite :: CTMark/lencod/lencod.test         8.57   8.58  0.0%
 test-suite...Mark/mafft/pairlocalalign.test    56.49  56.46 -0.0%
 test-suite...TMark/7zip/7zip-benchmark.test    10.23  10.15 -0.7%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.93   3.90 -0.8%
 test-suite :: CTMark/Bullet/bullet.test        10.17  10.01 -1.6%
 test-suite :: CTMark/SPASS/SPASS.test          26.04  25.09 -3.7%
 Geomean difference                                          -1.1%
          before      after      diff
count  6.000000   6.000000   6.000000
mean   19.238417  19.031417 -0.011363
std    19.723614  19.679435  0.013708
min    3.933800   3.901200  -0.036644
25%    8.971825   8.934025  -0.013844
50%    10.198600  10.080500 -0.007888
75%    22.087150  21.352350 -0.002162
max    56.486600  56.464800  0.000327

How many runs was that? How much noise is there?

It's encouraging at least. Can you run SPEC too? Tamar has a good low-noise system. This isn't a small change to be taken lightly, even if the patch is small in number of lines changed.

Just curious: is there something in particular you are concerned about? I'm asking so that I can focus on that.

I have rerun the experiment; instead of 5 runs, I am now comparing 10 runs, taking the minimum runtime of each run, in this way:

../test-suite/utils/compare.py --filter-short before.1.json before.2.json before.3.json before.4.json before.5.json before.6.json before.7.json before.8.json before.9.json before.10.json vs after.1.json after.2.json after.3.json after.4.json after.5.json after.6.json after.7.json after.8.json after.9.json after.10.json

This gives:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        lhs    rhs    diff 
 test-suite...Mark/mafft/pairlocalalign.test    56.41  56.48  0.1%
 test-suite :: CTMark/Bullet/bullet.test         9.99   9.98 -0.0%
 test-suite :: CTMark/SPASS/SPASS.test          25.11  25.08 -0.1%
 test-suite :: CTMark/lencod/lencod.test         8.58   8.57 -0.2%
 test-suite...TMark/7zip/7zip-benchmark.test    10.16  10.11 -0.4%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.86   3.84 -0.5%
 Geomean difference                                          -0.2%
             lhs        rhs      diff
count  6.000000   6.000000   6.000000
mean   19.016133  19.009183 -0.001906
std    19.665927  19.699874  0.002340
min    3.858400   3.839700  -0.004847
25%    8.933775   8.919975  -0.003747
50%    10.071450  10.047500 -0.001556
75%    21.368850  21.336775 -0.000573
max    56.406300  56.476400  0.001243

Ah, wait a minute, I am running the last experiment again, just double-checking that I haven't made a mistake running them.

Okay, so my last results basically show the noise, as I had forgotten to do a rebuild between the runs. Now the results look less convincing... looking into it.

Okay, so my last results basically show the noise, as I had forgotten to do a rebuild between the runs. Now the results look less convincing... looking into it.

FWIW, the CTMark subset runs for quite a short time on beefier cores, so it might be good to also do some SPEC runs.

Yep, cheers, hopefully SPEC is better and more conclusive. The 1.2% uplift in one benchmark was on bare-metal AArch64; I will check if I can run some more things on that too.

SPECInt numbers:

500.perlbench_r	-0.36%
502.gcc_r	 0.25%
505.mcf_r	-0.23%
520.omnetpp_r	-0.16%
523.xalancbmk_r	-0.48%
525.x264_r	 1.38%
531.deepsjeng_r	-0.20%
541.leela_r	-0.11%
548.exchange2_r	 0.01%
557.xz_r	-0.03%

These are runtime reductions in percent, so a negative number is good and a positive number is a regression.
Overall a small win that does what I want, except that 525.x264_r is a bit of a negative outlier that makes things less rosy and needs looking into.
I also need to look at what Sam has cooked up in D89894...

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the Burst compiler and they were about the same: some small improvements, a couple of small losses, but overall OK. That gives us confidence that big out-of-order cores are not going to hate this.

The original tests were on an in-order core, I believe? From the optimization guide it looks like post-increments should be sensible to use there. And the option doesn't seem to be messing anything up especially.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.
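
For example, a hypothetical loop like the one below (names are illustrative, not from the patch) would be expected to vectorize and exercise 128-bit post-incremented loads and stores:

    // Hypothetical example of a loop that should produce vector
    // post-increments such as ldr q0, [x1], #16 and str q0, [x0], #16.
    void scale(float *__restrict dst, const float *__restrict src, float k) {
      for (int i = 0; i < 1024; ++i)
        dst[i] = src[i] * k;
    }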

llvm/test/CodeGen/AArch64/shrink-wrapping-vla.ll, line 91

How is this test changing? Just the initial operand for the add?

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the Burst compiler and they were about the same: some small improvements, a couple of small losses, but overall OK. That gives us confidence that big out-of-order cores are not going to hate this.

The original tests were on an in-order core, I believe? From the optimization guide it looks like post-increments should be sensible to use there. And the option doesn't seem to be messing anything up especially.

I analysed x264 and couldn't find any concerning codegen changes in the top 6 hottest functions in the profile. Then I did more runs and concluded I must have been looking at noise (again), as I don't see that 1.38% regression anymore; it is more like 0.5% if it happens at all. Overall, my conclusion is in line with yours: worst case, this change is neutral on bigger cores, but probably a small gain, and indeed in my first experiment on an in-order core I see decent speed-ups. Intuitively this makes sense: the bigger out-of-order cores can probably deal better with inefficient code, while the smaller ones are more sensitive to it, so the optimisation has more effect there. While looking at x264, I did observe this for some cases:

The option tends to make loops use more registers,

I saw some more registers being used in preheaders to set up pointers, but that didn't seem to affect the loops themselves.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.

Cheers, will look at this now.