This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Favor pre-increments and implement TTI::getPreferredAddressingMode
AbandonedPublic

Authored by SjoerdMeijer on Oct 19 2020, 6:06 AM.

Details

Summary

We are missing an opportunity to generate more efficient addressing modes, pre-indexed loads/stores, which avoid incrementing pointers with separate add instructions. This implements the target hook getPreferredAddressingMode for AArch64; it is queried by LoopStrengthReduce, which can then bring loops into a form that is better suited to generating pre-increments.
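
For orientation, a minimal sketch of the mechanism being described, assuming the TTI::AddressingModeKind enum that the getPreferredAddressingMode interface introduces (the names come from that interface, not from this diff, and the query is simplified compared to the real code in LoopStrengthReduce.cpp):

// LoopStrengthReduce asks the target for its addressing-mode preference and
// biases its solutions towards formulae that fold the pointer update into
// the load/store itself.
TTI::AddressingModeKind AMK = TTI.getPreferredAddressingMode(L, &SE);
bool FavorPreInc = (AMK == TTI::AMK_PreIndexed);
bool FavorPostInc = (AMK == TTI::AMK_PostIndexed);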

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Oct 19 2020, 6:06 AM
SjoerdMeijer requested review of this revision.Oct 19 2020, 6:06 AM
samparker added a comment.EditedOct 19 2020, 7:14 AM

How about results from the LLVM test suite? This should benefit size and performance?

So far as we have used it, this option means "aggressively optimize for postincs". That is useful for MVE, where not using postincs in loops can have a significant effect on performance. I'm less sure about AArch64, but it might make sense. The option tends to make loops use more registers, which might be OK on an architecture with a lot of registers.

Which improves one benchmark by 1.2%, and didn't show any changes in another that I didn't expect to be impacted anyway (just checking as a sanity check).

Does this mean you've run 2 benchmarks? What does SPEC or the llvm test suite say?

Cheers guys, that's fair, will give the llvm test suite a try too.

SjoerdMeijer added a comment.EditedOct 20 2020, 12:41 AM

I've run CTMark, filtered out the tests that run only very briefly with --filter-short, and I see the same trend, i.e. no regressions and some okay improvements:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        before after  diff 
 test-suite :: CTMark/lencod/lencod.test         8.57   8.58  0.0%
 test-suite...Mark/mafft/pairlocalalign.test    56.49  56.46 -0.0%
 test-suite...TMark/7zip/7zip-benchmark.test    10.23  10.15 -0.7%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.93   3.90 -0.8%
 test-suite :: CTMark/Bullet/bullet.test        10.17  10.01 -1.6%
 test-suite :: CTMark/SPASS/SPASS.test          26.04  25.09 -3.7%
 Geomean difference                                          -1.1%
          before      after      diff
count  6.000000   6.000000   6.000000
mean   19.238417  19.031417 -0.011363
std    19.723614  19.679435  0.013708
min    3.933800   3.901200  -0.036644
25%    8.971825   8.934025  -0.013844
50%    10.198600  10.080500 -0.007888
75%    22.087150  21.352350 -0.002162
max    56.486600  56.464800  0.000327

How many runs was that? How much noise is there?

It's encouraging at least. Can you run SPEC too? Tamar has a good low-noise system. This isn't a small change to be taken lightly, even if the patch is small in number of lines changed.

Just curious if there's something in particular you are concerned about? I am just asking because then I can focus on that.

I have rerun the experiment and, instead of 5 runs, I am now comparing 10 runs, taking the minimum runtime of each run, like this:

../test-suite/utils/compare.py --filter-short before.1.json before.2.json before.3.json before.4.json before.5.json before.6.json before.7.json before.8.json before.9.json before.10.json vs after.1.json after.2.json after.3.json after.4.json after.5.json after.6.json after.7.json after.8.json after.9.json after.10.json

This gives:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        lhs    rhs    diff 
 test-suite...Mark/mafft/pairlocalalign.test    56.41  56.48  0.1%
 test-suite :: CTMark/Bullet/bullet.test         9.99   9.98 -0.0%
 test-suite :: CTMark/SPASS/SPASS.test          25.11  25.08 -0.1%
 test-suite :: CTMark/lencod/lencod.test         8.58   8.57 -0.2%
 test-suite...TMark/7zip/7zip-benchmark.test    10.16  10.11 -0.4%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.86   3.84 -0.5%
 Geomean difference                                          -0.2%
             lhs        rhs      diff
count  6.000000   6.000000   6.000000
mean   19.016133  19.009183 -0.001906
std    19.665927  19.699874  0.002340
min    3.858400   3.839700  -0.004847
25%    8.933775   8.919975  -0.003747
50%    10.071450  10.047500 -0.001556
75%    21.368850  21.336775 -0.000573
max    56.406300  56.476400  0.001243

Ah, wait a minute, I am doing the last experiment again, just double checking that I haven't made a mistake running them.

Okay, so my last results basically show the noise: I had forgotten to do a rebuild between the runs. Now the results look less convincing... looking into it.


FWIW the CTmark subset runs for quite a short time on beefier cores, so it might be good to also do some SPEC runs.

Yep, cheers, hopefully SPEC is better and more conclusive. The 1.2% uplift in one benchmark was on baremetal aarch64, will check if I can run some more things on that too.

SPECInt numbers:

500.perlbench_r	-0.36%
502.gcc_r	 0.25%
505.mcf_r	-0.23%
520.omnetpp_r	-0.16%
523.xalancbmk_r	-0.48%
525.x264_r	 1.38%
531.deepsjeng_r	-0.20%
541.leela_r	-0.11%
548.exchange2_r	 0.01%
557.xz_r	-0.03%

These are runtime reductions in percent, so a negative number is good and a positive number is a regression.
Overall, a small win, and it does what I want, except that 525.x264 is a bit of a negative outlier that makes things less rosy and needs looking into.
And I also need to look at what Sam has cooked up in D89894...

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the burst compiler and they were about the same - some small improvements, a couple of small losses but overall OK. That gives us confidence that big out of order cores are not going to hate this.

The original tests were on an in-order core I believe? Which from the optimization guide looks like it should be sensible to use. And the option doesn't seem to be messing anything up especially.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.

llvm/test/CodeGen/AArch64/shrink-wrapping-vla.ll
91 (On Diff #299030)

How is this test changing? Just the initial operand for the add?

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the burst compiler and they were about the same - some small improvements, a couple of small losses but overall OK. That gives us confidence that big out of order cores are not going to hate this.

The original tests were on an in-order core I believe? Which from the optimization guide looks like it should be sensible to use. And the option doesn't seem to be messing anything up especially.

I analysed x264 and couldn't find any concerning codegen changes in the top 6 hottest functions in the profile. Then I did more runs and concluded I must have been looking at noise (again), as I don't see that 1.38% regression anymore; it is more like 0.5% if it happens at all. Overall, my conclusion is in line with yours: worst case, this change is neutral on bigger cores, but probably a small gain, and in my first experiment on an in-order core I indeed see decent speed-ups. Intuitively this makes sense: the bigger out-of-order cores can probably deal better with inefficient code, while the smaller ones are more sensitive to it, so the optimisation has more effect there. While looking at x264, I did observe this for some cases:

The option tends to make loops use more registers,

I saw some more registers being used in preheaders to set up pointers, but that didn't seem to affect the loop itself.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.

Cheers, will look at this now.

SjoerdMeijer added a comment.EditedFeb 12 2021, 2:30 AM

I wanted to abandon this change in favour of D89894. The reason is that D89894 works for unrolled loops, and this change doesn't. But D89894 doesn't really work when things haven't been unrolled, so I think there's actually value in having them both, and I will progress that.

This means shouldFavorPostInc() would need to decide whether it is working on unrolled loops or not, but the current interface doesn't allow that:

/// \return True if LSR should make efforts to create/preserve post-inc
/// addressing mode expressions.
bool shouldFavorPostInc() const;
/// Return true if LSR should make efforts to generate indexed addressing
/// modes that operate across loop iterations.
bool shouldFavorBackedgeIndex(const Loop *L) const;

I will first make shouldFavorPostInc consistent with shouldFavorBackedgeIndex to accept Loop *L as an argument, so that we can look at its induction variable and step size.
After that interface change, I will then continue here to implement that.

Sounds good. How about unifying it to one call to get the type of preferred indexing: none, pre and post?


That would be best, agreed, as it's quite incomprehensible at the moment what things are and how they interact.

I will see if I can first refactor this into 1 call.
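
For reference, one way the unified call could look; this is a sketch based on the discussion above, with enumerator names that are assumptions mirroring the semantics of the existing shouldFavorPostInc/shouldFavorBackedgeIndex hooks:

/// Kinds of loop addressing-mode preference a target can express, replacing
/// the two separate shouldFavor* hooks.
enum AddressingModeKind {
  AMK_PreIndexed,  // prefer pre-increment (roughly shouldFavorBackedgeIndex)
  AMK_PostIndexed, // prefer post-increment (roughly shouldFavorPostInc)
  AMK_None         // no preference
};

/// \return the preferred addressing mode for loop \p L, so that LSR can
/// shape the loop accordingly.
AddressingModeKind getPreferredAddressingMode(const Loop *L,
                                              ScalarEvolution *SE) const;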

Ideally LSR would account for the pre/postinc in its cost modelling and automatically pick the correct one. The shouldFavorPostInc would work on top of that to more consistently produce postinc in cases like MVE where they are so beneficial. But until that happens this sounds like a fine plan. And we can probably use the same thing to do something better under MVE, where not all loops are vector loops, after all.

If it is doing any real amount of processing on the loop, it will need to be calling shouldFavorPostInc less, caching the result somewhere for the loop.

SjoerdMeijer updated this revision to Diff 324940.EditedFeb 19 2021, 3:59 AM
SjoerdMeijer retitled this revision from [AArch64] Favor post-increments to [AArch64] Favor post-increments and implement TTI::getPreferredAddressingMode.
SjoerdMeijer edited the summary of this revision. (Show Details)

This is now "rebased" on the changes that introduced getPreferredAddressingMode, see also the updated title/description of this change.

I have rerun the numbers and not much has fundamentally changed: this is a good thing to do. The only new insight is that if we also start doing runtime loop unrolling, the picture changes: in that case, pre-indexed addressing modes are better. But with getPreferredAddressingMode now in place, we can easily adapt to that when we start doing that (see also my comment in the implementation of getPreferredAddressingMode), which is what we will be looking into next.

fhahn added inline comments.Feb 19 2021, 8:16 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

What if the loop has multiple induction phis and one of those has a large constant step or unknown step? Should we instead iterate over all phis and check all induction phis (using InductionDescriptor::isInductionPHI to identify them)?
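
For illustration, a sketch of the iteration being suggested here. The InductionDescriptor API calls are real; the helper name and the step-size threshold are assumptions, not part of this patch:

#include "llvm/Analysis/IVDescriptors.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical helper: return false if any induction phi in the loop header
// has an unknown step or a constant step larger than an illustrative
// threshold, otherwise true.
static bool allInductionStepsAreSmall(const Loop *L, ScalarEvolution *SE) {
  for (PHINode &Phi : L->getHeader()->phis()) {
    InductionDescriptor ID;
    if (!InductionDescriptor::isInductionPHI(&Phi, L, SE, ID))
      continue; // not an induction phi
    const ConstantInt *Step = ID.getConstIntStepValue();
    if (!Step || Step->getValue().abs().ugt(8)) // threshold is an assumption
      return false; // unknown or large constant step
  }
  return true;
}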

SjoerdMeijer added inline comments.Feb 19 2021, 8:46 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

Yeah, I thought about that a bit. I don't think we are interested in all induction phis here. I think we are interested in what is called the PrimaryInduction in the loop vectoriser, i.e. the one that actually controls the loop. And this seems to match exactly with what getInductionVariable() promises to return, which is used by getInductionDescriptor. That's why this looked okay to me...

fhahn added inline comments.Feb 19 2021, 9:04 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

I am not sure I completely understand why the IV that controls the loop is special when it comes to picking the addressing mode for the loop? As D97050 indicates, getInductionVariable is quite brittle and probably misses additional cases, so if we can avoid using it the code should be more robust.

If we have multiple IVs, would we not be interested in whether we can use post-increments for all of the ones that access memory? You could have a loop with 2 IVs, one to control the loop and one to access memory, like below. If we use the offset of the IV controlling the loop, post-index is profitable, but there won't be any accesses for that variable. (In this example, the inductions can probably be simplified to a single one, but it keeps things simple.)

int I = 0, J = 0;

while (I != N) {
  Ptr[J] = 0;
  I++;
  J += 2000;
}
SjoerdMeijer added inline comments.Feb 19 2021, 9:41 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

Ah, okay, I misunderstood, but I understand your point now! So your suggestion is about whether we need a more fine-grained (precise) heuristic.

I would need to give this some more thought, but the current heuristic is based on the distinction between unrolled and not-unrolled loops (which is why I use the primary IV), and that simple heuristic seems to work. In general, I think this is quite a difficult problem, given that there are a few addressing modes and potentially quite a few inductions to analyse. This will come at a cost, and it's unclear at this point whether it will improve results.

I have verified that the implementation in D89894 works (for runtime-unrolled loops), but it indeed reveals an inefficiency (missed opportunity) in the load/store optimiser, as noted there, which means we can't use that yet for enabling pre-indexed accesses. But perhaps I can use that heuristic, which does a bit more analysis, to decide when *not* to generate pre-indexed but post-indexed accesses.

But this slightly improved heuristic in D89894 may still not be precise enough for your liking... I will try to experiment a little bit with that, but at the moment I tend to think that this is a step forward, and this could be improved when we find the need for that?

This abandons the idea of looking at the IV, and incorporates D89894 to look at the pointer uses.

dmgreen added inline comments.Feb 24 2021, 3:15 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1293

I'm not sure I understand this logic. Won't it detect many things that are not unrolled loops? Why does it start at 0?

I think it's worth adding a few deliberate tests for the different cases this is trying to optimize: some for simple loops, some for more complex unrolled loops, and some for vectorized/interleaved loops, showing how the codegen changes now that we change this default option.

llvm/test/CodeGen/AArch64/vldn_shuffle.ll
2

I don't think we should be adding this to unrelated tests if we can help it. For tests like this that are testing something else, it's OK to just check the default.

Same for all the other tests. From what I can tell they seem the same or better?

Thanks for commenting Dave. I will have another go at this, and try to come up with a better analysis, at least one we understand.

SjoerdMeijer retitled this revision from [AArch64] Favor post-increments and implement TTI::getPreferredAddressingMode to [AArch64] Favor pre-increments and implement TTI::getPreferredAddressingMode.
SjoerdMeijer edited the summary of this revision. (Show Details)
SjoerdMeijer added reviewers: stelios-arm, NickGuy.

This is changing our approach to preferring pre-indexed addressing modes, because:

  • this is what we want for runtime-unrolled loops, which is what we want to address next,
  • post-indexed is currently better for some cases, but that's mainly because we miss an opportunity in the load/store optimiser. With that fixed, the expectation is that pre-indexed gives the same or better performance than post-indexed,
  • for what it is worth, pre-indexed is also the default for ARM,

And as a consequence of the above, this implementation becomes really straightforward, which is another benefit.
But again, we can't commit this yet, because it depends on changes in the load/store optimiser which we will address first.
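
Under those assumptions, the straightforward implementation would look roughly like the sketch below (assuming the TTI::AddressingModeKind enum; this is not the exact diff):

// Sketch of the AArch64 override: simply state a preference for pre-indexed
// addressing; LSR then tries to keep the loop in a shape where the pointer
// update can be folded into the load/store.
TTI::AddressingModeKind
AArch64TTIImpl::getPreferredAddressingMode(const Loop *L,
                                           ScalarEvolution *SE) const {
  return TTI::AMK_PreIndexed;
}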

Matt added a subscriber: Matt.Apr 13 2021, 7:21 AM
SjoerdMeijer abandoned this revision.Mar 17 2023, 1:40 AM
Herald added a project: Restricted Project. Mar 17 2023, 1:40 AM
Herald added a subscriber: StephenFan.