We are missing an opportunity to generate a more efficient addressing mode: pre-indexed loads/stores, which avoid incrementing pointers with separate add instructions. This implements the target hook getPreferredAddressingMode for AArch64, which is queried by LoopStrengthReduce and can bring loops into a better form for generating pre-increments.
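To illustrate the difference, consider a simple loop walking an array (a hypothetical example; the AArch64 forms in the comments are the generic addressing variants, not necessarily the exact code this patch produces):

```cpp
// Hypothetical example: walking an array of 32-bit integers.
int sum(const int *p, const int *end) {
  int s = 0;
  while (p != end) {
    s += *p; // base+offset:  ldr w8, [x0]      plus a separate add x0, x0, #4
             // post-indexed: ldr w8, [x0], #4  (load, then x0 += 4)
             // pre-indexed:  ldr w8, [x0, #4]! (x0 += 4, then load)
    ++p;
  }
  return s;
}
```

With the indexed forms the pointer update is folded into the memory operation, saving the separate add on every iteration.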
How about results from the LLVM test suite? This should benefit size and performance?
So far as we have used it, this option means "aggressively optimise for postincs". That is useful for MVE, where not using postincs in loops can have a significant effect on performance. For AArch64 I'm less sure, but it might make sense. The option tends to make loops use more registers, which might be OK on an architecture with a lot of registers.
This improves one benchmark by 1.2%, and shows no change in another benchmark that I didn't expect to be impacted (I only checked it as a sanity check).
Does this mean you've run 2 benchmarks? What does SPEC or the llvm test suite say?
I've run CTMark, filtered out the tests that run very briefly with --filter-short, and see the same trend, i.e. no regressions and some decent improvements:
    Tests: 10
    Short Running: 4 (filtered out)
    Remaining: 6
    Metric: exec_time

    Program                                       before    after    diff
    test-suite :: CTMark/lencod/lencod.test         8.57     8.58     0.0%
    test-suite...Mark/mafft/pairlocalalign.test    56.49    56.46    -0.0%
    test-suite...TMark/7zip/7zip-benchmark.test    10.23    10.15    -0.7%
    test-suite...:: CTMark/sqlite3/sqlite3.test     3.93     3.90    -0.8%
    test-suite :: CTMark/Bullet/bullet.test        10.17    10.01    -1.6%
    test-suite :: CTMark/SPASS/SPASS.test          26.04    25.09    -3.7%
    Geomean difference                                               -1.1%

               before      after       diff
    count    6.000000   6.000000   6.000000
    mean    19.238417  19.031417  -0.011363
    std     19.723614  19.679435   0.013708
    min      3.933800   3.901200  -0.036644
    25%      8.971825   8.934025  -0.013844
    50%     10.198600  10.080500  -0.007888
    75%     22.087150  21.352350  -0.002162
    max     56.486600  56.464800   0.000327
How many runs was that? How much noise is there?
It's encouraging at least. Can you run SPEC too? Tamar has a good low-noise system. This isn't a small change to be taken lightly, even if the patch is small in number of lines changed.
Just curious: is there something in particular you are concerned about? I'm asking so that I can focus on that.
I have rerun the experiment, and instead of 5 runs I am now comparing 10 runs, taking the minimum runtime of each test across the runs, like this:
    ../test-suite/utils/compare.py --filter-short before.1.json before.2.json before.3.json before.4.json before.5.json before.6.json before.7.json before.8.json before.9.json before.10.json vs after.1.json after.2.json after.3.json after.4.json after.5.json after.6.json after.7.json after.8.json after.9.json after.10.json
This gives:
    Tests: 10
    Short Running: 4 (filtered out)
    Remaining: 6
    Metric: exec_time

    Program                                          lhs      rhs    diff
    test-suite...Mark/mafft/pairlocalalign.test    56.41    56.48    0.1%
    test-suite :: CTMark/Bullet/bullet.test         9.99     9.98   -0.0%
    test-suite :: CTMark/SPASS/SPASS.test          25.11    25.08   -0.1%
    test-suite :: CTMark/lencod/lencod.test         8.58     8.57   -0.2%
    test-suite...TMark/7zip/7zip-benchmark.test    10.16    10.11   -0.4%
    test-suite...:: CTMark/sqlite3/sqlite3.test     3.86     3.84   -0.5%
    Geomean difference                                              -0.2%

                  lhs        rhs       diff
    count    6.000000   6.000000   6.000000
    mean    19.016133  19.009183  -0.001906
    std     19.665927  19.699874   0.002340
    min      3.858400   3.839700  -0.004847
    25%      8.933775   8.919975  -0.003747
    50%     10.071450  10.047500  -0.001556
    75%     21.368850  21.336775  -0.000573
    max     56.406300  56.476400   0.001243
Ah, wait a minute, I am redoing the last experiment, just double checking that I haven't made a mistake running it.
Okay, so my last results basically show the noise, as I had forgotten to do a rebuild between the runs. The results now look less convincing... looking into it.
FWIW the CTMark subset runs for quite a short time on beefier cores, so it might be good to also do some SPEC runs.
Yep, cheers, hopefully SPEC is better and more conclusive. The 1.2% uplift in one benchmark was on bare-metal AArch64; I will check if I can run some more things on that too.
SPECInt numbers:
    500.perlbench_r   -0.36%
    502.gcc_r          0.25%
    505.mcf_r         -0.23%
    520.omnetpp_r     -0.16%
    523.xalancbmk_r   -0.48%
    525.x264_r         1.38%
    531.deepsjeng_r   -0.20%
    541.leela_r       -0.11%
    548.exchange2_r    0.01%
    557.xz_r          -0.03%
These are runtime reductions in percent, so a negative number is good and a positive number is a regression.
Overall, a small win that does what I want, except that 525.x264_r is a bit of a negative outlier that makes things less rosy and needs looking into.
And I also need to look at what Sam has cooked up in D89894...
Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the burst compiler and they were about the same - some small improvements, a couple of small losses but overall OK. That gives us confidence that big out of order cores are not going to hate this.
The original tests were on an in-order core, I believe? From the optimization guide that looks like a core where this should be sensible to use. And the option doesn't seem to be messing anything up especially.
Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.
llvm/test/CodeGen/AArch64/shrink-wrapping-vla.ll, line 91 (On Diff #299030):
How is this test changing? Just the initial operand for the add?
I analysed x264 and couldn't find any concerning codegen changes in the top 6 hottest functions in the profile. Then I did more runs and concluded I must have been looking at noise (again), as I don't see that 1.38% regression anymore; it is more like 0.5% if it happens at all. Overall, my conclusion is in line with yours: this change is neutral on bigger cores in the worst case, but probably a small gain, and in my first experiment on an in-order core I do see decent speed-ups. Intuitively this makes sense, because the bigger out-of-order cores can probably deal better with inefficient code, while the smaller ones are more sensitive to it, so the optimisation has more effect there. While looking at x264, I did observe this for some cases:
> The option tends to make loops use more registers,
I saw some more registers being used in preheaders to set up pointers, but that didn't seem to affect the loop itself.
> Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.
Cheers, will look at this now.
I wanted to abandon this change in favour of D89894. The reason is that D89894 works for unrolled loops, and this change doesn't. But D89894 doesn't really work when things haven't been unrolled, so I think there's actually value in having them both, and I will progress that.
This means shouldFavorPostInc() would need to decide whether it is working on unrolled loops or not, but the current interface doesn't allow that:
    /// \return True is LSR should make efforts to create/preserve post-inc
    /// addressing mode expressions.
    bool shouldFavorPostInc() const;

    /// Return true if LSR should make efforts to generate indexed addressing
    /// modes that operate across loop iterations.
    bool shouldFavorBackedgeIndex(const Loop *L) const;
I will first make shouldFavorPostInc consistent with shouldFavorBackedgeIndex by accepting a Loop *L argument, so that we can look at its induction variable and step size.
After that interface change, I will continue here and implement this.
Sounds good. How about unifying it to one call to get the type of preferred indexing: none, pre and post?
That would be best, agreed, as it's quite incomprehensible at the moment what things are and how they interact.
I will see if I can first refactor this into one call.
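For reference, a rough sketch of what such a unified hook could look like; the enum values and signature below are illustrative only (in practice this would likely live on TargetTransformInfo), not a final interface:

```cpp
// Sketch of a single TTI query replacing shouldFavorPostInc() and
// shouldFavorBackedgeIndex(); names are illustrative.
enum AddressingModeKind {
  AMK_PreIndexed,  // Prefer pre-indexed addressing.
  AMK_PostIndexed, // Prefer post-indexed addressing.
  AMK_None         // No preference.
};

/// Return the preferred addressing mode that LSR should try to generate
/// for memory accesses in loop \p L.
AddressingModeKind getPreferredAddressingMode(const Loop *L,
                                              ScalarEvolution *SE) const;
```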
Ideally LSR would account for the pre/postinc in its cost modelling and automatically pick the correct one. The shouldFavorPostInc would work on top of that to more consistently produce postinc in cases like MVE where they are so beneficial. But until that happens this sounds like a fine plan. And we can probably use the same thing to do something better under MVE, where not all loops are vector loops, after all.
If it is doing any real amount of processing on the loop, it will need to be calling shouldFavorPostInc less, caching the result somewhere for the loop.
This is now "rebased" on the changes that introduced getPreferredAddressingMode, see also the updated title/description of this change.
I have rerun the numbers and not much has fundamentally changed: this is a good thing to do. The only new insight is that if we also start doing runtime loop unrolling, the picture changes: in that case, pre-indexed addressing modes are better. But with getPreferredAddressingMode now in place, we can easily adapt to that when we start doing that (see also my comment in the implementation of getPreferredAddressingMode), which is what we will be looking into next.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, line 1287:
What if the loop has multiple induction phis and one of those has a large constant step or unknown step? Should we instead iterate over all phis and check all induction phis (using InductionDescriptor::isInductionPHI to identify them)?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, line 1287:
Yeah, I thought about that a bit. I don't think we are interested in all induction phis here. I think we are interested in what is called the PrimaryInduction in the loop vectoriser, i.e. the one that actually controls the loop. And this seems to match exactly with what getInductionVariable() promises to return, which is used by getInductionDescriptor. That's why this looked okay to me...
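For clarity, this is roughly the kind of check I mean, sketched with the Loop and InductionDescriptor helpers (illustrative only, not the exact code in the patch; the step threshold is made up):

```cpp
#include "llvm/Analysis/IVDescriptors.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"

using namespace llvm;

// Illustrative sketch: look only at the IV that controls the loop and check
// whether it steps by a small constant.
static bool primaryIVHasSmallConstantStep(const Loop *L, ScalarEvolution &SE) {
  InductionDescriptor IndDesc;
  // getInductionDescriptor() is based on getInductionVariable(), i.e. the
  // phi that actually controls the loop.
  if (!L->getInductionDescriptor(SE, IndDesc))
    return false;
  const auto *Step = dyn_cast_or_null<SCEVConstant>(IndDesc.getStep());
  return Step && Step->getAPInt().abs().ule(8); // made-up "small step" bound
}
```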
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, line 1287:
I am not sure I completely understand why the IV that controls the loop is special when it comes to picking the addressing mode for the loop. As D97050 indicates, getInductionVariable is quite brittle and probably misses additional cases, so if we can avoid using it the code should be more robust. If we have multiple IVs, would we not be interested in whether we can use post-increments for all of the ones that access memory? You could have a loop with 2 IVs, one to control the loop and one to access memory, like below. If we use the offset of the IV controlling the loop, post-index is profitable, but there won't be any accesses for that variable. (In this example, the inductions can probably be simplified to a single one, but it keeps things simple.)

    int I = 0, J = 0;
    while (I != N) {
      Ptr[J] = 0;
      I++;
      J += 2000;
    }
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, line 1287:
Ah, okay, I misunderstood, but I understand your point now! So your suggestion is for a more fine-grained (precise) heuristic. I would need to give this some more thought, but the current heuristic is based on the distinction between unrolled and not-unrolled loops (which is why I use the primary IV), and that simple heuristic seems to work. In general, I think this is quite a difficult problem, given that there are a few addressing modes and potentially quite a few inductions to analyse. This will come at a cost, and it's unclear at this point whether it will improve results. I have verified that the implementation in D89894 works (for runtime unrolled loops), but it reveals an inefficiency (missed opportunity) in the load/store optimiser, as noted there, which means we can't use that yet for enabling pre-indexed accesses. But perhaps I can use that heuristic, which does a bit more analysis, to decide when *not* to generate pre-indexed but post-indexed accesses. This slightly improved heuristic in D89894 may still not be precise enough for your liking... I will try to experiment a little bit with that, but at the moment I tend to think that this is a step forward, and it could be improved when we find the need for that?
This abandons the idea of looking at the IV, and incorporates D89894 to look at the pointer uses.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp, line 1293:
I'm not sure I understand this logic. Won't it detect many things that are not unrolled loops? Why does it start at 0? I think it's worth adding a few deliberate tests for the different cases this is trying to optimize: some for simple loops, more complex unrolled loops, and vectorized/interleaved loops, showing how the codegen changes now that we change this default option.

llvm/test/CodeGen/AArch64/vldn_shuffle.ll, line 2:
I don't think we should be adding this to unrelated tests if we can help it. For tests like this that are testing something else, it's OK to just check the default. Same for all the other tests. From what I can tell they seem the same or better?
Thanks for commenting, Dave. I will have another go at this and try to come up with a better analysis, at least one we understand.
This changes our approach to preferring pre-indexed addressing modes, because:
- this is what we want for runtime unrolled loops, which is what we want to address next,
- post-indexed is currently better for some cases, but that's mainly because we miss an opportunity in the load/store optimiser; with that fixed, the expectation is that pre-indexed gives the same or better performance than post-indexed,
- for what it is worth, pre-indexed is also the default for ARM.
And as a consequence of the above, this implementation becomes really straightforward, which is another benefit.
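A minimal sketch of what that straightforward implementation could look like, reusing the illustrative AddressingModeKind names from the earlier sketch and assuming the enum lives on TargetTransformInfo; this is an illustration, not the committed code:

```cpp
// Sketch: unconditionally ask LSR to shape loops for pre-indexed accesses.
TTI::AddressingModeKind
AArch64TTIImpl::getPreferredAddressingMode(const Loop *L,
                                           ScalarEvolution *SE) const {
  // No per-loop analysis for now; if runtime unrolling or the load/store
  // optimiser work mentioned above changes the trade-off, this is the place
  // to refine the decision.
  return TTI::AMK_PreIndexed;
}
```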
But again, we can't commit this yet, because it depends on changes in the load/store optimiser which we will address first.