Consider the processor cost model when folding loads and stores to use the pre- or post-indexed addressing modes.
The load/store opt pass is already pretty expensive in terms of compile-time. Did you see any compile-time regressions in your testing? Also, what performance results have you collected?
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 592: +1 to Florian's comment. Also, this comment should remain in optimizeBlock.
- Line 1408: Should be `} else {`.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 653: You could just add SM and STI to the AArch64LoadStoreOpt class and then initialize them in runOnMachineFunction (a sketch of this follows below).
- Line 682: You can drop this else.
- Line 1380: Add a comment: `// Evaluate if the new instruction is a better choice than the old ones.`
- Line 1381: I think you can return NextI here. This will ensure we skip the Update instruction (i.e., the base address add/sub), which is never a candidate for LdStUpdateMerging.
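A minimal sketch of what the suggestion for line 653 could look like; the member names follow the comment, but the types, includes, and initialization details are assumptions rather than the actual patch:

```cpp
#include "AArch64Subtarget.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"

namespace {
// Fragment only: cache the subtarget and its scheduling model once per
// function instead of re-querying them for every merge candidate.
struct AArch64LoadStoreOpt : public llvm::MachineFunctionPass {
  static char ID;
  AArch64LoadStoreOpt() : llvm::MachineFunctionPass(ID) {}

  const llvm::AArch64Subtarget *STI = nullptr; // subtarget info
  const llvm::MCSchedModel *SM = nullptr;      // its scheduling model

  bool runOnMachineFunction(llvm::MachineFunction &MF) override {
    STI = &MF.getSubtarget<llvm::AArch64Subtarget>();
    SM = &STI->getSchedModel();
    // ... existing per-block optimization loop, now using STI and SM ...
    return true;
  }
  // ... the remaining members of the real pass are elided ...
};
char AArch64LoadStoreOpt::ID = 0;
} // end anonymous namespace
```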
The cost model check is done at the very end, when we are merging instructions and after most of the other checks have already passed. Having said that, yes, a compile-time regression test is a good idea. As for performance, A72 and A57 both have a higher latency and a decoder bubble on loads with a pre or post index update, so they will definitely benefit wherever we look at the cost model and decide not to create the LD update. Leaving the LD/ADD pair in the epilogue instead of having this pass create the LD update instruction is a benefit as well.
No objection at a high level. IMHO, it will not cause a significant compile-time regression, but it's better to make sure. I also agree with splitting this patch: one for the profitability check and one for the refactoring.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 649: No need to repeat this check for every candidate.
- Lines 667-669: Can we guarantee that these pointers are always non-null?
There's some code refactoring here that I'll put in a separate patch soon.
Also, since the extra check is only performed for candidate instructions, the impact on compile time is negligible.
The performance difference can be significant, though, especially in FP benchmarks with tight loops over gobs of data.
Thank you for your feedback.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Lines 667-669: Yes, because getSchedClassDesc() returns the address of a static array element.
Please let me know whether I addressed your concerns before I split the code refactoring into another patch.
Thank you,
One minor comment, but feel free to go ahead and split the patch.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 655: Rather than repeatedly calling hasInstrSchedModel(), why not just store the result in a bool, much like you did with OptForSize (sketched below)?
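For reference, a one-liner along the lines of that suggestion; the variable name is made up, and SM is assumed to be a cached pointer to the scheduling model, as in the earlier sketch:

```cpp
// Query the model once per function, next to OptForSize, instead of calling
// hasInstrSchedModel() for every candidate instruction.
const bool HasDetailedSchedModel = SM->hasInstrSchedModel();
```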
Updated the patch to contain just the cost model evaluation, after moving the code refactoring into a separate patch, D40090.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 697: Why don't we use `<=` instead of `<`?
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 697: Because typically decoding an instruction into multiple uops costs more in the front end of the pipeline than decoding multiple instructions into single uops.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 696: Can you add more of a comment on why you do this? Is this target independent?
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 696: I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into more than one uop. How bad the hiccup is, if at all, depends on the target, though. Some decrease the decode bandwidth, typically by inserting a bubble whose size depends on the design. This heuristic is an attempt at mitigating the risk of the new instruction inducing such a bubble. If the new instruction has a shorter latency, then it's chosen; one might wonder whether it's still a good choice if it induces a bubble, but I could not devise a satisfying heuristic for that. If the latency of the new instruction is the same as the combined latency of both old ones, then the potential for inducing a bubble is considered: if either of the old instructions had multiple uops, then even if the new one has them too it's probably no worse than before; however, if neither of the old instructions resulted in multiple uops, the new one is chosen only if it results in fewer uops than before. One might argue that, if bubbles when decoding into multiple uops are the norm among targets, it would be better to choose the new instruction only if it doesn't potentially induce bubbles itself. If the new instruction has a longer latency, then it's discarded; again, if it mitigates decode bubbles it might still be profitable, but the conditions seem hard to weigh in general. (A sketch of this decision procedure follows below.)
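A rough sketch of the decision procedure described above; the function and variable names are illustrative rather than the patch's actual code, and the "old" numbers are assumed to summarize the load/store plus the base-register add/sub it would be merged with:

```cpp
// Illustrative only.  NewLat/NewUops describe the merged pre/post-indexed
// instruction; OldLat/OldUops summarize the original load/store and add/sub.
static bool isMergeProfitable(unsigned NewLat, unsigned OldLat,
                              unsigned NewUops, unsigned OldUops) {
  if (NewLat < OldLat)
    return true;                // Strictly shorter latency: always take it.
  if (NewLat > OldLat)
    return false;               // Strictly longer latency: always reject it.
  // Equal latency: weigh the risk of a decode bubble from multi-uop decoding.
  if (OldUops > 2)              // Either old instruction was already multi-uop,
    return NewUops <= OldUops;  // so the merged form is probably no worse.
  return NewUops < OldUops;     // Both were single-uop: require fewer uops.
}
```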
To me, the heuristic seems pretty reasonable, but I'm not fully sure it can make the right decision target-independently. Someone else with a better idea of the hardware level may need to review this.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 691: Isn't it possible to just do TSM.getNumMicroOps(&MIA) and remove the *SCA = TSM.resolveSchedClass(&MIA) above on line 678?
- Line 707: I'm not sure whether it's okay to compare values from computeInstrLatency() and getNumMicroOps() (the calls in question are summarized in the snippet below).
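For context, these are the TargetSchedModel calls under discussion, with MIA and TSM named as in the inline comments. The two quantities are in different units, which is the concern about line 707: one is a cycle count, the other a uop count.

```cpp
// TSM is a TargetSchedModel; MIA is one of the candidate MachineInstrs.
const MCSchedClassDesc *SCA = TSM.resolveSchedClass(&MIA); // variant resolution
unsigned Uops = TSM.getNumMicroOps(&MIA, SCA); // uop count; getNumMicroOps can
                                               // also resolve the class itself
                                               // if the second argument is null
unsigned Lat = TSM.computeInstrLatency(&MIA);  // def-to-use latency, in cycles
```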
Add a test case for the STREAM benchmark that checks the results for most targets. This way, the target maintainers can evaluate the effects of this patch on the generated code for their respective targets.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 696: It seems like, if you want to take this multi-uop bubble into account, you should make it explicit and make the subtarget opt in to it: if (newLatency < oldLatency) return true; if (newLatency > oldLatency) return false; if (newUops < oldUops) return true; if (newUops > oldUops) return false; if (newMultiUopPenalty < oldMultiUopPenalty) return true; return false; (formatted as a block below).
- Line 707: I think Jun's point was that these values are in completely different units, so comparing them doesn't make sense.
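The ordering suggested in the comment on line 696, laid out as a block; all of the names, including the multi-uop penalty terms, come from that comment and are not defined anywhere in the patch:

```cpp
// Latency first, then uop count, then an explicit, subtarget-opt-in penalty
// for instructions that crack into multiple uops.
if (newLatency < oldLatency) return true;
if (newLatency > oldLatency) return false;
if (newUops < oldUops) return true;
if (newUops > oldUops) return false;
if (newMultiUopPenalty < oldMultiUopPenalty) return true;
return false;
```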
Like the previous update, but favoring code size on A57, the only other target I can test on, while keeping the same performance.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 714: I would suggest that, unless we know of a case where this makes a difference, you make the simpler change of just doing `return UopDif <= 0` at this point, assuming that has the desired effect for your target. This assumes that, with latency and uops being equal, a reduction in instructions could still be beneficial. (A sketch of this simplification follows after these inline comments.)

llvm/test/CodeGen/AArch64/ldst-opt.ll
- Line 1034: Are these changes intentional?
- Line 1343: Are all of these changes needed?

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
- Line 98 (on Diff #127610): Are these changes needed?

llvm/test/CodeGen/AArch64/stream-neon.ll
- Line 1 (on Diff #127610): This seems like it is too large to be a lit test. Is it really testing anything that hasn't been tested already?
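A sketch of the simplification suggested for line 714; UopDif is the name used in that comment and is assumed to be the merged instruction's uop count minus that of the original pair:

```cpp
// Hypothetical tie-breaker, reached only once the latencies are known to be
// equal: the merge always removes one instruction, so accept it whenever it
// does not increase the uop count.
static bool isMergeProfitableOnTie(int UopDif) {
  return UopDif <= 0;
}
```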
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
- Line 714: This would lead the test below, stream-neon.ll, to fail for the Exynos targets. The reason is that, for instance, though loading a pair of quad registers with the post-index addressing mode has the same latency as when using the register-offset addressing mode, it is a two-uop instruction, which severely limits the decode bandwidth by inserting a bubble. This is a fairly common issue in other targets too, which this heuristic tries to capture, assuming that it's better to keep the decode bandwidth for a couple of single-uop instructions than to incur a decode bubble for a multiple-uop instruction. Except, of course, when optimizing for size.

llvm/test/CodeGen/AArch64/ldst-opt.ll
- Line 1034: Yes, though they are not necessary. Checking for the return is not germane to the feature being tested.
- Line 1343: Ditto.

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
- Line 98 (on Diff #127610): Yes, since the changes in this patch affect the resulting code for this test, unless it's optimized for size.

llvm/test/CodeGen/AArch64/stream-neon.ll
- Line 1 (on Diff #127610): Probably not. Perhaps the same could be accomplished if ldst-opt.ll were changed to test for the same targets as this test, when the result would be the same.
Here are some results I got on a Juno board with `-mcpu=cortex-a57`:
| SPEC CPU2000 | % |
|---|---|
| 164.gzip | 0.0 |
| 175.vpr | -0.4 |
| 176.gcc | 0.0 |
| 177.mesa | 0.8 |
| 179.art | 0.1 |
| 181.mcf | -0.3 |
| 183.equake | 0.9 |
| 186.crafty | 0.8 |
| 188.ammp | 0.0 |
| 197.parser | 0.3 |
| 252.eon | -0.2 |
| 253.perlbmk | -0.2 |
| 254.gap | 0.1 |
| 255.vortex | 0.0 |
| 256.bzip2 | 0.5 |
| 300.twolf | 5.1 |
I've thought about this some more and tested it out on Falkor. As currently written, this change prevents pre/post increments from being folded into SIMD store instructions, causing minor performance regressions. I have the following general reservations as well:
- Does using the max latency of the load/store and the add make sense, given that the operations are dependent?
- Does always favoring latency over the number of uops (an approximation of throughput) make sense? Unless the operation is on the critical path, I would think not.
Combined with the assumptions about multiple-uop instructions (which also do not hold for Falkor), I would suggest that perhaps a better approach would be to add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.
I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?
> I have the following general reservations as well:
> - Does using the max latency of the load/store and the add make sense, given that the operations are dependent?

They're only dependent for the pre-index addressing mode. However, since the latency of the load or store is considerably larger even in this case, methinks that it's a sensible approximation.

> - Does always favoring latency over the number of uops (an approximation of throughput) make sense? Unless the operation is on the critical path, I would think not.

In previous versions of this patch I tried to weigh both metrics, but found it difficult to come up with a satisfying heuristic. Any ideas?

> Combined with the assumptions about multiple-uop instructions (which also do not hold for Falkor), I would suggest that perhaps a better approach would be to add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.

Perhaps the cost function could be made target specific. Thoughts?
Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?
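A hypothetical sketch of that alternative, modeled on how STRQroIsSlow is already handled; the feature name and accessor below are invented for illustration and do not exist in the tree:

```cpp
// In AArch64Subtarget (hypothetical addition, mirroring STRQroIsSlow):
//   bool SlowPrePostIndexLdSt = false;
//   bool isPrePostIndexLdStSlow() const { return SlowPrePostIndexLdSt; }

// Then, wherever the pre/post-indexed forms are created (this pass, ISel,
// frame lowering), bail out for subtargets that set the flag:
if (Subtarget->isPrePostIndexLdStSlow())
  return false; // keep the separate load/store and base-register add/sub
```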
Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also allows removing the feature FeatureSlowPaired128 in D40107.
That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?
FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another one when the next generation specializes it further. Instead, querying the scheduling model seems to be a much more reasonable approach.
If a test with -mcpu=falkor were added to llvm/test/CodeGen/AArch64/ldst-opt.ll, what should the checks look like?
Also, can you please address my first question in https://reviews.llvm.org/D39976#999520 ?
I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.
I'm also concerned that this change introduces a very specific target hook and recomputes the same "slowness" of opcodes over and over, even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.
I would expect this change not to change the code generated for Falkor, so the checks should match whatever is generated currently.
That's not how I understand the scheduling model to work. Resources are used for 'ResourceCycles' cycles (which defaults to 1), so in this case ST and VSD are used for 1 cycle. The Latency of the store is set to 0, since it doesn't write a register, so the latency doesn't mean anything as far as I can tell. The pre/post increment version has a Latency of 3 on the first def index (i.e. the updated base register) since that is the latency of reading this new register value.
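An illustration of that distinction using the MC-layer accessors available at the time of this review; the in-scope names TSM, STI, and MI are assumptions:

```cpp
// Per-def latency and resource usage are separate pieces of data in the
// scheduling model; a Latency of 0 says nothing about how long the ST and VSD
// units stay busy.
const MCSchedClassDesc *SC = TSM.resolveSchedClass(&MI);
// Latency of the first defined register (e.g. the updated base register of a
// pre/post-indexed form); only meaningful when the instruction defines one.
const MCWriteLatencyEntry *WL = STI->getWriteLatencyEntry(SC, /*DefIdx=*/0);
(void)WL; // shown only to contrast with the resource usage below
// Cycles each resource is occupied; ResourceCycles defaults to 1 in the .td
// files, independently of the latency above.
for (const MCWriteProcResEntry &PR : llvm::make_range(
         STI->getWriteProcResBegin(SC), STI->getWriteProcResEnd(SC)))
  (void)PR.Cycles; // e.g. ST and VSD busy for 1 cycle each
```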
This change is more generic and flexible than FeatureSlowPaired128. It controls not only when loads and stores are paired, but also the other foldings that this pass performs, including the pre or post indexing of the base register.
> I'm also concerned that this change introduces a very specific target hook and recomputes the same "slowness" of opcodes over and over, even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.

AFAIK, the code performs table lookups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform on M1 and M3, it's likely that more differences will come in their successors.
That's not what the code looks like it is doing. isReplacementProfitable() is only being called from mergeUpdateInsn(), which is only called when folding to form base register incrementing load/stores.
> I'm also concerned that this change introduces a very specific target hook and recomputes the same "slowness" of opcodes over and over, even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.

> AFAIK, the code performs table lookups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform on M1 and M3, it's likely that more differences will come in their successors.

The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this; it just seems to me like too specific a subtarget hook to be adding. I would appreciate hearing what other people think about this.
Only for now, as it's a generic enough interface to be used by many other peephole optimizations.
> AFAIK, the code performs table lookups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform on M1 and M3, it's likely that more differences will come in their successors.

> The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this; it just seems to me like too specific a subtarget hook to be adding. I would appreciate hearing what other people think about this.
So am I.
Thank you for the feedback.
This is just moving some code from optimizeBlock? If it's unrelated, having it in a separate patch would make it slightly easier to review, IMO. :)