This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Query the target when folding loads and stores
Needs Review · Public

Authored by evandro on Nov 13 2017, 1:17 PM.

Details

Summary

Consider the processor cost model when folding loads and stores to use the pre or post index addressing modes.

Event Timeline

evandro created this revision. Nov 13 2017, 1:17 PM
fhahn added a subscriber: fhahn. Nov 13 2017, 2:26 PM
fhahn added inline comments.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
598

This is just moving some code from optimizeBlock? If that's unrelated, having this in a separate patch would make it slightly easier to review IMO :)

1668

This is just moving some code from optimizeBlock?

mcrosier edited edge metadata. Nov 14 2017, 6:48 AM

The load/store opt pass is already pretty expensive in terms of compile-time. Did you see any compile-time regressions in your testing? Also, what performance results have you collected?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
599

+1 to Florian's comment. Also, this comment should remain in optimizeBlock.

1403–1404

Should be

} else {
mcrosier added inline comments. Nov 14 2017, 6:48 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
660

You could just add SM and STI to the AArch64LoadStoreOpt class and then initialize it in runOnMachineFunction.
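A minimal sketch of what that might look like (member names taken from this comment; the exact signatures are an assumption, not necessarily what the patch does):

const AArch64Subtarget *STI = nullptr;  // cached in the pass class
const MCSchedModel *SM = nullptr;       // so candidates don't re-query the subtarget

bool AArch64LoadStoreOpt::runOnMachineFunction(MachineFunction &Fn) {
  STI = &Fn.getSubtarget<AArch64Subtarget>();
  SM = &STI->getSchedModel();
  bool Modified = false;
  // ... the existing per-block optimization runs here, using STI and SM ...
  return Modified;
}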

689

You can drop this else.

1373

Add a comment:

// Evaluate if the new instruction is a better choice than the old ones.
1374

I think you can return NextI here. This will ensure we skip the Update instruction (i.e., the base address add/sub), which is never a candidate for LdStUpdateMerging.

The load/store opt pass is already pretty expensive in terms of compile-time. Did you see any compile-time regressions in your testing? Also, what performance results have you collected?

The cost model check is done at the very end, when we are merging instructions, after most of the other checks are already done. Having said that, yes, a compile-time regression test is a good idea. As for performance, A72 and A57 both have higher latency with a decoder bubble on loads with pre/post update. Cases where we look at the cost model and decide not to create the load with update will definitely benefit. If we were to leave the LD/ADD in the epilogue and this pass did not create a load-with-update instruction, that's a benefit as well.

junbuml edited edge metadata. Nov 14 2017, 9:30 AM

No objection at a high level. IMHO, it will not cause a significant compile-time regression, but it is better to make sure. I also agree with splitting this patch: one for the profitability check and one for the refactoring.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
656

No need to repeat this check for every candidate.

674–676

Can we guarantee these pointers are always non-null?

evandro marked 10 inline comments as done. Nov 14 2017, 11:56 AM

There's some code refactoring here that I'll put in a separate patch soon.

Also, since the extra check is only performed for candidate instructions, the impact on compile time is negligible.

The performance difference can be significant, especially in FP benchmarks with tight loops over gobs of data.

Thank you for your feedback.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
674–676

Yes because getSchedClassDesc() returns the address of a static array element.

evandro updated this revision to Diff 122891. Nov 14 2017, 11:58 AM
evandro marked an inline comment as done.

Please let me know if I addressed your concerns before I split the code refactoring into another patch.

Thank you,

Please let me know if I addressed your concerns before I split the code refactoring into another patch.

One minor comment, but feel free to go ahead and split the patch.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
657

Rather than repeatedly call hasInstrSchedModel(), why not just store the result in a bool, much like you did with OptForSize?
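For example (a sketch; the variable name is just illustrative):

// Computed once per function, next to OptForSize.
const bool HasInstrSchedModel = SM->hasInstrSchedModel();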

evandro marked an inline comment as done. Nov 14 2017, 3:59 PM
evandro updated this revision to Diff 123048. Nov 15 2017, 9:52 AM

Updated the patch to contain just the cost model evaluation, after moving the code refactoring to a separate patch, D40090.

junbuml added inline comments. Nov 16 2017, 8:53 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

Why don't we use <= instead of <?

evandro added inline comments. Nov 16 2017, 11:52 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

Because typically decoding an instruction into multiple uops costs more in the front end of the pipeline than decoding multiple instructions into single uops.

evandro updated this revision to Diff 123361. Nov 17 2017, 9:23 AM

Clarify the heuristics used when evaluating the costs.

evandro marked 2 inline comments as done. Nov 17 2017, 9:26 AM
junbuml added inline comments. Nov 17 2017, 9:40 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

Can you add more comments on why you do this? Is this target independent?

evandro added inline comments. Nov 17 2017, 10:49 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into more than one uop. How bad the hiccup is, if at all, depends on the target though. Some decrease the decode bandwidth, typically by inserting a bubble whose size depends on the design. This heuristic is an attempt at keeping the new instruction from inducing such a bubble.

If the new instruction has a shorter latency, then it's chosen. One might wonder if it's still a good choice if it induces a bubble, but I could not devise a satisfying heuristic.

If the latency of the new instruction is the same as the combined latency of both old ones, then the potential of inducing a bubble is considered. If either of the old instructions had multiple uops, then even if the new one has them too, it's probably no worse than before. However, if neither of the old instructions resulted in multiple uops, the new one is chosen only if it results in fewer uops than before. One might argue that, if bubbles when decoding into multiple uops are the norm among targets, it'd be better to choose the new instruction only if it doesn't potentially induce bubbles itself.

If the new instruction has a longer latency, then it's discarded. Again, if it mitigates decode bubbles it might still be profitable, but the conditions seem hard to weigh in general.
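In rough pseudo-C++, the decision described above (the names are illustrative, not the actual variables in the patch):

if (NewLat < OldLat)
  return true;   // shorter latency: take the new instruction
if (NewLat > OldLat)
  return false;  // longer latency: keep the old pair
// Same latency: weigh the potential for a decode bubble.
if (OldUopsLdSt > 1 || OldUopsAdd > 1)
  return true;   // a multi-uop decode was already there; no worse than before
// Neither old instruction was multi-uop: merge only if it saves uops.
return NewUops < OldUopsLdSt + OldUopsAdd;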

evandro updated this revision to Diff 124428. Nov 27 2017, 11:39 AM
evandro set the repository for this revision to rL LLVM.

Try to make the heuristics code clearer.

evandro updated this revision to Diff 124461. Nov 27 2017, 1:51 PM

Update test cases.

evandro updated this revision to Diff 124649. Nov 28 2017, 3:21 PM

Add diagnostic messages.

For me, the heuristic seems pretty reasonable, but I'm not fully sure if it can make the right decision target-independently. Someone else who has a better idea at the hardware level may need to review this.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
691

Isn't it possible to just do TSM.getNumMicroOps(&MIA) and remove the *SCA = TSM.resolveSchedClass(&MIA) above in line 678?

707

I'm not sure if it's okay to compare values from computeInstrLatency() and getNumMicroOps().

evandro marked an inline comment as done. Dec 6 2017, 1:23 PM
evandro added a subscriber: apinski-cavium.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
691

Yes, it is.

707

That's what the heuristic does, as briefly explained in the debug message below.

evandro updated this revision to Diff 126199. Dec 8 2017, 1:06 PM
evandro marked an inline comment as done.

Add a test case for the STREAM benchmark checking the results for most targets. This way, the target maintainers can evaluate the effects of this patch on the code generated for their respective targets.

evandro marked an inline comment as done. Dec 8 2017, 1:08 PM

¡Ping! 🔔🔔

gberry added inline comments. Dec 18 2017, 11:21 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

It seems like if you want to take this multi-uop bubble into account, you should make it explicit, and make the subtarget opt in to it.
The simple heuristic here to me would be:

if (newLatency < oldLatency)
  return true;
if (newLatency > oldLatency)
  return false;
if (newUops < oldUops)
  return true;
if (newUops > oldUops)
  return false;
if (newMultiUopPenalty < oldMultiUopPenalty)
  return true;
return false;
707

I think Jun's point was that these values are in completely different units, so comparing them doesn't make sense.

evandro added inline comments. Dec 18 2017, 11:49 AM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695

How do you propose that the new penalty be calculated?

707

It's very common in process control to use one variable to affect another, as long as there's at least an implied correlation between them.

evandro updated this revision to Diff 127429. Dec 18 2017, 3:50 PM

Attempt at interpreting @gberry's suggestion.

evandro updated this revision to Diff 127610. Dec 19 2017, 3:31 PM

Like the previous update, but favoring code size on A57, the only other target I can test on, while keeping the same performance.

¡Ping! 🔔🔔

gberry added inline comments.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
714

I would suggest that, unless we know of a case where this makes a difference, you make the simpler change of just doing

return UopDif <= 0

at this point, assuming that has the desired effect for your target. This assumes that with latency and uops being equal, a reduction in instructions could still be beneficial.

llvm/test/CodeGen/AArch64/ldst-opt.ll
1034

Are these changes intentional?

1343

Are all of these changes needed?

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
98

Are these changes needed?

llvm/test/CodeGen/AArch64/stream-neon.ll
1 (On Diff #127610)

This seems like it is too large to be a lit test. Is it really testing anything that hasn't been tested already?

evandro added inline comments. Jan 11 2018, 1:15 PM
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
714

This would cause the test below, stream-neon.ll, to fail for the Exynos targets. The reason is that, for instance, though loading a pair of quad registers with the post-index addressing mode has the same latency as when using the register-offset addressing mode, it is a two-uop instruction, which severely limits the decode bandwidth by inserting a bubble. This is a fairly common issue in other targets too, which this heuristic tries to capture, assuming that it's better to keep the decode bandwidth for a couple of single-uop instructions than to incur a decode bubble for a multiple-uop instruction. Except, of course, when optimizing for size.

llvm/test/CodeGen/AArch64/ldst-opt.ll
1034

Yes, though not necessary. Checking for the return is not germane to the feature being tested.

1343

Ditto.

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
98

Yes, since the changes in this patch affect the resulting code for this test, unless it's optimized for size.

llvm/test/CodeGen/AArch64/stream-neon.ll
1 (On Diff #127610)

Probably not. Perhaps the same could be accomplished if ldst-opt.ll were changed to test for the same targets as this test, when the result would be the same.

¡Ping! 🔔🔔

Here are some results I got on a Juno board with -mcpu=cortex-a57:

SPECint2000    %
164.gzip       0.0
175.vpr       -0.4
176.gcc        0.0
177.mesa       0.8
179.art        0.1
181.mcf       -0.3
183.equake     0.9
186.crafty     0.8
188.ammp       0.0
197.parser     0.3
252.eon       -0.2
253.perlbmk   -0.2
254.gap        0.1
255.vortex     0.0
256.bzip2      0.5
300.twolf      5.1

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions. I have the following general reservations as well:

  • does using the max latency of the load/store and add make sense given that the operations are dependent?
  • does always favoring latency over number of uops (an approximation of throughput) make sense? Unless the operation is on the critical path, I would think not.

Given this, combined with the assumptions about multiple-uop instructions (which also are not true for Falkor), I would suggest that perhaps a better approach would be to add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.

evandro added a comment (edited). Feb 6 2018, 12:52 PM

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?

I have the following general reservations as well:

  • does using the max latency of the load/store and add make sense given that the operations are dependent?

They're only dependent for the pre-index addressing mode. However, since the latency of the load or store is considerably larger than that of the add even in this case, methinks that it's a sensible approximation.

  • does always favoring latency over number of uops (an approximation of throughput) make sense? Unless the operation is on the critical path, I would think not.

In previous versions of this patch I tried to weigh both metrics, but found it difficult to come up with a satisfying heuristic. Any ideas?

Given this, combined with the assumptions about multiple-uop instructions (which also are not true for Falkor), I would suggest that perhaps a better approach would be to add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.

Perhaps the cost function could be put behind a target-specific hook, then. Thoughts?

evandro updated this revision to Diff 133518. Feb 8 2018, 3:29 PM
evandro retitled this revision from [AArch64] Consider the cost model when folding loads and stores to [AArch64] Query the target when folding loads and stores.

Make the heuristics processor specific.

¡Ping! 🔔🔔

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?
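For reference, such a flag might look roughly like this (a hypothetical sketch; STRQroIsSlow is the existing precedent, the other names are made up):

// Hypothetical AArch64Subtarget flag, set per CPU, analogous to STRQroIsSlow.
bool SlowPrePostIdxVec128 = false;
bool slowPrePostIdxVec128() const { return SlowPrePostIdxVec128; }

// In the load/store optimizer (and potentially ISel), before forming the fold:
if (Subtarget->slowPrePostIdxVec128() && isQuadVecLdSt(MI.getOpcode()))
  return false;  // hypothetical early-out: skip the pre/post-index fold

(isQuadVecLdSt() is likewise hypothetical.)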

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another one when the next generation needs to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

Were a test with -mcpu=falkor added to llvm/test/CodeGen/AArch64/ldst-opt.ll, what should the checks look like?

Also, can you please address my first question in https://reviews.llvm.org/D39976#999520?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another one when the next generation needs to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

Were a test with -mcpu=falkor added to llvm/test/CodeGen/AArch64/ldst-opt.ll, what should the checks look like?

I would expect this change not to alter the code generated for Falkor, so whatever is generated currently.

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?

That's not how I understand the scheduling model to work. Resources are used for 'ResourceCycles' cycles (which defaults to 1), so in this case ST and VSD are used for 1 cycle. The Latency of the store is set to 0, since it doesn't write a register, so the latency doesn't mean anything as far as I can tell. The pre/post increment version has a Latency of 3 on the first def index (i.e. the updated base register) since that is the latency of reading this new register value.
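To make the numbers being discussed concrete, they can be read back from the model roughly like so (a sketch using the TargetSchedModel calls already mentioned in this review; TSM and StoreMI are placeholders, and the commented values are the Falkor figures quoted above):

const MCSchedClassDesc *SC = TSM.resolveSchedClass(&StoreMI);
unsigned Uops = TSM.getNumMicroOps(&StoreMI, SC);   // e.g. 4 for the SIMD store
unsigned Lat  = TSM.computeInstrLatency(&StoreMI);  // 0, since the plain store defines no register
// The pre/post-indexed form instead reports a latency of 3 for its first def,
// i.e. the updated base register.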

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another one when the next generation needs to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

This change is more generic and flexible than FeatureSlowPaired128. It controls not only when loads and stores are paired, but also the other foldings that this pass performs, including the pre or post indexing of the offset register.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another one when the next generation needs to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

This change is more generic and flexible than FeatureSlowPaired128. It controls not only when loads and stores are paired, but also the other foldings that this pass performs, including the pre or post indexing of the offset register.

That's not what the code looks like it is doing. isReplacementProfitable() is only being called from mergeUpdateInsn(), which is only called when folding to form base register incrementing load/stores.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there are going to be differences in future CPUs.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this; it just seems to me like too specific a subtarget hook to be adding. I would appreciate hearing what other people think about this.

This change is more generic and flexible than FeatureSlowPaired128. It controls not only when loads and stores are paired, but also the other foldings that this pass performs, including the pre or post indexing of the offset register.

That's not what the code looks like it is doing. isReplacementProfitable() is only being called from mergeUpdateInsn(), which is only called when folding to form base register incrementing load/stores.

For now, yes, but it's a generic enough interface to be used by many other peephole optimizations.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this; it just seems to me like too specific a subtarget hook to be adding. I would appreciate hearing what other people think about this.

So would I.

Thank you for the feedback.

mcrosier resigned from this revision. Apr 11 2018, 11:29 AM