This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Assign cost of scaling used in addressing mode for ARM cores
ClosedPublic

Authored by javed.absar on Sep 23 2016, 3:22 AM.

Details

Summary

This patch assigns cost of the scaling used in addressing. On many ARM cores, a negated register offset takes longer than a non-negated register offset in a register-offset addressing mode. For instance:

  1. LDR R0, [R1, R2 LSL #2]
  2. LDR R0, [R1, -R2 LSL #2]

Above, (1) takes less cycles than (2).

By assigning appropriate scaling factor cost, we enable the LLVM to make the right trade-offs in the optimization and code-selection phase.

The patch improves the performance as follows –

Cortex-A53 :   spec.twolf:  2.4%, ShootoutC++_matrix: 28.4%,  Stanford/Puzzle:  12.4%, IndirectAddressing-dbl:  5.49%
Cortex-A57 :  spec2006.hmmer: 1.5% , spec2006.lbm: 1.1%

The patch also improves performance on other third-party benchmarks

Diff Detail

Repository
rL LLVM

Event Timeline

javed.absar retitled this revision from to Assign cost of scaling used in addressing mode for ARM cores.
javed.absar updated this object.
javed.absar added a subscriber: llvm-commits.
jmolloy added inline comments.
lib/Target/ARM/ARMISelLowering.cpp
11434 ↗(On Diff #72247)

Where did "2" come from here? Why not "1"? (a rationale would be useful)

This seems very broad-brush. I'd like to see it restricted down to a number of cores where this actually costs cycles. I have no idea if Apple's cores have this property, for example. And do our large cores, like Cortex-A72?

javed.absar retitled this revision from Assign cost of scaling used in addressing mode for ARM cores to [ARM] Assign cost of scaling used in addressing mode for ARM cores.

Hi James:
Thanks for the review and feedback.
I have made changes based on your comments:

  1. Have reduced the scope to Cortex-A53 and Cortex-A57 where I can confirm the performance gains currently based on benchmark runs (results shared previously).
  2. Have adjusted the cost from '2' to '1' as this is sufficient to differentiate.

Thanks

rengolin requested changes to this revision.Sep 26 2016, 9:18 AM
rengolin edited edge metadata.

Hi Javed,

The improvements are indeed impressive, but the implementation is in the wrong place. We're moving away from using CPU flags and CPU specific code gen decisions. We need to move this into a proper cost model.

cheers,
--renato

lib/Target/ARM/ARMSubtarget.h
415 ↗(On Diff #72473)

We don't want these at all. Please, make sure this is a feature flag, or a cost model / scheduler decision.

This revision now requires changes to proceed.Sep 26 2016, 9:18 AM

Also, you mention all the improvements, but not regressions. Can you share the full picture of the benchmark results?

Hi Renato:
Here is the full list of performance regressions and improvements.

Cortex-A53
--Performance Regressions
----MultiSource/Applications/SIBsim4/SIBsim4.............................................................................. 2.00%
---MultiSource/Applications/sqlite3/sqlite3......................................................................................1.55%

++Performance Improvements
++++SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix..........................28.42%
++++SingleSource/Benchmarks/Stanford/Puzzle......................................................................12.40%
++++MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.....5.49%
++++SingleSource/Benchmarks/Misc/ReedSolomon.................................................................4.32%
++++MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt..3.39%
++++MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt..........................2.05%
++++SingleSource/Benchmarks/Shootout/matrix .......................................................................1.95%
++++MultiSource/Benchmarks/mafft/pairlocalalign....................................................................1.63%
++++MultiSource/Applications/obsequi/Obsequi........................................................................1.41%
++++MultiSource/Benchmarks/7zip/7zip-benchmark................................................................1.09%

++++External/SPEC/CINT2000/300.twolf/300.twolf......................................................................2.38%

Cortex-A57
--Performance Regressions
----MultiSource/Applications/hexxagon/hexxagon.....................................................................5.59%
----MultiSource/Benchmarks/Ptrdist/anagram/anagram..........................................................5.46%
----SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists......................................3.15%
----External/SPEC/CINT2000/253.perlbmk/253.perlbmk............................................................1.64%

++Performance Improvements
++++MultiSource/Benchmarks/mafft/pairlocalalign.............................................................12.46%
++++SingleSource/Benchmarks/Shootout-C++/Shootout-C++-matrix.........................5.50%
++++SingleSource/Benchmarks/Shootout/matrix ...................................................................4.88%
++++SingleSource/Benchmarks/Stanford/Puzzle.....................................................................4.36%
++++MultiSource/Benchmarks/Fhourstones/fhourstones.....................................................1.90%
++++MultiSource/Applications/SIBsim4/SIBsim4......................................................................1.56%
++++SingleSource/Benchmarks/Misc/ReedSolomon...............................................................1.27%

++++spec.cpu2006.ref.456_hmmer..................................................................................................1.55%
++++spec.cpu2006.ref.470_lbm.........................................................................................................1.07%

Best Regards
Javed

javed.absar updated this revision to Diff 73480.Oct 4 2016, 8:05 AM
javed.absar edited edge metadata.

Hi Renato:

I have now changed the implementation as you requested (sub-target feature flag). Also I have shared the full list of improvements and regressions of the benchmarks. Please let me know if this is all ok now.

Best Regards
Javed

Hi Renato:

I have changed the implementation as you suggested (sub-target feature flag). Also have shared the full list of improvements and regressions for the benchmarks. Please let me know if you are satisfied and I can commit the patch.
Thanks and Best Regards
Javed

rengolin added inline comments.Oct 11 2016, 8:06 AM
lib/Target/ARM/ARMISelLowering.cpp
11604 ↗(On Diff #73480)

This is not really a cost model... but I don't have a better idea. This is a very specific feature.

@jmolloy, are you happy with this?

11608 ↗(On Diff #73480)

I guess -1 here means illegal, though without seeing the code that is using it, I can't assume.

lib/Target/ARM/ARMISelLowering.h
299 ↗(On Diff #73480)

Where is this being used?

javed.absar added inline comments.Oct 11 2016, 8:18 AM
lib/Target/ARM/ARMISelLowering.cpp
11608 ↗(On Diff #73480)

You are right. The same is mentioned in Analysis/TargetTransformInfo.h wherein it is stated "If the AM is not supported, it returns a negative value".

lib/Target/ARM/ARMISelLowering.h
299 ↗(On Diff #73480)

It is used in LoopStrengthReduce, where decisions to transform a loop is taken based on a cost-model that takes scaling-factor cost into consideration.

Whenever @jmolloy is happy, I'm happy. Thanks for the changes.

lib/Target/ARM/ARMISelLowering.cpp
11608 ↗(On Diff #73480)

Right, ok.

lib/Target/ARM/ARMISelLowering.h
299 ↗(On Diff #73480)

Oh, right, I missed the "override".

jmolloy added inline comments.Oct 12 2016, 5:43 AM
lib/Target/ARM/ARMISelLowering.cpp
11604 ↗(On Diff #73480)

I know you asked for it to be a feature flag, but... I'm not particularly keen on it as a feature flag. Are we moving towards feature flags for cost model decisions now? :/

Wouldn't something like:

switch (Subtarget->getCPUName()) {
default: return 0;
case Cortex-R52:
   return AM.Scale < 0 ? 1 : 0;
}

be alright? It's more cost-model-like...

rengolin added inline comments.Oct 12 2016, 5:59 AM
lib/Target/ARM/ARMISelLowering.cpp
11604 ↗(On Diff #73480)

Good lord, no! We're moving away from core checks. :)

jmolloy accepted this revision.Oct 12 2016, 6:15 AM
jmolloy added a reviewer: jmolloy.
jmolloy added inline comments.
lib/Target/ARM/ARMISelLowering.cpp
11604 ↗(On Diff #73480)

Huh. Well it LGTM then. The cost model (1 or 0) doesn't seem particularly bad.

Hi James/Renato:

Please could one of you mark this as 'accepted' if you are ok with it?

Thanks

Hi James/Renato:

Please could one of you mark this as 'accepted' if you are ok with it?

Thanks

I did already.

This revision was automatically updated to reflect the committed changes.