This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Thumb2: favor R4-R7 over R12/LR in allocation order when opt for minsize
ClosedPublic

Authored by ostannard on Feb 23 2017, 11:39 PM.

Details

Summary

For Thumb2, we prefer low regs (costPerUse = 0) to allow narrow encoding. However, current allocation order is like:

R0-R3, R12, LR, R4-R11

As a result, a lot of instructs that use R12/LR will be wide instrs.

This patch changes the allocation order to:

R0-R7, R12, LR, R8-R11

for thumb2 and -Oz.

In most cases, there is no extra push/pop instrs as they will be folded into
existing ones. There might be slight performance impact due to more stack
usage, so we only enable it when opt for min size.

For an embedded application with 83K code, this patch saves 430 bytes (0.5%).

Diff Detail

Repository
rL LLVM

Event Timeline

weimingz created this revision.Feb 23 2017, 11:39 PM
eastig added a subscriber: eastig.Feb 24 2017, 2:14 AM
rengolin edited edge metadata.

I'm not sure about this change. I think it makes sense, but I don't think it's done in a way to integrate with the existing code.

I'm adding Quentin (the code owner for the register allocator) to have a more informed opinion.

I'd also expect some benchmarks (speed and size) on a few benchmarks, hopefully in a few thumb2 architectures. A 0.5% code size improvement on "an embedded application" isn't enough, I think.

cheers,
--renato

lib/CodeGen/RegisterClassInfo.cpp
129

I'm not very knowledgeable in this part of the code, but it seems you're destroying everything the above code was trying to do with the RCI order. It looks to me as though you should try to add the logic into the loop above, rather than splitting and discarding.

lib/Target/ARM/ARMBaseRegisterInfo.cpp
58

This option seems unnecessary. I mean, it's good to have it in order to track performance, but this change looks beneficial to all thumb targets. So, unless this brings performance or code size issue for some but not all thumb, I think this should be removed, and make the cost be calculated solely on isThumb2.

efriedma added subscribers: llvm-commits, efriedma.

Please repost with llvm-commits on the CC list, so the patch gets sent to the mailing list.

Sure, I will get more numbers for other benchmarks.

lib/CodeGen/RegisterClassInfo.cpp
129

The code will maintain the original order if two registers have the same priority by using stable_sort: one register has lower CostPerUse, it has higher priority. Otherwise, the caller saved register have higher priority. Everything equal, the original order is kept.

qcolombet added inline comments.Mar 7 2017, 12:47 PM
lib/CodeGen/RegisterClassInfo.cpp
129

Renato is right, you're doing more that changing the order of the CSRs.
Although this is probably not used in-tree, there is nothing that prevents to set a different cost for each register. Thus a stable sort on the cost per use may give a very different order than the raw order.

For instance, the raw order could well be in terms of cost per use {10, 8, 3, 12, etc.}.

Instead, what I would suggest is:

  • Define an alternative order for MF
  • Add a callback just to disable the special case for CSR, i.e., the callback you have works, just use it inside the loop not to populate the CSRAlias
weimingz updated this revision to Diff 95889.Apr 19 2017, 11:03 PM

Below are the measurements of text size of benchmarks.
Build flag: "-Oz -mthumb -mcpu=cortex-m3 -fomit-frame-pointer"
Baseline has extra flag: " -mllvm -arm-favor-r4-r7=false"

benchmarkbaselinefavor low regreductionreduction (%)
spec2000/ammp73304729843200.436538
spec2000/art8459845900
spec2000/bzip21877118739320.170476
spec2000/crafty1335061332422640.197744
spec2000/eon2175612172513100.142489
spec2000/equake1140911373360.31554
spec2000/gap2677362673124240.158365
spec2000/gcc8009028000508520.10638
spec2000/gzip2156321515480.222604
spec2000/mcf53665308581.08088
spec2000/mesa30572130427914420.471672
spec2000/parser58571584591120.191221
spec2000/perlbmk3110203107242960.0951707
spec2000/twolf119785119905-120-0.100179
spec2000/vortex3221333215815520.171358
spec2000/vpr90682905701120.123509
coremark62816188931.48066
spec2006/astar1928219284-2-0.0103724
spec2006/bzip23399133897940.276544
spec2006/dealII1864927186170732200.172661
spec2006/gcc1835601183353120700.11277
spec2006/gobmk1151212114954216700.145065
spec2006/h264ref3356523352184340.129301
spec2006/hmmer1530251526933320.216958
spec2006/lbm72827270120.16479
spec2006/libquantum1992619888380.190706
spec2006/mcf54845446380.692925
spec2006/milc64512642362760.427827
spec2006/namd1541101538302800.181688
spec2006/omnetpp4297504295761740.0404887
spec2006/perlbench6032756029473280.0543699
spec2006/povray5555245548866380.114847
spec2006/sjeng86646863023440.397018
spec2006/soplex2106912101695220.247756
spec2006/sphinx31008391006411980.196353
spec2006/xalancbmk2819035281788711480.0407232

Hi Weiming,

Sorry for the delay. I have two major points with this approach:

  1. We're basically selecting different patterns based on different flags by crossing too many wires (table-gen reg definition, sub-target feature, command line option, optimisation flags). Dynamically changing with command line options is going to be harder to test and reproduce user bugs.
  2. The benchmark numbers for code size reduction aren't still really *that* encouraging.

Ultimately, having a simple ARM/Thumb1/2 set would be as far as I'd go.

@qcolombet is there some prior art for that kind of fiddling? By looking at the other register files, the most that happens is things like "is64bit", in the same way we have "isThumb1".

cheers,
--renato

ostannard commandeered this revision.Jul 2 2019, 5:12 AM
ostannard added a reviewer: weimingz.
ostannard updated this revision to Diff 207525.Jul 2 2019, 5:34 AM
ostannard edited the summary of this revision. (Show Details)
ostannard edited reviewers, added: dmgreen; removed: jmolloy.
  • Rebase onto trunk.
  • Move all controlling conditions into ARMSubtarget
  • Remove command-line option (the only reason to disable this is benchmarking it)
  • Simplify test with inline asm, should be more robust

This is giving me ~0.1% code size reduction on SPEC2006 (comparing -Oz without this patch to -Oz with it), and ~0.25% reduction on 8 mbed-os examples.

Herald added a project: Restricted Project. · View Herald TranscriptJul 2 2019, 5:34 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
jmolloy added a subscriber: jmolloy.Jul 2 2019, 6:37 AM
jmolloy added inline comments.
llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
295 ↗(On Diff #207525)

All virtual functions should have a docstring.

llvm/lib/CodeGen/RegisterClassInfo.cpp
118 ↗(On Diff #207525)

Although you're not using it here, it seems to me very natural for such a callback to take the PhysReg too.

ostannard marked an inline comment as done.Jul 3 2019, 2:12 AM
ostannard added inline comments.
llvm/lib/CodeGen/RegisterClassInfo.cpp
118 ↗(On Diff #207525)

I'd rather not add extra code which we don't currently have a use for. That can easily be done later if it does turn out to be useful.

ostannard updated this revision to Diff 207725.Jul 3 2019, 2:13 AM
  • Added doc string
  • Remembered to git-add the test this time
jmolloy added inline comments.Jul 3 2019, 2:21 AM
llvm/lib/CodeGen/RegisterClassInfo.cpp
118 ↗(On Diff #207525)

I agree with the principle and would usually advocate for it. But in this case the SubtargetInfo API is a public API that is used by out-of-tree targets. Having a sensible API that isn't overfit to the current in-tree targets when it's ~zero cost is something we should aim for, IMO.

ostannard updated this revision to Diff 207731.Jul 3 2019, 2:36 AM
  • Add a PhysReg parameter to ignoreCSRForAllocationOrder
  • Check that the register is a GPR in the ARM implementation. The other register classes have the callee-saved regs last, so this doesn't make any difference to the generated code, but might avoid surprising behaviour in the future.
jmolloy accepted this revision.Jul 3 2019, 2:42 AM

LGTM, thanks for the change Oliver!

This revision is now accepted and ready to land.Jul 3 2019, 2:42 AM