This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add heuristic to avoid lowering calls to blx for Thumb1 in ARMTargetLowering::LowerCall
Needs ReviewPublic

Authored by prathamesh on Sep 9 2020, 8:54 PM.

Details

Reviewers
dmgreen
efriedma
Summary

Hi,
This is a follow-up on https://reviews.llvm.org/D79785.
This patch implements a heuristic to avoid lowering calls to blx if MF.getFunction().arg_size() + Outs.size() < (number of registers) - 1, since we need at least one register for holding function's address. It converts all calls to bl for the attached test-case. However it might not be able to detect cases when we need more than one register to compute arguments. For that, the approach in D79785, can catch some of these, by folding tLDRpci, tBLXr -> tBL.
Does this patch look reasonable ?
Testing with make check-llvm with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON shows no unexpected failures.

I have a couple of questions:
(a) How do we get number of available registers for subtarget in LowerCall ?
(b) I assume Outs.size() will correspond to number of arguments passed to the function ?
TargetLowering.h has following comment above LowerCall():

/// The outgoing arguments to the call are described by the Outs array,
/// and the values to be returned by the call are described by the Ins

/// array.

Thanks!

Diff Detail

Event Timeline

prathamesh created this revision.Sep 9 2020, 8:54 PM
prathamesh requested review of this revision.Sep 9 2020, 8:54 PM
dmgreen added inline comments.Sep 10 2020, 12:20 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

Hmm. Not sure. Perhaps this can use something like getRegClassFor(MVT::i32)->getNumRegs()? It's still probably a very rough estimate of allocatable registers.

llvm/test/CodeGen/ARM/minsize-call-cse-2.ll
11

Can you add more tests of various sizes.

efriedma added inline comments.Sep 10 2020, 2:15 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

I don't understand the intent here.

In Thumb1 mode, there are four callee-save registers that are considered allocatable: r4-r7. We must use one of them to store the address of an indirect call. (We could potentially use high registers, but that isn't implemented.) None of them are ever used to pass arguments. Given that, why does the number of arguments to the function matter? Why does the number of caller-save registers matter?

prathamesh added inline comments.Sep 11 2020, 5:03 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

IIUC, r0-r3 are caller saved, and before making any calls, they need to be copied into remaining registers or saved to memory.

For example:

define void @f(i32 %x, i32 %y, i32 %z, i32 %w) optsize minsize  {
entry:
  call void @g(i32 %x, i32 %y)
  call void @g(i32 %x, i32 %y)
  call void @g(i32 %x, i32 %y)
  call void @h(i32 %z, i32 %w)
  ret void
}

declare void @g(i32, i32)
declare void @h(i32, i32)

code-gen:

push    {r3, r4, r5, r6, r7, lr}
str        r3, [sp]                        @ 4-byte Spill
mov     r5, r2
mov     r6, r1
mov     r7, r0
ldr        r4, .LCPI0_0
blx        r4
mov      r0, r7
mov      r1, r6
blx        r4
mov      r0, r7
mov      r1, r6
blx        r4
mov      r0, r5
ldr        r1, [sp]                        @ 4-byte Reload
bl          h
pop     {r3, r4, r5, r6, r7, pc}

In this case, it copies r2, r1, r0 into r5, r6, r7 respectively and uses r4 for function's address.
Since there is no register left to copy r3, it is spilled into memory.

However, I think I wrongly assumed that it could use one of r0-r3 (if function had less than 4 params) for holding function's address if r4 -r7 were not available.
So the condition below should probably be:
PreferIndirect = MF.getFunction().arg_size() + Outs.size() < 4 ?
(altho that also makes it more restrictive).

Btw, compiling with lowering to indirect call and without, result in same sized binaries for above test-case.
I wonder, if we want to disable the indirect call heuristic only if the register holding function's address gets spilled since it's repeatedly rematerialized before each call (similar to the original test-case) ? In which case, the approach in D79785 seems to be the only correct one.

efriedma added inline comments.Sep 11 2020, 2:35 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

However, I think I wrongly assumed that it could use one of r0-r3 (if function had less than 4 params) for holding function's address if r4 -r7 were not available.

Well, maybe I was a little imprecise. In general, we can use them for holding an indirect call address. But it's useless for the purpose of this optimization because it would get clobbered by the call.

If you're trying to gauge register pressure, anything related to the number of arguments isn't going to be effective: it isn't really correlated.

prathamesh added inline comments.Sep 13 2020, 9:02 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

Well, maybe I was a little imprecise. In general, we can use them for holding an indirect call address. But it's useless for the purpose of this optimization because it would get clobbered by the call.

Right, r0-r3 won't be usable for holding function's address in this case since they will be call clobbered.
I incorrectly assumed they would be and checked for nRegs - 1.

If you're trying to gauge register pressure, anything related to the number of arguments isn't going to be effective: it isn't really correlated.

Hmm, you're right. At this point, I am stumped for finding a heuristic to gauge register pressure in LowerCall that can cover all cases. Do you have any suggestions ?
Thanks!

efriedma added inline comments.Sep 14 2020, 12:02 PM
llvm/lib/Target/ARM/ARMISelLowering.cpp
2261

Maybe we should do this transform after isel? Much easier to reason about which operations are actually between the repeated calls at that point.