This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add heuristic to avoid lowering calls to blx for Thumb1 in ARMTargetLowering::LowerCall
Needs Review · Public

Authored by prathamesh on Sep 9 2020, 8:54 PM.


Details

Reviewers

dmgreen
efriedma

Summary

Hi,
This is a follow-up to https://reviews.llvm.org/D79785.
For Thumb1, this patch keeps the indirect (blx) lowering only if MF.getFunction().arg_size() + Outs.size() < (number of registers) - 1, since at least one register is needed to hold the function's address; otherwise the call is lowered to bl. It converts all calls to bl in the attached test case. However, it may not detect cases where more than one register is needed to compute the arguments. The approach in D79785 can catch some of those by folding tLDRpci, tBLXr -> tBL.
Does this patch look reasonable?
Testing with make check-llvm with -DLLVM_ENABLE_EXPENSIVE_CHECKS=ON shows no unexpected failures.
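
To make the condition concrete, here is a minimal stand-alone sketch of the predicate, not the LLVM code itself. The name keepIndirectCall and its parameters are hypothetical: callerArgs stands in for MF.getFunction().arg_size(), outgoing for Outs.size(), and numRegs defaults to the value 7 that the patch hardcodes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-alone model of the patch's Thumb1 check; not LLVM code.
// callerArgs models MF.getFunction().arg_size(), outgoing models Outs.size().
constexpr bool keepIndirectCall(std::size_t callerArgs, std::size_t outgoing,
                                unsigned numRegs = 7) {
  // One register is set aside for the callee's address, as in the patch.
  return callerArgs + outgoing < numRegs - 1;
}

// Small usage example covering the attached test and a tiny caller.
inline void checkHeuristicExamples() {
  // Attached test: f takes 5 arguments and passes 5 to g; 10 < 6 is false.
  assert(!keepIndirectCall(5, 5));
  // One argument in, one argument out; 2 < 6 is true.
  assert(keepIndirectCall(1, 1));
}
```

For the attached test, f takes five arguments and passes five to g, so the condition is false and the calls are lowered to bl.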

I have a couple of questions:
(a) How do we get the number of available registers for the subtarget in LowerCall?
(b) I assume Outs.size() corresponds to the number of arguments passed to the callee?
TargetLowering.h has the following comment above LowerCall():

/// The outgoing arguments to the call are described by the Outs array,
/// and the values to be returned by the call are described by the Ins
/// array.

Thanks!

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

prathamesh created this revision. · Sep 9 2020, 8:54 PM

Herald added a project: Restricted Project. · Sep 9 2020, 8:54 PM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.

prathamesh requested review of this revision. · Sep 9 2020, 8:54 PM

Harbormaster completed remote builds in B71178: Diff 290866. · Sep 9 2020, 9:28 PM

dmgreen added inline comments. · Sep 10 2020, 12:20 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261: Hmm. Not sure. Perhaps this can use something like getRegClassFor(MVT::i32)->getNumRegs()? It's still probably a very rough estimate of allocatable registers.
llvm/test/CodeGen/ARM/minsize-call-cse-2.ll
Line 11: Can you add more tests of various sizes?

dmgreen added a reviewer: efriedma. · Sep 10 2020, 12:20 PM

efriedma added inline comments. · Sep 10 2020, 2:15 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261: I don't understand the intent here. In Thumb1 mode, there are four callee-save registers that are considered allocatable: r4-r7. We must use one of them to store the address of an indirect call. (We could potentially use high registers, but that isn't implemented.) None of them are ever used to pass arguments. Given that, why does the number of arguments to the function matter? Why does the number of caller-save registers matter?

prathamesh added inline comments. · Sep 11 2020, 5:03 AM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261: IIUC, r0-r3 are caller-saved, so before any call their values need to be copied into the remaining registers or saved to memory. For example:

    define void @f(i32 %x, i32 %y, i32 %z, i32 %w) optsize minsize {
    entry:
      call void @g(i32 %x, i32 %y)
      call void @g(i32 %x, i32 %y)
      call void @g(i32 %x, i32 %y)
      call void @h(i32 %z, i32 %w)
      ret void
    }

    declare void @g(i32, i32)
    declare void @h(i32, i32)

code-gen:

    push {r3, r4, r5, r6, r7, lr}
    str r3, [sp] @ 4-byte Spill
    mov r5, r2
    mov r6, r1
    mov r7, r0
    ldr r4, .LCPI0_0
    blx r4
    mov r0, r7
    mov r1, r6
    blx r4
    mov r0, r7
    mov r1, r6
    blx r4
    mov r0, r5
    ldr r1, [sp] @ 4-byte Reload
    bl h
    pop {r3, r4, r5, r6, r7, pc}

In this case, r2, r1 and r0 are copied into r5, r6 and r7 respectively, and r4 holds the function's address. Since no register is left for r3, it is spilled to memory.

However, I think I wrongly assumed that one of r0-r3 could hold the function's address (if the function had fewer than 4 params) when r4-r7 were not available. So the condition below should probably be: PreferIndirect = MF.getFunction().arg_size() + Outs.size() < 4? (Although that also makes it more restrictive.)

Btw, compiling the above test case with and without lowering to an indirect call produces binaries of the same size. I wonder whether we want to disable the indirect-call heuristic only if the register holding the function's address gets spilled, since the address is then rematerialized before each call (as in the original test case)? In that case, the approach in D79785 seems to be the only correct one.
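
As a rough cross-check of the size observation, the sketch below counts only call-related bytes, assuming the usual Thumb1 encodings: bl is 4 bytes, blx <reg> and ldr <reg>, [pc, #imm] are 2 bytes each, and a literal pool entry is 4 bytes. The function names are hypothetical, and literal pool alignment padding and any copy or spill code are deliberately ignored.

```cpp
#include <cassert>

// Assumed Thumb1 sizes in bytes (see the lead-in for what is ignored).
constexpr unsigned BlBytes = 4;       // bl <label>
constexpr unsigned BlxRegBytes = 2;   // blx <reg>
constexpr unsigned LdrPcBytes = 2;    // ldr <reg>, [pc, #imm]
constexpr unsigned LiteralBytes = 4;  // literal pool entry with the address

// Bytes spent on `calls` direct calls to the same function.
constexpr unsigned directCallBytes(unsigned calls) { return calls * BlBytes; }

// Bytes spent when the address is loaded once and reused by every call.
constexpr unsigned indirectCallBytes(unsigned calls) {
  return LdrPcBytes + LiteralBytes + calls * BlxRegBytes;
}

// Small usage example: three calls tie, four calls favour blx.
inline void checkCallSizes() {
  assert(directCallBytes(3) == indirectCallBytes(3));
  assert(indirectCallBytes(4) < directCallBytes(4));
}
```

Under these assumptions, three calls cost 12 bytes either way, which is consistent with the same-sized binaries reported above. The indirect form only starts to win at four calls, and only if a register stays free for the address.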

efriedma added inline comments. · Sep 11 2020, 2:35 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261:
> However, I think I wrongly assumed that it could use one of r0-r3 (if function had less than 4 params) for holding function's address if r4 -r7 were not available.

Well, maybe I was a little imprecise. In general, we can use them for holding an indirect call address. But it's useless for the purpose of this optimization because it would get clobbered by the call.

If you're trying to gauge register pressure, anything related to the number of arguments isn't going to be effective: it isn't really correlated.

prathamesh added inline comments. · Sep 13 2020, 9:02 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261:
> Well, maybe I was a little imprecise. In general, we can use them for holding an indirect call address. But it's useless for the purpose of this optimization because it would get clobbered by the call.

Right, r0-r3 won't be usable for holding the function's address in this case since they are call-clobbered. I incorrectly assumed they would be and checked for nRegs - 1.

> If you're trying to gauge register pressure, anything related to the number of arguments isn't going to be effective: it isn't really correlated.

Hmm, you're right. At this point, I am stumped on finding a heuristic to gauge register pressure in LowerCall that covers all cases. Do you have any suggestions? Thanks!

efriedma added inline comments. · Sep 14 2020, 12:02 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
Line 2261: Maybe we should do this transform after isel? Much easier to reason about which operations are actually between the repeated calls at that point.
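
One way to picture a post-isel version of this transform is the toy pass below. It is only a sketch over a hypothetical, simplified instruction list: Inst, Op, isReadLater and foldReloadedCalls are made-up names, not LLVM's MachineInstr API. It assumes that only the modeled address loads and register calls touch the address register, and it folds a load only when it sits directly before its call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record; not LLVM's MachineInstr.
enum class Op { LoadAddr, CallReg, CallDirect, Other };

struct Inst {
  Op Kind;
  int Reg;             // register written (LoadAddr) or read (CallReg), else -1
  std::string Callee;  // symbol loaded or called; empty otherwise
};

// True if a later CallReg reads Reg before another LoadAddr redefines it.
bool isReadLater(const std::vector<Inst> &Code, std::size_t From, int Reg) {
  for (std::size_t I = From; I < Code.size(); ++I) {
    if (Code[I].Reg != Reg)
      continue;
    if (Code[I].Kind == Op::CallReg)
      return true;
    if (Code[I].Kind == Op::LoadAddr)
      return false;
  }
  return false;
}

// Turn "load address of F into rN; call rN" into "call F" when the loaded
// address serves that one call only, i.e. it was reloaded just for it.
std::vector<Inst> foldReloadedCalls(const std::vector<Inst> &In) {
  std::vector<Inst> Out;
  for (std::size_t I = 0; I < In.size(); ++I) {
    bool IsPair = In[I].Kind == Op::LoadAddr && I + 1 < In.size() &&
                  In[I + 1].Kind == Op::CallReg && In[I + 1].Reg == In[I].Reg;
    if (IsPair && !isReadLater(In, I + 2, In[I].Reg)) {
      Out.push_back({Op::CallDirect, -1, In[I].Callee});
      ++I;  // the register call was folded into the direct call
    } else {
      Out.push_back(In[I]);
    }
  }
  return Out;
}
```

In this model, a sequence that reloads the address before every call collapses to direct calls, while a single load shared by several register calls is left alone, since that is the case where the indirect form can pay off.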

alanphipps added a subscriber: alanphipps. · Oct 23 2020, 2:32 PM

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMISelLowering.cpp	9 lines
llvm/test/CodeGen/ARM/minsize-call-cse-2.ll	20 lines

Diff 290866

llvm/lib/Target/ARM/ARMISelLowering.cpp


     if (isa<GlobalAddressSDNode>(Callee)) {
       auto *GV = cast<GlobalAddressSDNode>(Callee)->getGlobal();
       if (CLI.CB) {
         auto *BB = CLI.CB->getParent();
         PreferIndirect = Subtarget->isThumb() && Subtarget->hasMinSize() &&
                          count_if(GV->users(), [&BB](const User *U) {
                            return isa<Instruction>(U) &&
                                   cast<Instruction>(U)->getParent() == BB;
                          }) > 2;
+
+        // FIXME: How to obtain number of available registers ?
+        // Hardcoded for now.
+        unsigned nRegs = 7;
+
+        // Check that there is at least one register available for holding
+        // function's address
+        if (PreferIndirect && Subtarget->isThumb1Only())
+          PreferIndirect = MF.getFunction().arg_size() + Outs.size() < nRegs - 1;
       }
     }
     if (isTailCall) {
       // Check if it's really possible to do a tail call.
       isTailCall = IsEligibleForTailCallOptimization(
           Callee, CallConv, isVarArg, isStructRet,
           MF.getFunction().hasStructRetAttr(), Outs, OutVals, Ins, DAG,
           PreferIndirect);

Lint (pre-merge checks): clang-tidy: warning: invalid case style for variable 'nRegs' [readability-identifier-naming]

llvm/test/CodeGen/ARM/minsize-call-cse-2.ll

This file was added.

; RUN: llc < %s | FileCheck %s

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv6m-arm-none-eabi"

; CHECK-LABEL: f:
; CHECK: bl g
; CHECK: bl g
; CHECK: bl g
; CHECK: bl g
define void @f(i32* %p, i32 %x, i32 %y, i32 %z, i32 %a) optsize minsize {
entry:
  call void @g(i32* %p, i32 %x, i32 %y, i32 %z, i32 %a)
  call void @g(i32* %p, i32 %x, i32 %y, i32 %z, i32 %a)
  call void @g(i32* %p, i32 %x, i32 %y, i32 %z, i32 %a)
  call void @g(i32* %p, i32 %x, i32 %y, i32 %z, i32 %a)
  ret void
}

declare void @g(i32*,i32,i32,i32,i32)