This is an archive of the discontinued LLVM Phabricator instance.

Differential D24133

[ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM)
Closed (Public)

Authored by pbarrio on Sep 1 2016, 7:46 AM.

Details

Reviewers

scott-0
rengolin
compnerd
jmolloy

Commits

rGfc752bb70aa8: [ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM)
rL280808: [ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM)

Summary

This saves a library call to __aeabi_uidivmod. However, the
processor must feature hardware division in order to benefit from
the transformation.
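
The arithmetic behind the transformation can be sketched in a few lines of C++. This is only an illustration of the identity rem = a - b * (a / b) that UDIV/SDIV + MLS computes; the function names below are hypothetical and are not LLVM APIs or part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): unsigned remainder computed as
// a - b * (a / b), the shape that UDIV followed by MLS implements.
uint32_t urem_via_div(uint32_t a, uint32_t b) {
    assert(b != 0 && "divisor must be non-zero");
    uint32_t div = a / b;  // corresponds to UDIV
    return a - b * div;    // corresponds to MLS: a - (b * div)
}

// Signed variant. C++ integer division truncates toward zero, as SDIV does,
// so the identity gives the same result as the % operator.
// Assumption: the caller avoids the overflowing case INT32_MIN / -1.
int32_t srem_via_div(int32_t a, int32_t b) {
    assert(b != 0 && "divisor must be non-zero");
    int32_t div = a / b;   // corresponds to SDIV
    return a - b * div;    // corresponds to MLS
}
```

For example, `urem_via_div(1234567u, 1000000u)` yields 234567, and `srem_via_div(-7, 3)` yields -1, matching `-7 % 3` in C++.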

Diff Detail

Repository: rL LLVM

Event Timeline

pbarrio updated this revision to Diff 70009. Sep 1 2016, 7:46 AM

pbarrio retitled this revision from to [ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM).

pbarrio updated this object.

pbarrio added reviewers: rengolin, jmolloy, scott-0.

pbarrio added a subscriber: llvm-commits.

Herald added subscribers: samparker, rengolin, aemerson. Sep 1 2016, 7:46 AM

Nice!

lib/Target/ARM/ARMISelLowering.cpp
12109 (On Diff #70009): Why not inline this into the `Div`?

This revision is now accepted and ready to land. Sep 1 2016, 7:55 AM

This doesn't seem appropriate for -Oz (which happens to be set in the test you're actually modifying).

So, I think that's a worthy optimisation, but I worry about the multiple cycles that the selection DAG takes to get there.

The validation mechanism keeps cycling between divs and mods, merging and extending them until, quite by accident, things fall into shape. For instance, another patch to add "rt_div" reports that library call being called three times.

One thing we don't do, for example, is to merge divs into divmods that only need the mods, so you'll end up with a call to idiv + idivmod. In here, you'll end up with DIV+DIV+MUL(MOD) with the first two divs being identical.

The only comfort is that pure mods need the div anyway, so in the simple case, it's ok.

cheers,
--renato

lib/Target/ARM/ARMISelLowering.cpp
12109 (On Diff #70009): Because the merge of divs+mods happens elsewhere. I believe divs are already covered by instructions where possible, but not mods. However, mods fall back here, so this is the right place, I think.

test/CodeGen/ARM/urem-opt-size.ll
43 (On Diff #70009): Please add tests for signed/unsigned, mod and div+mod, div+div+mod and see if they merge (if not, leave a FIXME comment). Some of those tests will probably need to be in divmod-eabi.ll. Also, please make sure to include regexes for the registers, to make sure that the result from udiv gets passed correctly to the mul + add.

rengolin requested changes to this revision. Sep 1 2016, 12:20 PM

rengolin edited edge metadata.

This revision now requires changes to proceed. Sep 1 2016, 12:20 PM

Hi Tim,

Actually, this optimization can be good for Oz; in fact it's our primary target. One udiv and one mls is 8 bytes, which, OK, is 4 bytes more than the call, but it doesn't require any argument setup/teardown.

Also, we can avoid linking in uidivmod at all, which in small programs is *significant* in terms of code size reduction.

James

In D24133#531896, @jmolloy wrote:

Also, we can avoid linking in uidivmod at all, which in small programs is *significant* in terms of code size reduction.

That's true.

Division with all the speed optimisations and "code maintenance improvements" is a *very* large algorithm, and the routines can end up calling each other to save copy&paste. Just look at our compiler-rt implementation... :)

Improved testing:

New signed remainder test.
New sdiv + srem test.
Registers are given as regular expressions.

All tests merge to udiv/sdiv + mls.

I am not sure about adding a div+div+mod test, though. I have checked that
two identical divs + remainder merge just fine. However, is that an IR that
we can typically get? I would expect the two divs to be merged before getting
to the back-end.

I would expect the two divs to be merged before getting to the back-end.

I think the important thing is that those two divs will be merged in SDAG before your code even runs, so it is not possible to test.

James
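
As a rough illustration of why sdiv + srem on the same operands should need only one division (the shape the sdiv + srem test checks for), here is a small C++ sketch. The `DivRem` struct and both function names are hypothetical, chosen for this example; they are not LLVM code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration: quotient and remainder from ONE division.
struct DivRem {
    int32_t quot;
    int32_t rem;
};

DivRem divrem_shared(int32_t a, int32_t b) {
    assert(b != 0 && "divisor must be non-zero");
    int32_t quot = a / b;        // the only division (SDIV)
    int32_t rem = a - b * quot;  // multiply-subtract (MLS), no second division
    return {quot, rem};
}

// Same shape as a function that returns div + rem for a constant divisor.
int32_t div_plus_rem(int32_t value) {
    DivRem r = divrem_shared(value, 1000000);
    return r.quot + r.rem;
}
```

With `value = 2500000`, the quotient is 2 and the remainder is 500000, so `div_plus_rem` returns 500002 after a single division.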

Right, that makes sense. I didn't know that this sort of optimization was also undertaken in the selection DAG.

Renato, thank you for the review; the testing looks better now. Could you have a look at the new patch when you have a few minutes? Let me know if you still see some room for improvement.

rengolin added inline comments. Sep 7 2016, 2:48 AM

test/CodeGen/ARM/urem-opt-size.ll
56 (On Diff #70377): Can you also add the other CHECKs so that we're clear on what's the expected behaviour on all tested archs?

Additional checks added to the new tests.

LGTM, thanks!

This revision is now accepted and ready to land. Sep 7 2016, 3:51 AM

Closed by commit rL280808: [ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM) (authored by pabbar01). Sep 7 2016, 5:57 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer mentioned this in D25077: [ARM] Lower UDIV+UREM to UDIV+MLS (and the same for SREM). Sep 30 2016, 1:25 AM

SjoerdMeijer mentioned this in rL283098: [ARM] Code size optimisation to lower udiv+urem to udiv+mls instead of a. Oct 4 2016, 1:05 AM

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/ARM/ARMISelLowering.cpp    19 lines
llvm/trunk/test/CodeGen/ARM/urem-opt-size.ll     43 lines

Diff 70523

llvm/trunk/lib/Target/ARM/ARMISelLowering.cpp

[12,092 earlier lines not shown]
   assert((Subtarget->isTargetAEABI() || Subtarget->isTargetAndroid() ||
           Subtarget->isTargetGNUAEABI() || Subtarget->isTargetMuslAEABI()) &&
          "Register-based DivRem lowering only");
   unsigned Opcode = Op->getOpcode();
   assert((Opcode == ISD::SDIVREM || Opcode == ISD::UDIVREM) &&
          "Invalid opcode for Div/Rem lowering");
   bool isSigned = (Opcode == ISD::SDIVREM);
   EVT VT = Op->getValueType(0);
   Type *Ty = VT.getTypeForEVT(*DAG.getContext());
+  SDLoc dl(Op);
+
+  // If the target has hardware divide, use divide + multiply + subtract:
+  //     div = a / b
+  //     rem = a - b * div
+  //     return {div, rem}
+  // This should be lowered into UDIV/SDIV + MLS later on.
+  if (Subtarget->hasDivide()) {
+    unsigned DivOpcode = isSigned ? ISD::SDIV : ISD::UDIV;
+    const SDValue Dividend = Op->getOperand(0);
+    const SDValue Divisor = Op->getOperand(1);
+    SDValue Div = DAG.getNode(DivOpcode, dl, VT, Dividend, Divisor);
+    SDValue Mul = DAG.getNode(ISD::MUL, dl, VT, Div, Divisor);
+    SDValue Rem = DAG.getNode(ISD::SUB, dl, VT, Dividend, Mul);
+
+    SDValue Values[2] = {Div, Rem};
+    return DAG.getNode(ISD::MERGE_VALUES, dl, DAG.getVTList(VT, VT), Values);
+  }
+
   RTLIB::Libcall LC = getDivRemLibcall(Op.getNode(),
                                        VT.getSimpleVT().SimpleTy);
   SDValue InChain = DAG.getEntryNode();

   TargetLowering::ArgListTy Args = getDivRemArgList(Op.getNode(),
                                                     DAG.getContext());

   SDValue Callee = DAG.getExternalSymbol(getLibcallName(LC),
                                          getPointerTy(DAG.getDataLayout()));

   Type *RetTy = (Type*)StructType::get(Ty, Ty, nullptr);

-  SDLoc dl(Op);
   TargetLowering::CallLoweringInfo CLI(DAG);
   CLI.setDebugLoc(dl).setChain(InChain)
     .setCallee(getLibcallCallingConv(LC), RetTy, Callee, std::move(Args))
     .setInRegister().setSExtResult(isSigned).setZExtResult(!isSigned);

   std::pair<SDValue, SDValue> CallInfo = LowerCallTo(CLI);
   return CallInfo.first;
 }
[787 later lines not shown]

llvm/trunk/test/CodeGen/ARM/urem-opt-size.ll

 ; When optimising for minimum size, we don't want to expand a div to a mul
 ; and a shift sequence. As a result, the urem instruction e.g. will not be
 ; expanded to a sequence of umull, lsrs, muls and sub instructions, but
 ; just a call to __aeabi_uidivmod.
 ;
+; When the processor features hardware division, UDIV + UREM can be turned
+; into UDIV + MLS. This prevents the library function __aeabi_uidivmod to be
+; pulled into the binary. The test uses ARMv7-M.
+;
 ; RUN: llc -mtriple=armv7a-eabi -mattr=-neon -verify-machineinstrs %s -o - | FileCheck %s
+; RUN: llc -mtriple=thumbv7m-eabi -verify-machineinstrs %s -o - | FileCheck %s -check-prefix=V7M

 target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
 target triple = "thumbv7m-arm-none-eabi"

 define i32 @foo1() local_unnamed_addr #0 {
 entry:
 ; CHECK-LABEL: foo1:
 ; CHECK:__aeabi_idiv
 ; CHECK-NOT: smmul
   %call = tail call i32 bitcast (i32 (...)* @GetValue to i32 ()*)()
   %div = sdiv i32 %call, 1000000
   ret i32 %div
 }

 define i32 @foo2() local_unnamed_addr #0 {
 entry:
 ; CHECK-LABEL: foo2:
 ; CHECK: __aeabi_uidiv
 ; CHECK-NOT: umull
   %call = tail call i32 bitcast (i32 (...)* @GetValue to i32 ()*)()
   %div = udiv i32 %call, 1000000
   ret i32 %div
 }

+; Test for unsigned remainder
 define i32 @foo3() local_unnamed_addr #0 {
 entry:
 ; CHECK-LABEL: foo3:
 ; CHECK: __aeabi_uidivmod
 ; CHECK-NOT: umull
+; V7M-LABEL: foo3:
+; V7M: udiv [[R2:r[0-9]+]], [[R0:r[0-9]+]], [[R1:r[0-9]+]]
+; V7M: mls {{r[0-9]+}}, [[R2]], [[R1]], [[R0]]
+; V7M-NOT: __aeabi_uidivmod
   %call = tail call i32 bitcast (i32 (...)* @GetValue to i32 ()*)()
   %rem = urem i32 %call, 1000000
   %cmp = icmp eq i32 %rem, 0
   %conv = zext i1 %cmp to i32
   ret i32 %conv
 }

+; Test for signed remainder
+define i32 @foo4() local_unnamed_addr #0 {
+entry:
+; CHECK-LABEL: foo4:
+; CHECK:__aeabi_idivmod
+; V7M-LABEL: foo4:
+; V7M: sdiv [[R2:r[0-9]+]], [[R0:r[0-9]+]], [[R1:r[0-9]+]]
+; V7M: mls {{r[0-9]+}}, [[R2]], [[R1]], [[R0]]
+; V7M-NOT: __aeabi_idivmod
+  %call = tail call i32 bitcast (i32 (...)* @GetValue to i32 ()*)()
+  %rem = srem i32 %call, 1000000
+  ret i32 %rem
+}
+
+; Check that doing a sdiv+srem has the same effect as only the srem,
+; as the division needs to be computed anyway in order to calculate
+; the remainder (i.e. make sure we don't end up with two divisions).
+define i32 @foo5() local_unnamed_addr #0 {
+entry:
+; CHECK-LABEL: foo5:
+; CHECK:__aeabi_idivmod
+; V7M-LABEL: foo5:
+; V7M: sdiv [[R2:r[0-9]+]], [[R0:r[0-9]+]], [[R1:r[0-9]+]]
+; V7M-NOT: sdiv
+; V7M: mls {{r[0-9]+}}, [[R2]], [[R1]], [[R0]]
+; V7M-NOT: __aeabi_idivmod
+  %call = tail call i32 bitcast (i32 (...)* @GetValue to i32 ()*)()
+  %div = sdiv i32 %call, 1000000
+  %rem = srem i32 %call, 1000000
+  %add = add i32 %div, %rem
+  ret i32 %add
+}
+
 declare i32 @GetValue(...) local_unnamed_addr

 attributes #0 = { minsize nounwind optsize }