This is an archive of the discontinued LLVM Phabricator instance.

Differential D52716

[Inliner] Penalise inlining of calls with loops at Oz
ClosedPublic

Authored by dmgreen on Oct 1 2018, 4:01 AM.


Details

Reviewers

efriedma
chandlerc
fhahn
zzheng
eraman
javed.absar

Commits

rGba9f245b0df9: [Inliner] Penalise inlining of calls with loops at Oz
rL346134: [Inliner] Penalise inlining of calls with loops at Oz

Summary

We currently seem to underestimate the size of functions with loops in them, both in terms of absolute code size and in the difficulties of dealing with such code. (Functions, for example, can be tail merged to further reduce codesize). At -Oz, we can then increase code size by inlining small loops multiple times.

This attempts to penalise functions with loops at -Oz by adding a CallPenalty for each top-level loop in the function. It uses LI (and hence DT) to calculate the number of loops. As we are dealing with minsize, the inline threshold is small and functions at this point should be relatively small, making the construction of these cheap (although we really only care if/how many loops there are, so there may be a cheaper way to do that).
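A minimal sketch of the penalty described above, using a toy loop description rather than LLVM's LoopInfo. The names `ToyLoop` and `loopPenalty` are hypothetical, and the value 25 for CallPenalty is the figure quoted in the discussion further down, assumed here rather than read from the source tree.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a top-level loop; LLVM's LoopInfo is not used here.
struct ToyLoop {
  int nestedLoops; // loops nested inside this one (deliberately not penalised)
};

// Assumed value of InlineConstants::CallPenalty (25 is the figure quoted in the review).
constexpr int CallPenalty = 25;

// One CallPenalty per top-level loop, applied only when the caller is minsize.
int loopPenalty(const std::vector<ToyLoop> &topLevelLoops, bool callerIsMinSize) {
  if (!callerIsMinSize)
    return 0;
  return static_cast<int>(topLevelLoops.size()) * CallPenalty;
}
```

Under these assumptions a callee with two top-level loops is charged 50 regardless of how deeply either loop is nested, and nothing is charged when the caller is not minsize.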

Diff Detail

Repository: rL LLVM

Event Timeline

dmgreen created this revision. Oct 1 2018, 4:01 AM

Herald added a reviewer: javed.absar. Oct 1 2018, 4:01 AM

Herald added a subscriber: haicheng.

Before we start adding more heuristics, I'd like to make sure the existing heuristics are working correctly. In particular, for the given testcases, NumInstructionsSimplified seems way larger than it should be.

For loop-noinline.ll the "Simplified" instructions (the ones that cost nothing) appear to be:
br %body
3 x phis
one of the geps in the loop
the gep outside the loop
the ret

For the other testcase it's 8, again the 2 x unconditional branches, 3 x phis and 2 x geps + ret.

I guess we could argue the gep without a load isn't free, but I'm not sure why the other gep in the loop isn't free. In the second test case the two branches for the loop aren't free because they aren't fallthroughs, but that sounds tough to model.

Perhaps we could say that phis in loops have one "setup" cost? We would have to detect that the phi comes from a loop in that case.

We could probably improve our handling of GEPs; for example, a GEP used by a loop PHI node is probably not free.

> In the second test case the two branches for the loop aren't free because they aren't fallthroughs, but that sounds tough to model.

Probably not that hard to model: if a block has more than one unconditional branch predecessor, count each branch beyond the first. Or something like that; maybe we could also do something more fancy for conditional branches, to guess whether they lower to one or two branch instructions at the machine level.
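A rough sketch of the branch heuristic suggested above, over a toy CFG rather than LLVM IR. `ToyBranch`, `extraBranchCount` and `memcpyBranches` are hypothetical names; the edge list is transcribed by hand from @memcpy in loop-memcpy.ll, and conditional branches are simply ignored here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy CFG edge: block `from` ends in a branch to block `to`.
struct ToyBranch {
  std::string from;
  std::string to;
  bool unconditional;
};

// For every target block, count each unconditional-branch predecessor beyond
// the first, on the assumption that at most one of them can be a fallthrough.
int extraBranchCount(const std::vector<ToyBranch> &branches) {
  std::map<std::string, int> uncondPreds;
  for (const ToyBranch &b : branches)
    if (b.unconditional)
      ++uncondPreds[b.to];
  int extra = 0;
  for (const auto &entry : uncondPreds)
    if (entry.second > 1)
      extra += entry.second - 1;
  return extra;
}

// Branch shape of @memcpy in loop-memcpy.ll, transcribed by hand.
std::vector<ToyBranch> memcpyBranches() {
  return {
      {"entry", "while.cond", true},
      {"while.body", "while.cond", true},
      {"while.cond", "while.end", false},
      {"while.cond", "while.body", false},
  };
}
```

For that function, while.cond has two unconditional predecessors, so the sketch charges one extra branch.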

We could also try to improve our modeling of constants: we aren't assigning any cost to materializing @digits, or any cost to materializing "32" in @call2.

> Perhaps we could say that phis in loops have one "setup" cost? We would have to detect that the phi comes from a loop in that case.

The issue isn't really whether the PHI is in a loop; it has to do with the uses of the operand. A PHI will eventually be lowered to a copy in each predecessor. The copy can be eliminated by the register allocator in many cases, but not all cases; the distinguishing factor is whether the PHI's operand is live after the PHI.
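To make the liveness point concrete, here is a small sketch that counts the PHI operands still live after the PHI, as a crude estimate of the copies that cannot be coalesced away. `ToyIncoming`, `estimatedPhiCopies` and `entryOperandsOfH` are hypothetical names, and the liveness flags are filled in by hand from @H in loop-noinline.ll (%p and %num are reused by %scevgep in do.end, %val is not). What the register allocator actually does is not modelled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one incoming PHI operand.
struct ToyIncoming {
  std::string value;
  bool liveAfterPhi; // is the operand still needed once the PHI has read it?
};

// Estimate the copies that survive: one per operand that stays live after the PHI.
int estimatedPhiCopies(const std::vector<ToyIncoming> &incoming) {
  int copies = 0;
  for (const ToyIncoming &in : incoming)
    if (in.liveAfterPhi)
      ++copies;
  return copies;
}

// Entry-edge operands of the three PHIs in @H, liveness judged by hand.
std::vector<ToyIncoming> entryOperandsOfH() {
  return {{"%p", true}, {"%val", false}, {"%num", true}};
}
```

On this reading, two of the three entry-edge PHI operands in @H would need a real copy.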

Granted, just refusing to inline loops at -Oz probably does the right thing in most cases, but I'd like to see how far we can get by improving the cost model rather than artificially skewing it.

I've added a new memcpy test from the original reproducer. It's a byte memcpy (people seem to love writing those), which I think is worth focusing on because it's small, but still increases codesize. It expands to:

	b	.LBB1_2
  .LBB1_1:
	ldrb	r3, [r1], #1
	subs	r2, #1
	strb	r3, [r0], #1
  .LBB1_2:
	cmp	r2, #0
	bne	.LBB1_1
	bx	lr

I would say from this that it's hard to argue the geps aren't free (unless we count that they are 32-bit instructions? Do we do a lot of modelling of that?). They may be non-free on other architectures.

The LLVM IR is 13 instructions, 8 of which are simplified so 5 remain. Starting at a score of -45 (CallPenalty + Call + 3*Args), we end up at -20. We could add 1 for the branch, and another 2 (maybe) for the geps. That would get our score to -5, so we'd need another 2 instructions to cost something.

Or we'd need to not start out with such a negative score. In this case, I think the args make sense, the CallPenalty of 25 is a bit high maybe? As I understand it this number (for codesize) is essentially like changing our threshold from <=0 to <=25, and saying "more optimisations may happen, let's fudge it a bit". So changing it would mean less inlining in general (for minsize, as in it would affect more than just loops. It's a larger change.)
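The score arithmetic above can be checked with a few lines of C++. This is only a sketch of the numbers in this comment: `startingScore` and `score` are hypothetical helpers, CallPenalty = 25 is the value stated above, and the per-instruction and per-argument cost of 5 is inferred from the -45 and -20 figures rather than read from the source.

```cpp
#include <cassert>

// Values assumed from this discussion: CallPenalty is stated to be 25, and the
// -45 starting score implies a cost of 5 per instruction or argument.
constexpr int CallPenalty = 25;
constexpr int InstrCost = 5;

// Credit for removing the call: the penalty, the call instruction and its args.
constexpr int startingScore(int numArgs) {
  return -(CallPenalty + InstrCost + numArgs * InstrCost);
}

// Score once `costedInstructions` callee instructions are charged.
constexpr int score(int numArgs, int costedInstructions) {
  return startingScore(numArgs) + costedInstructions * InstrCost;
}
```

With three arguments, 5 costed instructions give -20, 8 give -5, and 10 give 5, which is the first point at which the score turns positive.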

> any cost to materializing "32" in @call2.

Yeah, that was what I was thinking of for the setup cost of the phis not being free. I see what you mean about operands remaining live after the phi. The same could be said for function arguments, right? They may be free if they can be put into the correct register, but won't be if they are also needed after the call.

> We could probably improve our handling of GEPs; for example, a GEP used by a loop PHI node is probably not free.

Exactly where would a gep be free? Just loads and stores and other geps? Perhaps memcpys if they are small enough? Or from other instructions in general?

I think that was a kind of long way of saying, if we are going to try to do this accurately, we should probably start at the CallPenalty. Let me know what you think, I'll try and get more numbers.

> unless we count that they are 32-bit instructions? Do we do a lot of modelling of that?

The inliner doesn't try to model that at all. It's not really that important outside of -Oz, anyway, and it's sort of a hard problem given that the size of an instruction depends on register allocation.

> Exactly where would a gep be free? Just loads and stores and other geps? Perhaps memcpys if they are small enough? Or from other instructions in general?

Something like that. IIRC the inliner is a bit more generous at the moment, on the assumption that some GEPs will eventually simplify.

> is essentially like changing our threshold from <=0 to <=25, and saying "more optimisations may happen, let's fudge it a bit"

In this context, yes. (There are other places that use CallPenalty, but they aren't really relevant here.) But changing it is effectively equivalent to just changing the inline threshold. If you think OptMinSizeThreshold is wrong, you can try modifying it (it doesn't necessarily have to be a positive number).
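The equivalence can be seen in a simplified model. The sketch below assumes the decision is "inline when body cost minus the call credit is below the threshold"; `wouldInline` and `sameDecision` are hypothetical helpers, and the real inliner has more terms than this.

```cpp
#include <cassert>

// Assumed decision rule for illustration: inline when the net cost is below
// the threshold. This is a simplification of the real cost computation.
constexpr bool wouldInline(int bodyCost, int callPenalty, int threshold) {
  return bodyCost - callPenalty < threshold;
}

// True when lowering CallPenalty by `delta` gives the same decision as
// lowering the threshold by `delta` instead.
constexpr bool sameDecision(int bodyCost, int callPenalty, int threshold,
                            int delta) {
  return wouldInline(bodyCost, callPenalty - delta, threshold) ==
         wouldInline(bodyCost, callPenalty, threshold - delta);
}
```

In this model the two adjustments are algebraically identical, which is why a negative threshold is a legitimate alternative to shrinking the penalty.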

> The same could be said for function arguments, right? They may be free if they can be put into the correct register, but won't be if they are also needed after the call.

Yes.

dmgreen mentioned this in D53405: [Inliner] Attempt to more accurately model the cost of loops at minsize. Oct 18 2018, 10:34 AM

Now ignores loops that will never be executed. I also have some code that uses SCEV to calculate if the backedge count is <= 1 and allow inlining there. It doesn't seem to come up very often though and needed some plumbing to get SE's/TLI's in the right places.

Eli, what do you think between this and D53405?

Friendly Ping

From the discussion here and on D53405, it seems like the pieces we aren't modeling are hard to model well, and it will be hard to get them all right. And I think you're right that it's unlikely to be profitable to inline loops. So LGTM

This revision is now accepted and ready to land. Nov 2 2018, 2:10 PM

Closed by commit rL346134: [Inliner] Penalise inlining of calls with loops at Oz (authored by dmgreen). Nov 5 2018, 6:56 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Analysis/InlineCost.cpp	20 lines
llvm/trunk/test/Transforms/Inline/ARM/loop-add.ll	95 lines
llvm/trunk/test/Transforms/Inline/ARM/loop-memcpy.ll	87 lines
llvm/trunk/test/Transforms/Inline/ARM/loop-noinline.ll	49 lines

Diff 172584

llvm/trunk/lib/Analysis/InlineCost.cpp

(17 lines not shown)
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/BlockFrequencyInfo.h"
 #include "llvm/Analysis/CodeMetrics.h"
 #include "llvm/Analysis/ConstantFolding.h"
 #include "llvm/Analysis/CFG.h"
 #include "llvm/Analysis/InstructionSimplify.h"
+#include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/ProfileSummaryInfo.h"
 #include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/Config/llvm-config.h"
 #include "llvm/IR/CallSite.h"
 #include "llvm/IR/CallingConv.h"
 #include "llvm/IR/DataLayout.h"
+#include "llvm/IR/Dominators.h"
 #include "llvm/IR/GetElementPtrTypeIterator.h"
 #include "llvm/IR/GlobalAlias.h"
 #include "llvm/IR/InstVisitor.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Operator.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/raw_ostream.h"

(1,839 lines not shown; enclosing function: InlineResult CallAnalyzer::analyzeCall(CallSite CS) {)
   bool OnlyOneCallAndLocalLinkage =
       F.hasLocalLinkage() && F.hasOneUse() && &F == CS.getCalledFunction();
   // If this is a noduplicate call, we can still inline as long as
   // inlining this would cause the removal of the caller (so the instruction
   // is not actually duplicated, just moved).
   if (!OnlyOneCallAndLocalLinkage && ContainsNoDuplicateCall)
     return "noduplicate";

+  // Loops generally act a lot like calls in that they act like barriers to
+  // movement, require a certain amount of setup, etc. So when optimising for
+  // size, we penalise any call sites that perform loops. We do this after all
+  // other costs here, so will likely only be dealing with relatively small
+  // functions (and hence DT and LI will hopefully be cheap).
+  if (Caller->optForMinSize()) {
+    DominatorTree DT(F);
+    LoopInfo LI(DT);
+    int NumLoops = 0;
+    for (Loop *L : LI) {
+      // Ignore loops that will not be executed
+      if (DeadBlocks.count(L->getHeader()))
+        continue;
+      NumLoops++;
+    }
+    Cost += NumLoops * InlineConstants::CallPenalty;
+  }
+
   // We applied the maximum possible vector bonus at the beginning. Now,
   // subtract the excess bonus, if any, from the Threshold before
   // comparing against Cost.
   if (NumVectorInstructions <= NumInstructions / 10)
     Threshold -= VectorBonus;
   else if (NumVectorInstructions <= NumInstructions / 2)
     Threshold -= VectorBonus/2;

(278 lines not shown)

llvm/trunk/test/Transforms/Inline/ARM/loop-add.ll

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; CHECK-LABEL: void @doCalls
				define void @doCalls(i8* nocapture %p1, i8* nocapture %p2, i32 %n) #0 {
				entry:
				%div = lshr i32 %n, 1
				; CHECK: call void @LoopCall
				tail call void @LoopCall(i8* %p1, i8* %p2, i32 %div) #0

				%div2 = lshr i32 %n, 2
				; CHECK: call void @LoopCall
				tail call void @LoopCall(i8* %p1, i8* %p2, i32 %div2) #0

				; CHECK-NOT: call void @LoopCall
				tail call void @LoopCall(i8* %p2, i8* %p1, i32 0) #0

				; CHECK-NOT: call void @LoopCall_internal
				tail call void @LoopCall_internal(i8* %p1, i8* %p2, i32 %div2) #0

				%div3 = lshr i32 %n, 4
				; CHECK-NOT: call void @SimpleCall
				tail call void @SimpleCall(i8* %p2, i8* %p1, i32 %div3) #0
				ret void
				}

				; CHECK-LABEL: define void @LoopCall
				define void @LoopCall(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%c = icmp ne i32 %num, 0
				br i1 %c, label %while.cond, label %while.end

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr2, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%1 = trunc i32 %num.addr.0 to i8
				%conv1 = add i8 %0, %1
				%incdec.ptr2 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %conv1, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond
				ret void
				}

				; CHECK-LABEL-NOT: define void @LoopCall_internal
				define internal void @LoopCall_internal(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%c = icmp ne i32 %num, 0
				br i1 %c, label %while.cond, label %while.end

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr2, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%1 = trunc i32 %num.addr.0 to i8
				%conv1 = add i8 %0, %1
				%incdec.ptr2 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %conv1, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond
				ret void
				}

				; CHECK-LABEL: define void @SimpleCall
				define void @SimpleCall(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				%arrayidx = getelementptr inbounds i8, i8* %source, i32 %num
				%0 = load i8, i8* %arrayidx, align 1
				%1 = xor i8 %0, 127
				%arrayidx2 = getelementptr inbounds i8, i8* %dest, i32 %num
				store i8 %1, i8* %arrayidx2, align 1
				ret void
				}

				attributes #0 = { minsize optsize }

llvm/trunk/test/Transforms/Inline/ARM/loop-memcpy.ll

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; CHECK-LABEL: define void @matcpy
				define void @matcpy(i8* %dest, i8* %source, i32 %num) #0 {
				entry:
				%0 = ptrtoint i8* %dest to i32
				%1 = ptrtoint i8* %source to i32
				%2 = xor i32 %0, %1
				%3 = and i32 %2, 3
				%cmp = icmp eq i32 %3, 0
				br i1 %cmp, label %if.then, label %if.else20

				if.then: ; preds = %entry
				%sub = sub i32 0, %0
				%and2 = and i32 %sub, 3
				%add = or i32 %and2, 4
				%cmp3 = icmp ugt i32 %add, %num
				br i1 %cmp3, label %if.else, label %if.then4

				if.then4: ; preds = %if.then
				%sub5 = sub i32 %num, %and2
				%shr = and i32 %sub5, -4
				%sub7 = sub i32 %sub5, %shr
				%tobool = icmp eq i32 %and2, 0
				br i1 %tobool, label %if.end, label %if.then8

				if.then8: ; preds = %if.then4
				; CHECK: call fastcc void @memcpy
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %and2) #0
				%add.ptr = getelementptr inbounds i8, i8* %dest, i32 %and2
				%add.ptr9 = getelementptr inbounds i8, i8* %source, i32 %and2
				br label %if.end

				if.end: ; preds = %if.then4, %if.then8
				%p_dest.0 = phi i8* [ %add.ptr, %if.then8 ], [ %dest, %if.then4 ]
				%p_source.0 = phi i8* [ %add.ptr9, %if.then8 ], [ %source, %if.then4 ]
				%tobool14 = icmp eq i32 %sub7, 0
				br i1 %tobool14, label %if.end22, label %if.then15

				if.then15: ; preds = %if.end
				%add.ptr13 = getelementptr inbounds i8, i8* %p_source.0, i32 %shr
				%add.ptr11 = getelementptr inbounds i8, i8* %p_dest.0, i32 %shr
				; CHECK: call fastcc void @memcpy
				call fastcc void @memcpy(i8* %add.ptr11, i8* %add.ptr13, i32 %sub7) #0
				br label %if.end22

				if.else: ; preds = %if.then
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %num) #0
				br label %if.end22

				if.else20: ; preds = %entry
				call fastcc void @memcpy(i8* %dest, i8* %source, i32 %num) #0
				br label %if.end22

				if.end22: ; preds = %if.then15, %if.end, %if.else, %if.else20
				ret void
				}

				; CHECK-LABEL: define internal void @memcpy
				define internal void @memcpy(i8* nocapture %dest, i8* nocapture readonly %source, i32 %num) #0 {
				entry:
				br label %while.cond

				while.cond: ; preds = %while.body, %entry
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %while.body ]
				%p_dest.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr1, %while.body ]
				%p_source.0 = phi i8* [ %source, %entry ], [ %incdec.ptr, %while.body ]
				%cmp = icmp eq i32 %num.addr.0, 0
				br i1 %cmp, label %while.end, label %while.body

				while.body: ; preds = %while.cond
				%incdec.ptr = getelementptr inbounds i8, i8* %p_source.0, i32 1
				%0 = load i8, i8* %p_source.0, align 1
				%incdec.ptr1 = getelementptr inbounds i8, i8* %p_dest.0, i32 1
				store i8 %0, i8* %p_dest.0, align 1
				%dec = add i32 %num.addr.0, -1
				br label %while.cond

				while.end: ; preds = %while.cond
				ret void
				}

				attributes #0 = { minsize optsize }

llvm/trunk/test/Transforms/Inline/ARM/loop-noinline.ll

				; RUN: opt -inline %s -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "thumbv7m-arm-none-eabi"

				; Check we don't inline loops at -Oz. They tend to be larger than we
				; expect.

				; CHECK: define i8* @H
				@digits = constant [16 x i8] c"0123456789ABCDEF", align 1
				define i8* @H(i8* %p, i32 %val, i32 %num) #0 {
				entry:
				br label %do.body

				do.body: ; preds = %do.body, %entry
				%p.addr.0 = phi i8* [ %p, %entry ], [ %incdec.ptr, %do.body ]
				%val.addr.0 = phi i32 [ %val, %entry ], [ %shl, %do.body ]
				%num.addr.0 = phi i32 [ %num, %entry ], [ %dec, %do.body ]
				%shr = lshr i32 %val.addr.0, 28
				%arrayidx = getelementptr inbounds [16 x i8], [16 x i8]* @digits, i32 0, i32 %shr
				%0 = load i8, i8* %arrayidx, align 1
				%incdec.ptr = getelementptr inbounds i8, i8* %p.addr.0, i32 1
				store i8 %0, i8* %p.addr.0, align 1
				%shl = shl i32 %val.addr.0, 4
				%dec = add i32 %num.addr.0, -1
				%tobool = icmp eq i32 %dec, 0
				br i1 %tobool, label %do.end, label %do.body

				do.end: ; preds = %do.body
				%scevgep = getelementptr i8, i8* %p, i32 %num
				ret i8* %scevgep
				}

				define nonnull i8* @call1(i8* %p, i32 %val, i32 %num) #0 {
				entry:
				; CHECK: tail call i8* @H
				%call = tail call i8* @H(i8* %p, i32 %val, i32 %num) #0
				ret i8* %call
				}

				define nonnull i8* @call2(i8* %p, i32 %val) #0 {
				entry:
				; CHECK: tail call i8* @H
				%call = tail call i8* @H(i8* %p, i32 %val, i32 32) #0
				ret i8* %call
				}

				attributes #0 = { minsize optsize }