This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost, -Oz] Don't take into account the penalty of a fast call of frequently used functions
Needs ReviewPublic

Authored by eastig on Mar 6 2017, 7:04 AM.

Details

Reviewers
chandlerc
Summary

This patch introduces an inline heuristic to be used whilst the cost of inlining is calculated for calls at '-Oz' optimization level.

When the cost of inlining is calculated for a call which has the fast calling convention and it is a call of a frequently used function, the penalty of the call is not taken into account. By default a function is frequently used if a number of its users is greater than 3. This value can be change via the command-line option '-inline-freq-func-threshold'. The heuristic allows inlining of very small functions, like a few instructions.

The rationale for the heuristic:

  1. Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers.
  2. Inlining of frequently used functions increases the code size.
  3. It's based on real examples of code for microcontrollers (v6-M). We see an improvement of the code size of the examples ~6.5%.

Benchmark results of the LNT testsuite:

Code size of only three benchmarks were affected by the heuristic.

x86 (i7-4770):

Code size improvementPerf regression
MultiSource/Applications/SPASS/SPASS6.67%~0%
MultiSource/Benchmarks/FreeBench/distray/distray2.46%~0%
MultiSource/Applications/sqlite3/sqlite31.34%~0%

v7-A (Cortex-A53, Thumb)

Code size improvementPerf regression
MultiSource/Applications/SPASS/SPASS4.93%~3%
MultiSource/Applications/sqlite3/sqlite31.12%~1.5%

AArch64 (Cortex-A57)

Code size improvementPerf regression
MultiSource/Applications/SPASS/SPASS3.95%~3%
MultiSource/Benchmarks/FreeBench/distray/distray1.17%~0%
MultiSource/Applications/sqlite3/sqlite31.07%~0%

Event Timeline

eastig created this revision.Mar 6 2017, 7:04 AM

Special-casing calls marked "fast" doesn't really make sense in general. For example, on ARM v6-M, "fast" is exactly the same as the C calling convention.

That said, adding an inliner special-case to inline internal functions with two or three callers more aggressively might make sense, similar to the existing bonus for inlining internal functions with exactly one caller. (This might have a similar effect to your patch because calls marked "fast" generally are calling functions with internal linkage.)

Err, to be more clear, I should say "fastcc" is the same as the C calling convention. (Calls aren't marked fast, they have calling convention fastcc.)

Gah, I should read more carefully before I post. I meant to say, on ARM v6-M, fastcc is exactly the same as the C calling convention.

Special-casing calls marked "fast" doesn't really make sense in general. For example, on ARM v6-M, "fast" is exactly the same as the C calling convention.

More precise, the code from GlobalOpt:

static bool isProfitableToMakeFastCC(Function *F) {
  CallingConv::ID CC = F->getCallingConv();
  // FIXME: Is it worth transforming x86_stdcallcc and x86_fastcallcc?
  return CC == CallingConv::C || CC == CallingConv::X86_ThisCall;
}

You are mixing up the idea of fast calls and what is marked as a fast call. According to the http://llvm.org/docs/LangRef.html#calling-conventions:

“fastcc” - The fast calling convention
This calling convention attempts to make calls as fast as possible (e.g. by passing things in registers). This calling convention allows the target to use whatever tricks it wants to produce fast code for the target, without having to conform to an externally specified ABI (Application Binary Interface). Tail calls can only be optimized when this, the GHC or the HiPE convention is used. This calling convention does not support varargs and requires the prototype of all callees to exactly match the prototype of the function definition.

Why not to use this information? Based on this we can expect fast calls will have minimized call overhead.

That said, adding an inliner special-case to inline internal functions with two or three callers more aggressively might make sense, similar to the existing bonus for inlining internal functions with exactly one caller. (This might have a similar effect to your patch because calls marked "fast" generally are calling functions with internal linkage.)

If a call is not marked as a call with a low call overhead we should not assume this.
Also we can mark calls as fastcc based on additional heuristics or target knowledge.

eastig added a comment.Mar 6 2017, 1:00 PM

Special-casing calls marked "fast" doesn't really make sense in general.

If it makes no sense why functions are marked as fastcc so aggressively. I think not all calls of them can actually be made as fastcc.

Based on this we can expect fast calls will have minimized call overhead.

That's true, in some sense... but that's also true for the C calling convention in most cases.

If it makes no sense why functions are marked as fastcc so aggressively. I think not all calls of them can actually be made as fastcc.

It's better to think of fastcc as the default calling convention for everything where the LLVM optimizer controls the calling convention. It passes values in registers where it makes sense, and tries to strike a reasonable balance between caller-save and callee-save registers. We use it over the C calling convention where we can so that we aren't stuck doing stupid things, like passing everything on the stack on x86-32, or shuffling floating-point values between integer and VFP registers on ARM with a softfp abi. It doesn't really indicate anything about the absolute overhead of a call.

lib/Analysis/InlineCost.cpp
1219

getNumUses() isn't going to return anything useful unless the function has internal linkage; even if a function only has one or two uses in the current file, it's likely to have more uses in other files the compiler can't see.

chandlerc edited edge metadata.Mar 6 2017, 1:39 PM

Based on this we can expect fast calls will have minimized call overhead.

That's true, in some sense... but that's also true for the C calling convention in most cases.

If it makes no sense why functions are marked as fastcc so aggressively. I think not all calls of them can actually be made as fastcc.

It's better to think of fastcc as the default calling convention for everything where the LLVM optimizer controls the calling convention. It passes values in registers where it makes sense, and tries to strike a reasonable balance between caller-save and callee-save registers. We use it over the C calling convention where we can so that we aren't stuck doing stupid things, like passing everything on the stack on x86-32, or shuffling floating-point values between integer and VFP registers on ARM with a softfp abi. It doesn't really indicate anything about the absolute overhead of a call.

Agreed.

I suspect that the underlying problem is some combination of imprecise modeling of call overhead and failing to account for low numbers of call sites that aren't exactly one.

I suspect that there would be more general (and still quite easy) fixes for the call overhead modeling, and some tuning of heuristics around the number of call sites seems like a good idea much as Eli suggested.

eastig added a comment.Mar 7 2017, 1:28 AM

Hi Eli,

Thank you for explanation. You are right. Assuming the model you've described checking fastcc makes no sense.
I'll replace it with hasLocalLinkage. I'll also do runs with FrequentFuncThreshold==2 and FrequentFuncThreshold==3.

Thanks,
Evgeny

eastig updated this revision to Diff 90941.Mar 7 2017, 2:21 PM
eastig edited the summary of this revision. (Show Details)

Updated according to Eli's comments:

  • The heuristic is changed to check internal functions with 3+ callers.
eastig added a comment.Mar 7 2017, 2:41 PM

LNT data I've got at the moment:

  • x86

FrequentFuncThreshold: 2+ callers

Code size changeExe time
MultiSource/Applications/SPASS/SPASS-6.21%+0..+1%
MultiSource/Benchmarks/FreeBench/distray/distray-2.79%0%
SingleSource/Benchmarks/Misc/perlin-2.32%0%
SingleSource/Benchmarks/BenchmarkGame/Large/fasta-1.75%0%
MultiSource/Applications/sqlite3/sqlite3-1.09%+0..1%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake+1.54%0%
MultiSource/Benchmarks/Olden/tsp/tsp+1.16%0%

FrequentFuncThreshold: 3+ callers

Code size changeExe time
MultiSource/Applications/SPASS/SPASS-6.29%+0..+1%
MultiSource/Benchmarks/FreeBench/distray/distray-2.79%0%
MultiSource/Applications/sqlite3/sqlite3-1.27%+0..1%
  • AArch64

FrequentFuncThreshold: 2+ callers

Code size changeExe time
MultiSource/Benchmarks/Olden/perimeter/perimeter-4.0%+5%
MultiSource/Applications/SPASS/SPASS-3.97%+4%
MultiSource/Benchmarks/FreeBench/distray/distray-1.17%0%
MultiSource/Applications/sqlite3/sqlite3-1.05%+2%

FrequentFuncThreshold: 3+ callers

Code size changeExe time
MultiSource/Applications/SPASS/SPASS-4.04%+4%
MultiSource/Benchmarks/Olden/perimeter/perimeter-4.0%4%
MultiSource/Benchmarks/FreeBench/distray/distray-1.17%0%
MultiSource/Applications/sqlite3/sqlite3-1.11%+3%
eastig updated this revision to Diff 90945.Mar 7 2017, 2:45 PM

Fixed code comments

efriedma added inline comments.Mar 7 2017, 3:54 PM
lib/Analysis/InlineCost.cpp
1216

This comparison seems backwards... generally, we assume non-local functions have many callers, and therefore treat them the same way as local functions with many callers.

eastig added inline comments.Mar 8 2017, 10:45 AM
lib/Analysis/InlineCost.cpp
1216

If I understand you correctly you mean this:

if (!Callee || !Callee->hasLocalStorage())
  return true;

Do I?

efriedma added inline comments.Mar 8 2017, 10:52 AM
lib/Analysis/InlineCost.cpp
1216

Yes.

eastig updated this revision to Diff 91095.Mar 8 2017, 3:50 PM

Updated according Eli's comment.

Probably needs new benchmarking numbers with that change.

The functionality mostly makes sense now, but the comments/naming are really confusing. "Frequent" makes it sound like a check based on runtime profiling check rather than the absolute number of callers. And there isn't any comment explaining why we want to more aggressively inline local functions with 2 callers.

lib/Analysis/InlineCost.cpp
1296

You might want give a bonus to functions with few callers rather than a penalty to functions with many callers; as-is, you're basically decreasing the default inline threshold from 5 to -20.

MatzeB added a subscriber: MatzeB.Mar 8 2017, 4:03 PM

"Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers."

BTW: In terms of code size this statement isn't convincing to me as calls still impose constraints on the register allocator so we end up producing moves: Parameters need to go in specific registers, results come out in specific registers and live-through values have to be moved to callee-saved registers.

eastig added a comment.Mar 9 2017, 9:06 AM

"Fast calls are usually optimized to minimize the call penalty, e.g. parameters are passed through registers."

BTW: In terms of code size this statement isn't convincing to me as calls still impose constraints on the register allocator so we end up producing moves: Parameters need to go in specific registers, results come out in specific registers and live-through values have to be moved to callee-saved registers.

If this was an often case we would see code size increases in LNT.

eastig added inline comments.Mar 9 2017, 2:59 PM
lib/Analysis/InlineCost.cpp
1296

As I can see Cost = Cost_we_cannot_remove - Cost_we_can_remove; where Cost_we_can_remove = Call + Ret + Arguments + Call_Penalty.
Could you please explain me the values of the threshold? Why is it 5 for Oz? Maybe, it's simpler to adjust the threshold than to add my code.

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

I ran LNT for the patch. There are big code-size regressions in benchmarks written in C++. So the idea to treat functions with the non-local linkage as having many users needs some research.

OptMinSizeThreshold is basically set experimentally; http://reviews.llvm.org/rL288024 is the last time it was adjusted.

I don't think anyone's looked at CallPenalty in a long time; it's possible it's too large at -Oz.

Hi Eli,
What do think about an idea to add a knob for CallPenalty? Playing with it and the threshold knob in the driver can do what I want to do with the patch.

I've found an interesting thing related to the cost of getelementptr:

while.cond:                                       ; preds = %while.body, %entry
  %dest.addr.0 = phi i8* [ %dest, %entry ], [ %incdec.ptr1, %while.body ]
  %src.addr.0 = phi i8* [ %src, %entry ], [ %incdec.ptr, %while.body ]
  %tobool = icmp eq i32 %size.addr.0, 0
  br i1 %tobool, label %while.end, label %while.body

while.body:                                       ; preds = %while.cond
  %dec = add nsw i32 %size.addr.0, -1
  %incdec.ptr = getelementptr inbounds i8, i8* %src.addr.0, i32 1
  %0 = load i8, i8* %src.addr.0, align 1, !tbaa !12
  %incdec.ptr1 = getelementptr inbounds i8, i8* %dest.addr.0, i32 1
  store i8 %0, i8* %dest.addr.0, align 1, !tbaa !12
  br label %while.cond

while.end:                                        ; preds = %while.cond

Getelementptr instructions have inter-iteration dependencies. CallAnalyzer::visitGetElementPtr returns true for them which means the instructions can be simplified. As a result they don't contribute into the cost of inlining. The instructions are not simplified and are lowered to cost-having instructions.
In fact instructions having inter-iteration dependencies are not so easy to simplify.
Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

I ran LNT for the patch. There are big code-size regressions in benchmarks written in C++. So the idea to treat functions with the non-local linkage as having many users needs some research.

Hmm. Maybe you're just turning the inline threshold too low to eliminate C++ abstractions. Or maybe in C++, the average number of callers for a linkonce function is lower than we might expect. Not sure.

What do think about an idea to add a knob for CallPenalty? Playing with it and the threshold knob in the driver can do what I want to do with the patch.

Makes sense, at least to make it easier to experiment with it.

Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

The problem isn't really inter-iteration dependencies, exactly... it's that the computation which declares a GEP "free" (see getGEPCost in TargetTransformInfoImpl.h) isn't paying attention to the instructions which use the GEP. A GEP can only be free if we can fold the addressing mode into a load or store instruction. Not sure how relevant that is in practice, though.

Should we have a check of inter-iteration dependencies in CallAnalyzer::visitGetElementPtr?

The problem isn't really inter-iteration dependencies, exactly... it's that the computation which declares a GEP "free" (see getGEPCost in TargetTransformInfoImpl.h) isn't paying attention to the instructions which use the GEP. A GEP can only be free if we can fold the addressing mode into a load or store instruction.

'getGEPCost' is not called. Execution ends here in InlineCost.cpp because an offset is a constant:

439	  if (IsGEPOffsetConstant(I)) {
440	    if (SROACandidate)
441	      SROAArgValues[&I] = SROAArg;
442	
443	    // Constant GEPs are modeled as free.
444	    return true;
445	  }

I guess if it was not a loop GEP would have been folded into a load/store instruction. Anyway the cost calculation for GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would return it is not free.

Not sure how relevant that is in practice, though.

Do you mean the IR code I provided? This is what a naive memcopy function is translated into.

I guess if it was not a loop GEP would have been folded into a load/store instruction. Anyway the cost calculation for GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would return it is not free.

Oh, the inliner has its own equivalent? Anyway, that isn't really the point; the point is that we're concluding the GEP is free because we're assuming the users of the GEP are loads and stores, rather than PHI nodes/calls/etc.

Not sure how relevant that is in practice, though.

Do you mean the IR code I provided? This is what a naive memcopy function is translated into.

Sorry, wasn't clear there; I mean, it obviously triggers in a substantial number of cases, but it probably doesn't actually change the inline cost enough to matter in most cases.

I guess if it was not a loop GEP would have been folded into a load/store instruction. Anyway the cost calculation for GEP is not correct.
Maybe instead of 'return true' it should be 'return isGEPFree(I);'. Then TTI::getGEPCost would return it is not free.

Oh, the inliner has its own equivalent? Anyway, that isn't really the point; the point is that we're concluding the GEP is free because we're assuming the users of the GEP are loads and stores, rather than PHI nodes/calls/etc.

My understanding of GEP was exactly as you wrote. I've browsed the LangRef and http://llvm.org/docs/GetElementPtr.html and I have not found anything prohibiting using GEP in such way. I think this representation of an increment of a pointer in a loop is better than:

inc.offset.0 = phi(offset, inc.offset)
inc.offset = inc.offset.0 + 1
ptr = GEP(base, inc.offset.0)

However the last one gives the correct cost.
BTW I did an experiment when such GEPs are not free and the call penalty has lower values: 5 and 10. 5 means it's a cost of one instruction. 10 is a cost of two instructions. This model of the call penalty might be good for ARM where a call just saves PC into a register. So it's something like a sequence of MOV + BR.
I didn't use the heuristic of many callers. The inline threshold value was not changed. It was 5.
The functions I didn't want to inline were not inlined. C++ benchmarks code size regressions improved but there are still some small regressions.

I checked if the GEPs from the provided IR are free on x86. No, they are not:

        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movl    %edx, -20(%rbp)
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movl    -20(%rbp), %eax
        movl    %eax, %ecx
        addl    $-1, %ecx
        movl    %ecx, -20(%rbp)
        cmpl    $0, %eax
        je      .LBB0_3
# BB#2:                                 #   in Loop: Header=BB0_1 Depth=1
        movq    -16(%rbp), %rax
        movq    %rax, %rcx
        addq    $1, %rcx
        movq    %rcx, -16(%rbp)
        movb    (%rax), %dl
        movq    -8(%rbp), %rax
        movq    %rax, %rcx
        addq    $1, %rcx
        movq    %rcx, -8(%rbp)
        movb    %dl, (%rax)
        jmp     .LBB0_1
.LBB0_3:

MOVQ + ADDQ are generated for them. I am writing a patch.