Do not early-inline recursive calls in sample profile loader.
ClosedPublic

Authored by danielcdh on Jun 7 2017, 4:09 PM.

Details

Summary

Early-inlining of recursive call makes the code size bloat exponentially. We should not disable it.

danielcdh created this revision.Jun 7 2017, 4:09 PM

It doesn't have to blow it up exponentially. Can we clone the function to be inlined, re-writing the recursive calls to call the clone, and then inline that? Similar to a worker-wrapper transformation

It doesn't have to blow it up exponentially. Can we clone the function to be inlined, re-writing the recursive calls to call the clone, and then inline that? Similar to a worker-wrapper transformation

You mean clone the caller before any inlining, and inline that copy instead of the caller itself in early inlining? Yes we could do that, but that adds unnecessary copy overhead for every caller (which could be large). As llvm does not do recursive inlining anyway, this only solves cases where profile is collected from gcc binary, which should be temporary. So I guess it's not worth the effort to special-handle recursive early inlining.

Does it miss opportunities to profile annotate inline instance of indirect target?

Does it miss opportunities to profile annotate inline instance of indirect target?

For indirect recursive-call, yes. But those calls will not be by llvm inlined anyway. So promote to direct call does not help much.

So the profile is from GCC built binary ?

It doesn't have to blow it up exponentially. Can we clone the function to be inlined, re-writing the recursive calls to call the clone, and then inline that? Similar to a worker-wrapper transformation

You mean clone the caller before any inlining, and inline that copy instead of the caller itself in early inlining? Yes we could do that, but that adds unnecessary copy overhead for every caller (which could be large). As llvm does not do recursive inlining anyway, this only solves cases where profile is collected from gcc binary, which should be temporary. So I guess it's not worth the effort to special-handle recursive early inlining.

I meant to do it only if the inline is recursive. The overhead should be much lower in that case.
Clone the function, make the recursive calls call the clone, inline the clone. Repeat until you no longer want to inline any more, then take the remaining calls to the clone, and change them back. GC the clone. You could think of it as a kind of loop unrolling.

iteratee added inline comments.Jun 7 2017, 5:13 PM
lib/Transforms/IPO/SampleProfile.cpp
698

Add a comment?

So the profile is from GCC built binary ?

yes, we need to make sure it does not cause build breakage.

It doesn't have to blow it up exponentially. Can we clone the function to be inlined, re-writing the recursive calls to call the clone, and then inline that? Similar to a worker-wrapper transformation

You mean clone the caller before any inlining, and inline that copy instead of the caller itself in early inlining? Yes we could do that, but that adds unnecessary copy overhead for every caller (which could be large). As llvm does not do recursive inlining anyway, this only solves cases where profile is collected from gcc binary, which should be temporary. So I guess it's not worth the effort to special-handle recursive early inlining.

I meant to do it only if the inline is recursive. The overhead should be much lower in that case.
Clone the function, make the recursive calls call the clone, inline the clone. Repeat until you no longer want to inline any more, then take the remaining calls to the clone, and change them back. GC the clone. You could think of it as a kind of loop unrolling.

This is still not-optimal. e.g.

foo() {

if (a)
  bar_large();
foo();

}

Suppose we first process bar_large() and inlined it. Then we start process foo and found it's recursive call, then we clone the foo body (with bar_large inlined). Then we recursively inline foo_clone into foo. But foo_clone already have bar_large(), which may not necessarily be cloned multiple times. A better solution would be

for (callsite cs : caller.callsites())

if (cs is recursive)
  clone caller

for (callsite cs : caller.callsites())

if (should early inline)
  if (cs is recursive)
    inline caller_clone into caller
  else
    inline callee into caller

But this is still not enough, as we may have foo calls bar, bar calls foo(). Which cannot be detected by the above algorithm. Of cause, we can process the SCC to find if caller could be early-inlined to be recursive, but that may be expensive, or make the logic complex.

danielcdh updated this revision to Diff 101844.Jun 7 2017, 5:30 PM

add comments

danielcdh marked an inline comment as done.Jun 7 2017, 6:00 PM
iteratee accepted this revision.Jun 8 2017, 11:26 AM
This revision is now accepted and ready to land.Jun 8 2017, 11:26 AM
danielcdh closed this revision.Jun 8 2017, 1:12 PM