This is an archive of the discontinued LLVM Phabricator instance.

Inlining: Run the legacy AlwaysInliner before the regular inliner.
ClosedPublic

Authored by aemerson on Feb 8 2023, 8:04 PM.

Details

Summary

We have several situations where it's beneficial for code size to ensure that every
call to an always-inline function is inlined before normal inlining decisions are
made. While the normal inliner runs in a "MandatoryOnly" mode to try to do this,
it only does so on a per-SCC basis, rather than across the whole module. Ensuring that
all mandatory inlinings are done before any heuristic-based decisions are made
just makes sense.

Despite being referred to as the "legacy" AlwaysInliner pass, it's still necessary
at -O0 because the CGSCC inliner is too expensive in compile time to run at -O0.

This also fixes an exponential compile-time blowup in
https://github.com/llvm/llvm-project/issues/59126

Diff Detail

Event Timeline

aemerson created this revision.Feb 8 2023, 8:04 PM
Herald added a project: Restricted Project.Feb 8 2023, 8:04 PM
aemerson requested review of this revision.Feb 8 2023, 8:04 PM
Herald added a project: Restricted Project.
Herald added a subscriber: sstefan1.

__clang_hip_math.hip is annoying...

We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not

This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)

llvm/lib/Passes/PassBuilderPipelines.cpp
1082

I think we want to insert lifetime intrinsics when optimizing

> __clang_hip_math.hip is annoying...
>
> We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not
>
> This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)

I think that doesn't actually do the same thing as this, since the Calls vector is populated by visiting the functions in the current SCC. What we're trying to do with this patch is to ensure that all always-inline calls globally are processed first.

> __clang_hip_math.hip is annoying...
>
> We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not
>
> This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)
>
> I think that doesn't actually do the same thing as this, since the Calls vector is populated by visiting the functions in the current SCC. What we're trying to do with this patch is to ensure that all always-inline calls globally are processed first.

That's true, but the legacy pass manager where the inliner explosion didn't happen in your case didn't process always-inline calls before other calls. So I don't think it's necessary to process alwaysinline calls globally first to fix your case. However, given that we still do two more rounds of inlining in the inliner pipeline after the alwaysinliner pass you added and your case still doesn't blow up, this solution does seem robust.

> __clang_hip_math.hip is annoying...
>
> We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not

IIRC we had at one point a mandatory, whole-module pass. The idea was to not have N inliners; the AlwaysInliner had some limitations, and for reasons similar to those @aemerson pointed out, it made sense to first perform the mandatory inlines (D91567). In D94825 we went away from that. I don't remember why. @aeubanks? (it's referencing a patch you started)

> This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)

> __clang_hip_math.hip is annoying...
>
> We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not
>
> This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)
>
> I think that doesn't actually do the same thing as this, since the Calls vector is populated by visiting the functions in the current SCC. What we're trying to do with this patch is to ensure that all always-inline calls globally are processed first.
>
> That's true, but the legacy pass manager where the inliner explosion didn't happen in your case didn't process always-inline calls before other calls. So I don't think it's necessary to process alwaysinline calls globally first to fix your case. However, given that we still do two more rounds of inlining in the inliner pipeline after the alwaysinliner pass you added and your case still doesn't blow up, this solution does seem robust.

Sure, the exponential compile time case is actually just a side benefit here. The motivating reason for this change is actually to improve code size when building codebases that make heavy use of always_inline.

> __clang_hip_math.hip is annoying...
>
> We'll need to remove the MandatoryFirst inliner in ModuleInlinerWrapperPass, although not sure if @mtrofin has any issues with that or not
>
> IIRC we had at one point a mandatory, whole-module pass. The idea was to not have N inliners; the AlwaysInliner had some limitations, and for reasons similar to those @aemerson pointed out, it made sense to first perform the mandatory inlines (D91567). In D94825 we went away from that. I don't remember why. @aeubanks? (it's referencing a patch you started)
>
> This isn't quite what I had initially thought, but this might be better. (I was thinking that we sort the calls in the inliner to visit alwaysinline calls first, but that might cause more compile time issues since we have to update the call graph after visiting all the calls in a function, but we might be visiting every function twice if we first batch process the alwaysinline calls then all other calls)

Without knowing that there were inliner explosion issues, I still think it makes more sense to not visit all functions twice. e.g. it helps with cache locality when compiling if you don't visit functions on two separate walks of the call graph.

But if this solves the issue then I think this patch is good. Getting rid of the mandatory inline advisor is a nice cleanup.

> Sure, the exponential compile time case is actually just a side benefit here. The motivating reason for this change is actually to improve code size when building codebases that make heavy use of always_inline.

Ah I didn't see that part of the commit message. Mentioning code size more explicitly in the message would be good.

Had a chat offline with @mtrofin, wanted to be clear for future purposes that we do need the separate AlwaysInliner pass because it's used in -O0 and constructing a call graph there is non-trivial in terms of compile time. Originally the mandatory mode of the normal inliner was added to maybe remove the separate AlwaysInliner pass in the future, but that's not going to happen because of what I just said. Given that, we can eventually remove the mandatory mode of the normal inliner after this patch goes through. So this patch should also make mandatory-inlining-first false by default, then we remove it in a separate patch.

> Had a chat offline with @mtrofin, wanted to be clear for future purposes that we do need the separate AlwaysInliner pass because it's used in -O0 and constructing a call graph there is non-trivial in terms of compile time. Originally the mandatory mode of the normal inliner was added to maybe remove the separate AlwaysInliner pass in the future, but that's not going to happen because of what I just said. Given that, we can eventually remove the mandatory mode of the normal inliner after this patch goes through. So this patch should also make mandatory-inlining-first false by default, then we remove it in a separate patch.

Ok, sounds good. I'll make the changes.

mtrofin accepted this revision.Feb 9 2023, 3:35 PM

lgtm. Like @aeubanks was saying, let's just give a bit of time (1 month or so?) between when this lands and until we clean up the "mandatory" notion from the advisor, just to make sure nothing breaks/regresses.

This revision is now accepted and ready to land.Feb 9 2023, 3:35 PM

> Had a chat offline with @mtrofin, wanted to be clear for future purposes that we do need the separate AlwaysInliner pass because it's used in -O0 and constructing a call graph there is non-trivial in terms of compile time.

Worth maybe spelling that out in the patch description - i.e. why not go the D91567 route again, makes it easier to understand later.

> Originally the mandatory mode of the normal inliner was added to maybe remove the separate AlwaysInliner pass in the future, but that's not going to happen because of what I just said. Given that, we can eventually remove the mandatory mode of the normal inliner after this patch goes through. So this patch should also make mandatory-inlining-first false by default, then we remove it in a separate patch.

aemerson updated this revision to Diff 496283.Feb 9 2023, 4:42 PM
aemerson edited the summary of this revision. (Show Details)

Address comments. Disable -mandatory-inlining-first by default and insert lifetime intrinsics if not at -O0.

aeubanks added inline comments.Feb 9 2023, 4:49 PM
llvm/lib/Passes/PassBuilderPipelines.cpp
1082

this will never be called with Level == OptimizationLevel::O0, true is good enough

llvm/test/Transforms/Inline/always-inline-newpm.ll
2

a better file name is always-inline-phase-ordering, legacy PM is deprecated anyway

was this file exploding before?

This revision was landed with ongoing or failed builds.Feb 9 2023, 4:49 PM
This revision was automatically updated to reflect the committed changes.

(my latest comments didn't get addressed in the land)

aemerson added inline comments.Feb 9 2023, 9:53 PM
llvm/lib/Passes/PassBuilderPipelines.cpp
1082

Ok.

llvm/test/Transforms/Inline/always-inline-newpm.ll
2

Ok I'll rename, this one is demonstrating different/smaller code size with this change.

Hello - I had to revert this because of some large regressions we got from routines in CMSIS-DSP.

The llvm/test/Transforms/PhaseOrdering/ARM/arm_mult_q15.ll test shows the problem - that's why that test exists to ensure that any pipeline changes don't negatively affect these routines. Unfortunately you just changed the test as opposed to showing the problems that this causes. They might be fixable with some other tweaks elsewhere, but the ordering of inlining seems important for getting the correct code that can be vectorized nicely.

There are some other cases around inlining this thing on v6m cores: https://github.com/ARM-software/CMSIS-DSP/blob/809202bf185280a322efc2e2c850a544747f9d79/Include/arm_math_memory.h#L76, but I'm not sure about the details yet. The mult examples were the really large regressions.

> Hello - I had to revert this because of some large regressions we got from routines in CMSIS-DSP.
>
> The llvm/test/Transforms/PhaseOrdering/ARM/arm_mult_q15.ll test shows the problem - that's why that test exists to ensure that any pipeline changes don't negatively affect these routines. Unfortunately you just changed the test as opposed to showing the problems that this causes. They might be fixable with some other tweaks elsewhere, but the ordering of inlining seems important for getting the correct code that can be vectorized nicely.
>
> There are some other cases around inlining this thing on v6m cores: https://github.com/ARM-software/CMSIS-DSP/blob/809202bf185280a322efc2e2c850a544747f9d79/Include/arm_math_memory.h#L76, but I'm not sure about the details yet. The mult examples were the really large regressions.

It's not clear from the original commit message why the test is related to inlining order. It seems to be entirely testing the vectorization cost model, which should be insensitive to these kinds of changes, right?

> It’s not clear from the original commit message why the test is related to inlining order? It seems entirely testing vectorization cost model which should be insensitive to these kind of changes, right?

It's a phase ordering test - it's testing the entire pipeline including all the inlining and simplification that needs to happen :)

You can run update_test_checks on the file to see the differences. I believe the inlining causes differences in the code that then cause different vector factors to be chosen. I can try to add a similar test for the other case that got worse, if they are similar.

> It’s not clear from the original commit message why the test is related to inlining order? It seems entirely testing vectorization cost model which should be insensitive to these kind of changes, right?
>
> It's a phase ordering test - it's testing the entire pipeline including all the inlining and simplification that needs to happen :)
>
> You can run update_test_checks of the file to see the differences. I believe the inlining causes differences in the code that then cause different vector factors to be chosen. I can try to add a similar test for the other case that got worse, if they are similar.

I'll take a look, but this indicates to me that there's something missing from the vectoriser or later passes, rather than a problem with the inliner's behaviour.

> I’ll take a look, but this indicates to me that there’s something missing from the vectoriser or later passes rather than a problem with the inliners behaviour.

Sure. I'm not saying that this patch is wrong. I'm just saying that unfortunately it leads to some pretty large regressions. Hopefully we can figure out why and put fixes in place instead of just bodging the tests. Hopefully it's just some missing fold to get the code into the same form it was before, after all the inlining has happened.

I took a look at one of the other cases; it appears to be a pretty unfortunate case of the load order in loops leading to LSR not recognizing chains of instructions, due to them being ordered with offsets [2,0,6,4,10,8,..] as opposed to the order they were in before [0,2,4,6,8,10,...], which was easier to reason about. https://godbolt.org/z/Grv64xoxW. I'm not sure exactly what the best way to fix that would be without making other cases worse.

@dmgreen I've been looking at this test again trying to see what's missing. The problem now is that only a VF of 4 is chosen. In the good case, instcombine/simplifyCFG runs so that it simplifies down to an smin intrinsic. After this change __SSAT() is inlined first. We then have:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-linux-gnu"

define void @arm_mult_q15(ptr %pSrcA, ptr %pSrcB, ptr noalias %pDst, i32 %blockSize) {
entry:
  br label %while.cond

while.cond:                                       ; preds = %while.body, %entry
  %pSrcB.addr.0 = phi ptr [ %pSrcB, %entry ], [ %incdec.ptr1, %while.body ]
  %pDst.addr.0 = phi ptr [ %pDst, %entry ], [ %incdec.ptr4, %while.body ]
  %pSrcA.addr.0 = phi ptr [ %pSrcA, %entry ], [ %incdec.ptr, %while.body ]
  %blkCnt.0 = phi i32 [ %blockSize, %entry ], [ %dec, %while.body ]
  %cmp.not = icmp eq i32 %blkCnt.0, 0
  br i1 %cmp.not, label %while.end, label %while.body

while.body:                                       ; preds = %while.cond
  %incdec.ptr = getelementptr inbounds i16, ptr %pSrcA.addr.0, i32 1
  %0 = load i16, ptr %pSrcA.addr.0, align 2
  %conv = sext i16 %0 to i32
  %incdec.ptr1 = getelementptr inbounds i16, ptr %pSrcB.addr.0, i32 1
  %1 = load i16, ptr %pSrcB.addr.0, align 2
  %conv2 = sext i16 %1 to i32
  %mul = mul nsw i32 %conv, %conv2
  %shr = ashr i32 %mul, 15
  %cmp4.i = icmp sgt i32 %shr, 32767
  %switch.i = icmp ult i1 %cmp4.i, true
  %spec.select.i = select i1 %switch.i, i32 %shr, i32 32767
  %conv3 = trunc i32 %spec.select.i to i16
  %incdec.ptr4 = getelementptr inbounds i16, ptr %pDst.addr.0, i32 1
  store i16 %conv3, ptr %pDst.addr.0, align 2
  %dec = add i32 %blkCnt.0, -1
  br label %while.cond

while.end:                                        ; preds = %while.cond
  ret void
}

These instructions are from the callee that should now be combined into smin:

%cmp4.i = icmp sgt i32 %shr, 32767
%switch.i = icmp ult i1 %cmp4.i, true
%spec.select.i = select i1 %switch.i, i32 %shr, i32 32767

... except due to the surrounding instructions, the first icmp is optimized into
icmp sgt i32 %mul, 1073741823 by InstCombinerImpl::foldICmpInstWithConstant()

This breaks the smin recognition. I'm not sure what the best approach is to fix this. InstCombine already has this chunk of code to try to avoid messing with compares that might form min/max patterns, but it expects further simplification to fire:

// Test if the ICmpInst instruction is used exclusively by a select as
// part of a minimum or maximum operation. If so, refrain from doing
// any other folding. This helps out other analyses which understand
// non-obfuscated minimum and maximum idioms, such as ScalarEvolution
// and CodeGen. And in this case, at least one of the comparison
// operands has at least one user besides the compare (the select),
// which would often largely negate the benefit of folding anyway.
//
// Do the same for the other patterns recognized by matchSelectPattern.
if (I.hasOneUse())
  if (SelectInst *SI = dyn_cast<SelectInst>(I.user_back())) {
    Value *A, *B;
    SelectPatternResult SPR = matchSelectPattern(SI, A, B);
    if (SPR.Flavor != SPF_UNKNOWN)
      return nullptr;
  }

Any ideas? I'd really like to get this inliner change in because it's fundamentally a good change to have.

Hello. It sounds like it is really close to being OK. The combine of the shift just seems to make things more difficult.

The icmp ult i1 %cmp4.i, true is just a not; would it help if it were actually an xor? Or if the not(icmp sgt) were changed to an slt earlier?

I was taking a look at the example but I am not super sure what to suggest. Would it be best if the code that detects min/max looked through nots?

> Hello. It sounds like it is really close to being OK. The combine of the shift just seem to make things more difficult.
>
> The icmp ult i1 %cmp4.i, true is just a not, would it help if it was actually an xor? Or if the not(icmp sgt) was changed to a slt earlier?
>
> I was taking a look at the example but I am not super sure what to suggest. Would it be best if the code that detect min/max looked through not's?

I posted an attempt at this: https://reviews.llvm.org/D149725

It looks like there is quite a lot more optimization that happens to the function being always-inlined (__SSAT) before this change. Through multiple rounds of instcombine, almost to the end of the pass pipeline. The new version runs a lot less before inlining, only running instcombine->simplifycfg and not seeing another instcombine to clean up the results. Is that because the AlwaysInlinePass is a module pass and it now only runs the passes up to that point?

It does look like there might be a chance to undo the transform, as opposed to prevent the transform that blocks it. Something like https://alive2.llvm.org/ce/z/qHtPqz seems to happen at at least one point. Might that be more preferable?

There are some other changes I'm seeing though, from the same function inlined into a different routine. This one, for example, seems to no longer apply in canonicalizeClampLike, so the ssat doesn't get created. https://godbolt.org/z/qMW44qfz4. That doesn't seem to be easily undoable without knowing the value is positive, though: https://alive2.llvm.org/ce/z/v9YdaK.

nikic added a comment.May 3 2023, 8:08 AM

> It looks like there is quite a lot more optimization that happens to the function being always-inlined (__SSAT) before this change. Through multiple rounds of instcombine, almost to the end of the pass pipeline. The new version runs a lot less before inlining, only running instcombine->simplifycfg and not seeing another instcombine to clean up the results. Is that because the AlwaysInlinePass is a module pass and it now only runs the passes up to that point?

Yes, which is why I personally think this change isn't a good idea. This essentially breaks our invariant that functions get simplified before they are inlined. This significantly alters the way alwaysinline functions will be optimized relative to normally inlined functions.

> It looks like there is quite a lot more optimization that happens to the function being always-inlined (__SSAT) before this change. Through multiple rounds of instcombine, almost to the end of the pass pipeline. The new version runs a lot less before inlining, only running instcombine->simplifycfg and not seeing another instcombine to clean up the results. Is that because the AlwaysInlinePass is a module pass and it now only runs the passes up to that point?
>
> Yes, which is why I personally think this change isn't a good idea. This essentially breaks our invariant that functions get simplified before they are inlined. This significantly alters the way alwaysinline functions will be optimized relative to normally inlined functions.

(Nitpicking just on the invariant part.) Not sure that's always the invariant, because we could be inlining a call site in an SCC where both caller and callee are in that same SCC.

> It looks like there is quite a lot more optimization that happens to the function being always-inlined (__SSAT) before this change. Through multiple rounds of instcombine, almost to the end of the pass pipeline. The new version runs a lot less before inlining, only running instcombine->simplifycfg and not seeing another instcombine to clean up the results. Is that because the AlwaysInlinePass is a module pass and it now only runs the passes up to that point?
>
> Yes, which is why I personally think this change isn't a good idea. This essentially breaks our invariant that functions get simplified before they are inlined. This significantly alters the way alwaysinline functions will be optimized relative to normally inlined functions.

That invariant shouldn't matter if we're not using heuristics to inline. The normal heuristic-based inliner will still work on simplified callees, but now with the additional benefit of seeing the state of an SCC after all the inlinings that must happen have happened.

nikic added a comment.May 3 2023, 12:28 PM

> It looks like there is quite a lot more optimization that happens to the function being always-inlined (__SSAT) before this change. Through multiple rounds of instcombine, almost to the end of the pass pipeline. The new version runs a lot less before inlining, only running instcombine->simplifycfg and not seeing another instcombine to clean up the results. Is that because the AlwaysInlinePass is a module pass and it now only runs the passes up to that point?
>
> Yes, which is why I personally think this change isn't a good idea. This essentially breaks our invariant that functions get simplified before they are inlined. This significantly alters the way alwaysinline functions will be optimized relative to normally inlined functions.
>
> That invariant shouldn't matter if we're not using heuristics to inline. The normal heuristic-based inliner will still work on simplified callees, but now with the additional benefit of seeing the state of an SCC where there may be alwaysinline calls after the inlinings that must happen having happened.

I agree in the sense that for always-inlining the simplification doesn't matter for cost modelling purposes. However, it can still have other benefits. One is that we don't repeat unnecessary work. Any simplification we do before inlining doesn't have to be repeated for every (potentially transitive) caller it gets inlined into. The other (and in a way, opposite) is that we simplify the function before inlining and then again the callee after inlining, which may paper over phase ordering issues (or the problem discussed above, which is kind of in the same category). Of course, this is not what this is intended for and we should endeavor to fix such issues -- but the practical outcome is still that you'll probably get worse optimization if you use alwaysinline vs inline after this change, which seems kind of unintuitive.