This is an archive of the discontinued LLVM Phabricator instance.

Don't inline dynamic allocas that simplify to huge static allocas.
ClosedPublic

Authored by aemerson on Jun 12 2020, 2:21 PM.

Details

Summary

Some sequences of optimizations can generate call sites which may never be
executed at runtime, and through constant propagation result in dynamic
allocas being converted to static allocas with very large allocation amounts.

The inliner tries to move these to the caller's entry block, resulting in the
stack limits being reached/bypassed. Avoid inlining functions when inlining
would have this result.

The threshold of 64k currently doesn't get triggered on the test suite with an
-Os LTO build on arm64; care should be taken in changing this in future to avoid
needlessly pessimising inlining behaviour.

Diff Detail

Event Timeline

aemerson created this revision.Jun 12 2020, 2:21 PM
jfb added a subscriber: MadCoder.Jun 12 2020, 3:18 PM
jfb added inline comments.
llvm/include/llvm/Analysis/InlineCost.h
54

@MadCoder does this size limit make sense to you, or would you recommend something smaller?

llvm/lib/Transforms/Utils/InlineFunction.cpp
1939

This will also prevent allocations of unknown size from inlining, right? Is that a pessimization? I guess the problem you have is that you don't know whether you'll then propagate a huge size into the alloca, and then hoist it outside a block?

Would it make sense to both:

  • Refuse inlining large alloca of known size
  • Refuse moving large alloca of known size outside blocks
  • Refuse moving variable-sized alloca (unknown size) outside blocks

?

That way you can still inline functions with alloca of unknown size, you can still constant propagate into them, but you would hoist the alloca if it's big.

MadCoder added inline comments.Jun 12 2020, 3:39 PM
llvm/include/llvm/Analysis/InlineCost.h
54

64k is wayyyy too large.

a kernel stack is 16k.

and we've seen that cause explosions in ioctl()-style syscalls that have a large switch dispatching to their various implementations, where one of the functions has a large stack buffer and is a leaf, while another path has no stack allocations but descends into a deep call stack.

the threshold for the kernel is likely something like 1 or 2k

jfb added a comment.Jun 12 2020, 3:39 PM

One thing I'd like to make sure we don't break:

__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}

This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...

Independent of the thresholds, I'd be very cautious about turning a static alloca into a dynamic alloca. Our code generation isn't very sophisticated in a lot of cases; it often requires a "base pointer" to be generated in a fixed register. This leads to issues even if the alloca never runs: there's a performance penalty due to allocating the base pointer, and we can potentially break inline asm. We don't want to be doing that; if it comes down to either emitting a dynamic alloca or refusing to inline, refusing to inline is probably better.

aemerson marked an inline comment as done.Jun 12 2020, 4:53 PM

Independent of the thresholds, I'd be very cautious about turning a static alloca into a dynamic alloca. Our code generation isn't very sophisticated in a lot of cases; it often requires a "base pointer" to be generated in a fixed register. This leads to issues even if the alloca never runs: there's a performance penalty due to allocating the base pointer, and we can potentially break inline asm. We don't want to be doing that; if it comes down to either emitting a dynamic alloca or refusing to inline, refusing to inline is probably better.

How would generating a dynamic alloca break inline asm? That seems very fragile to me if that's the case.

64k was chosen, somewhat arbitrarily, as a value beyond which no reasonable piece of code should be allocating on the stack. This is really only intended to catch very rare circumstances where we get something like -1 as the alloc size. That said, I can look at changing this approach to avoid inlining large static allocas completely if we know that it would likely end up being a dynamic alloca in the caller.

llvm/lib/Transforms/Utils/InlineFunction.cpp
1939

This patch doesn't actually change any inlining heuristics. At this point we've already inlined the function, we're just deciding whether or not to move static allocas in the inlined body to the caller's entry block.

What happens if the alloca that is left in the conditional, instead of being moved to the caller's entry block, is on the hot path, like in a loop?

llvm/include/llvm/Analysis/InlineCost.h
54

Nit: if this stays 65536, should it be typed uint32_t? unsigned's size may not always be 4 bytes.

llvm/lib/Transforms/Utils/InlineFunction.cpp
1936

Nit: IsAllocOverSizeLimit (or some other verb-starting alternative)

1939

I think jfb's point is about the effects of that decision.

aemerson updated this revision to Diff 270555.Jun 12 2020, 6:13 PM
aemerson retitled this revision from Don't hoist very large static allocas to the entry block during inlining to Don't inline dynamic allocas that simplify to huge static allocas..
aemerson edited the summary of this revision. (Show Details)

Instead of avoiding hoisting, avoid inlining altogether.

Just a thought, feel free to consider or ignore.

It occurs to me that the lowering of static alloca is under compiler control. Maybe we shouldn't look at this as being an inlining problem, but instead look at it as being a lowering problem? If we have a static alloca which is huge and only used in a rare codepath, maybe we should be converting that to a dynamic alloca if legal for the target?

It occurs to me that the lowering of static alloca is under compiler control. Maybe we shouldn't look at this as being an inlining problem, but instead look at it as being a lowering problem? If we have a static alloca which is huge and only used in a rare codepath, maybe we should be converting that to a dynamic alloca if legal for the target?

There are a few problems:

  1. Figuring out the legality on top of the current IR representation of dynamic allocation and lifetimes would be painful. If a function has any dynamic allocation, it's hard to isolate it into proper scopes. And lifetime markers are just messy to reason about. I have some vague ideas about rewriting the way we represent dynamic allocation to work more like llvm.call.preallocated.setup, and rewriting the way we represent lifetimes, but I'm not completely sure how it would work.
  2. We'd probably have to teach targets to avoid the use of a fixed "base" pointer in functions with dynamic allocation, both for performance and to avoid weird interactions with inline asm. Probably feasible, but a substantial project.
efriedma added inline comments.Jun 22 2020, 11:46 PM
llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

How is the handling for array allocation supposed to work if the alloca isn't in the entry block of the function? It looks like this code assumes that constant propagation is enough to turn any dynamic alloca into a static alloca?

Not really a problem with this change, exactly, but I want to make sure I'm understanding the surrounding code correctly.

llvm/lib/Transforms/Utils/InlineFunction.cpp
29

Unnecessary change?

mtrofin added inline comments.Jun 23 2020, 7:26 AM
llvm/include/llvm/Analysis/InlineCost.h
54

I think MaxSimplifiedDynamicAllocaToMove should be part of InlineParams, since it's a threshold.

aemerson marked an inline comment as done.Jun 23 2020, 11:18 AM
aemerson added inline comments.
llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

I'm not entirely sure what you mean, but if an array allocation isn't in the entry block, and if constant prop can't simplify it to a static alloca, then it remains a dynamic alloca and the inliner gives up (see the end of this function).

jfb added a comment.Jun 23 2020, 11:23 AM
In D81765#2090952, @jfb wrote:

One thing I'd like to make sure we don't break:

__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}

This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...

Can you check this?

llvm/include/llvm/Analysis/InlineCost.h
54

It probably needs to be a per-target thing, since we want different values for userspace and kernel.

mtrofin added inline comments.Jun 23 2020, 12:10 PM
llvm/lib/Analysis/InlineCost.cpp
845 ↗(On Diff #270555)

I think that the decision here needs to be mindful of profiling information (if present). The hypothesis in the patch is that the call site may never be executed. What if we know (from profiling) it is always executed?

Also, could you provide more insight into the scenario generating this - thanks!

efriedma added inline comments.Jun 23 2020, 12:57 PM
llvm/include/llvm/Analysis/InlineCost.h
54

Rather than add a threshold for a single dynamic alloca, it might make sense to just pay attention to the overall size of the stack of the inlined function. We currently have a threshold TotalAllocaSizeRecursiveCaller which applies to recursive inlining; we could apply a similar threshold to inlining in general. That would avoid potential weird interactions with other passes about whether the alloca is actually dynamic.

It might make sense to let the user control that threshold; users running code on small stacks might care more about the stack size. But really, I'm not sure there's a great way to solve "how much stack memory should we speculatively allocate" if we can't see the whole callstack.

But really, I'm fine with this patch as-is as a spot fix for "I accidentally allocated 2^34 bytes in dead code". We probably want to consider the general issue of stack usage more carefully.

llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

My question is what happens if an array allocation isn't in the entry block, but we can simplify the size to a constant.

aemerson marked 4 inline comments as done.Jun 23 2020, 6:17 PM
In D81765#2109656, @jfb wrote:
In D81765#2090952, @jfb wrote:

One thing I'd like to make sure we don't break:

__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}

This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...

Can you check this?

Sure, I'll add this as a test.

llvm/include/llvm/Analysis/InlineCost.h
54

It probably needs to be a per-target thing, since we want different values for userspace and kernel.

For this specific bug any reasonable value is going to work to catch a -1 size allocation. Like Eli says, I feel that controlling stack allocation size in general is a much broader set of changes than this one.

llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

That was the original motivation for this fix actually. If it's simplified to a constant, then it can be converted to a static alloca and later moved into the entry block. If that constant allocation is too large then we can get overflows.

845 ↗(On Diff #270555)

What if we know (from profiling) it is always executed?

IMO then that function is simply allocating a huge amount on the stack. I see that in a similar way to TotalAllocaSizeRecursiveCaller being triggered, in that case we don't want to inline either, no matter what profiling information says, right?

Also, could you provide more insight into the scenario generating this - thanks!

This is code generated as a result of the -ftrivial-auto-var-init=pattern option, combined with user code that assumes VLAs will not be speculatively created with a -1 (2^32) size in the entry block. If you run the test case in this patch you will see the resulting alloca, which is guaranteed to crash on any reasonable platform.

efriedma added inline comments.Jun 23 2020, 6:50 PM
llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

Take the following testcase. On trunk, we inline into caller1 and caller3, but not caller2 and caller4. I think your patch stops inlining in caller1, but not caller3. That seems weird to me...

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

define void @caller1(i8 *%p1, i1 %b) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 -1)
  br label %exit

exit:
  ret void
}

define void @caller2(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 %len)
  br label %exit

exit:
  ret void
}

define void @caller3(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 300)
  br label %exit

exit:
  ret void
}

define void @callee(i8* %p1, i32 %l1, i32 %l2) {
entry:
  %ext = zext i32 %l2 to i64
  br label %foo
foo:
  %vla = alloca float, i64 %ext, align 16
  %call = call i1 @extern_call(float* nonnull %vla) nounwind 
  br i1 %call, label %foo, label %exit
  
exit:
  ret void
}

define void @caller4(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee2(i8* %p1, i32 0)
  br label %exit

exit:
  ret void
}

define void @callee2(i8* %p1, i32 %l1) {
entry:
  br label %foo
foo:
  %vla = alloca float, i64 300, align 16
  %call = call i1 @extern_call(float* nonnull %vla) nounwind 
  br i1 %call, label %foo, label %exit
  
exit:
  ret void
}

declare i1 @extern_call(float*)
aemerson marked an inline comment as done.Jun 23 2020, 7:42 PM
aemerson added inline comments.
llvm/lib/Analysis/InlineCost.cpp
838 ↗(On Diff #270555)

At first I thought that was because the 300 value in caller3 was below the threshold for preventing inlining, which it is. But there's actually a bug here where the patch isn't scaling the allocation size by the size of the type. That's why it still allows inlining even if you change it to 30000. I'll fix that.

mtrofin added inline comments.Jun 23 2020, 7:48 PM
llvm/lib/Analysis/InlineCost.cpp
845 ↗(On Diff #270555)

If profiling indicates the callsite is hot, then we know that a) there's no runtime problem, and b) we would want to inline. Yes, that can then surface the problem because of the behavior in InlineFunction; my point is to illustrate that it's not clear-cut that we wouldn't want to inline.

jfb added inline comments.Jun 23 2020, 8:42 PM
llvm/lib/Analysis/InlineCost.cpp
845 ↗(On Diff #270555)

I think profiling is important to take into account, but I think you're wrong on one point: say we have this function:

void f(int a, int b) {
  if (a < 1000)
    g();
  if (b < 1000)
    h();
}

Imagine profiling shows both are hot. Imagine g and h have very large stacks. We'll now inline them, hoist them to the top of f, and in many cases fail to color the allocas separately.

What you said is: "profiling shows there's no runtime problem". That's only true without inlining! With inlining, if the optimizer falls on its face, there's a problem.

The bug I describe isn't exactly the one @aemerson is fixing, but it's connected, and one @MadCoder and friends run into *very frequently*.

If we back up: Amara is trying to fix one instance of a quite unfortunate bug. That won't fix the entire problem with LLVM inlining alloca and then making a mess of it. But I hope it helps. Let's focus on avoiding what it might regress, and as a follow-up I think we all agree there's much more to be done.

mtrofin added inline comments.Jun 23 2020, 9:21 PM
llvm/lib/Analysis/InlineCost.cpp
845 ↗(On Diff #270555)

Yup - inlining would "surface the problem because of the behavior in InlineFunction" - I think we're very much in agreement!

If this patch can avoid unintended perf regressions, or offer a way to avoid them, sgtm!

For due diligence, what's the experience today of folks that run into this - refactor code and mark noinline? (Asking to learn, not to minimize what's very likely quite a negative experience)

aemerson updated this revision to Diff 273114.Jun 24 2020, 11:38 AM

Add tests and fix not scaling the alloc size by the type size.

aemerson marked an inline comment as done.Jun 24 2020, 11:50 AM
aemerson added inline comments.
llvm/lib/Analysis/InlineCost.cpp
845 ↗(On Diff #270555)

At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected.

I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer.

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain a dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

llvm/lib/Analysis/InlineCost.cpp
849 ↗(On Diff #273114)

I'd add here a FIXME, detailing what the underlying problem is, and that this is a stop-gap measure which may potentially cause unintended perf regressions - this context would help with maintainability.

For good measure, I think a FIXME in InlineFunction detailing the problem would also be worthwhile.

845 ↗(On Diff #270555)

At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected.

I'm not sure that would catch production cases with profiling information - on the other hand, of course 64K seems large enough.

Could you comment on my last question above "what's the experience today... etc" - thanks!

I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer.

Probably fine - @jfb had a comment on this probably being more of a per-target thing anyway

This revision is now accepted and ready to land.Jun 24 2020, 12:28 PM
aemerson marked an inline comment as done.Jun 24 2020, 1:19 PM

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain a dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

That's a poor description on my part; you can disregard it. If we inlined it but didn't hoist the alloca from the never-executed block, the function would still have a dynamic alloca present, which could negatively impact codegen as Eli alluded to earlier in this thread.

Could you comment on my last question above "what's the experience today... etc" - thanks!

Sorry I forgot to address that. The experience today is that this bug causes an out-of-stack fault due to trying to allocate 4GB of stack. The workaround is either to disable -fauto-var-init or to change the code such that the optimizer isn't able to SCCP the alloca size to the callee, and therefore prevent the dynamic alloca from being promoted to a static one. E.g. try to add an early exit somewhere so that the optimizer can prove that the alloc amount must always be > 0.

Also I just noticed one of the new tests is missing some CHECK lines, I'll add those before committing.

llvm/lib/Analysis/InlineCost.cpp
849 ↗(On Diff #273114)

I don't really have a good example of how this could cause perf regressions.

Replying to an earlier point:

What if we know (from profiling) it is always executed?

Correct me if I'm wrong, but how could a function which blows the stack be always executed? The threshold of 64k was intended to only catch cases which would clearly be crashing if executed. I could bump that up even further to 1MB or something to be "safer" from a perf perspective.

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain a dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

That's a poor description on my part; you can disregard it. If we inlined it but didn't hoist the alloca from the never-executed block, the function would still have a dynamic alloca present, which could negatively impact codegen as Eli alluded to earlier in this thread.

Could you update it accordingly - thanks!

Could you comment on my last question above "what's the experience today... etc" - thanks!

Sorry I forgot to address that. The experience today is that this bug causes an out-of-stack fault due to trying to allocate 4GB of stack. The workaround is either to disable -fauto-var-init or to change the code such that the optimizer isn't able to SCCP the alloca size to the callee, and therefore prevent the dynamic alloca from being promoted to a static one. E.g. try to add an early exit somewhere so that the optimizer can prove that the alloc amount must always be > 0.

Also I just noticed one of the new tests is missing some CHECK lines, I'll add those before committing.

llvm/lib/Analysis/InlineCost.cpp
849 ↗(On Diff #273114)

Sure, for 64K that won't happen. I'm more worried about how this code will evolve - unless a more complete fix makes its way in first. That's why I think it'd be valuable to capture in a FIXME the whole context of what this code does, including that, if that threshold is dropped, it could lead to unintentional perf regressions.

aemerson updated this revision to Diff 273194.Jun 24 2020, 5:25 PM
aemerson edited the summary of this revision. (Show Details)

Add some CHECK lines, add FIXME comment, update description/commit message.

I'm going to commit this now assuming that all feedback has been addressed, let me know if there's anything else needed.

This revision was automatically updated to reflect the committed changes.