This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
include/llvm/Analysis/
-
llvm/
-
Analysis/
2/7
InlineCost.h
-
lib/Analysis/
-
Analysis/
6/16
InlineCost.cpp
-
test/Transforms/Inline/
-
Transforms/
-
Inline/
-
dynamic-alloca-simplified-large.ll

Differential D81765

Don't inline dynamic allocas that simplify to huge static allocas.
ClosedPublic

Authored by aemerson on Jun 12 2020, 2:21 PM.

Download Raw Diff

Details

Reviewers

eli.friedman
mtrofin
tejohnson
reames
jfb
efriedma

Commits

rG090c108d04e2: Don't inline dynamic allocas that simplify to huge static allocas.

Summary

Some sequences of optimizations can generate call sites which may never be
executed during runtime, and through constant propagation result in dynamic
allocas being converted to static allocas with very large allocation amounts.

The inliner tries to move these to the caller's entry block, resulting in the
stack limits being reached/bypassed. Avoid inlining functions if this would
result.

The threshold of 64k currently doesn't get triggered on the test suite with an
-Os LTO build on arm64, care should be taken in changing this in future to avoid
needlessly pessimising inlining behaviour.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

aemerson created this revision.Jun 12 2020, 2:21 PM

Herald added subscribers: haicheng, hiraditya, eraman. · View Herald TranscriptJun 12 2020, 2:21 PM

aemerson added a reviewer: jfb.Jun 12 2020, 2:22 PM

Herald added a subscriber: dexonsmith. · View Herald TranscriptJun 12 2020, 2:22 PM

jfb added a subscriber: MadCoder.Jun 12 2020, 3:18 PM

jfb added inline comments.

llvm/include/llvm/Analysis/InlineCost.h
54	@MadCoder does this size limit make sense to you, or would you recommend something smaller?
llvm/lib/Transforms/Utils/InlineFunction.cpp
1939 ↗	(On Diff #270511)	This will also prevent allocations of unknown size from inlining, right? Is that a pessimization? I guess the problem you have is that you don't know whether you'll then propagate a huge size into the alloca, and then hoist it outside a block? Would it make sense to both: Refuse inlining large alloca of known size Refuse moving large alloca of known size outside blocks Refuse moving variable-sized alloca (unknown size) outside blocks ? That way you can still inline functions with alloca of unknown size, you can still constant propagate into them, but you would hoist the alloca if it's big.

Harbormaster failed remote builds in B60169: Diff 270511!Jun 12 2020, 3:24 PM

MadCoder added inline comments.Jun 12 2020, 3:39 PM

llvm/include/llvm/Analysis/InlineCost.h
54	64k is wayyyy too large. a kernel stack is 16k. and we've seen that cause explosions in ioctl() kind of syscalls that have a large switch going in their various implementations where one of the functions has a large stack buffer and is a leaf and another path has no stack allocations but will go in a deep stack. the threshold for the kernel is likely something like 1 or 2k

One thing I'd like to make sure we don't break:

__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}

This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...

Independent of the thresholds, I'd be very cautious about turning a static alloca into a dynamic alloca. Our code generation isn't very sophisticated in a lot of cases; it often requires a "base pointer" to be generated in a fixed register. This leads to a issues even if the alloca never runs: there's a performance penalty due to allocating the base pointer, and we can potentially break inline asm. We don't want to be doing that; if it comes down to either emitting a dynamic alloca or refusing to inline, refusing to inline is probably better.

In D81765#2090965, @efriedma wrote:

Independent of the thresholds, I'd be very cautious about turning a static alloca into a dynamic alloca. Our code generation isn't very sophisticated in a lot of cases; it often requires a "base pointer" to be generated in a fixed register. This leads to a issues even if the alloca never runs: there's a performance penalty due to allocating the base pointer, and we can potentially break inline asm. We don't want to be doing that; if it comes down to either emitting a dynamic alloca or refusing to inline, refusing to inline is probably better.

How would generating a dynamic alloca break inline asm? That seems very fragile to me if that's the case.

64k was chosen, somewhat arbitrarily, as a value which no reasonable piece of code should be allocating more of in the stack. This isn't really intended to catch cases except in very rare circumstances where we get something like -1 as the alloc size. That said, I can look at changing this approach to avoid inlining large static allocas completely if we know that it would likely end up being a dynamic alloca in the caller.

llvm/lib/Transforms/Utils/InlineFunction.cpp
1939 ↗	(On Diff #270511)	This patch doesn't actually change any inlining heuristics. At this point we've already inlined the function, we're just deciding whether or not to move static allocas in the inlined body to the caller's entry block.

What happens if the alloca that is left in the conditional instead of being moved at the caller's entry block is on the hot path, like in a loop?

llvm/include/llvm/Analysis/InlineCost.h
54	Nit: if this stays 65536, should it be typed uint32_t?? unsigned's size may not always be 4 bytes
llvm/lib/Transforms/Utils/InlineFunction.cpp
1936 ↗	(On Diff #270511)	Nit: IsAllocOverSizeLimit (or some other verb-starting alternative)
1939 ↗	(On Diff #270511)	I think jfb's point is about the effects of that decision.

Instead of avoiding hoisting, avoid inlining altogether.

Harbormaster failed remote builds in B60196: Diff 270555!Jun 12 2020, 6:48 PM

Just a thought, feel free to consider or ignore.

It occurs to me that the lowering of static alloca is under compiler control. Maybe we shouldn't look at this as being an inlining problem, but instead look at it as being a lowering problem? If we have a static alloca which is huge and only used in a rare codepath, maybe we should be converting that to a dynamic alloca if legal for the target?

It occurs to me that the lowering of static alloca is under compiler control. Maybe we shouldn't look at this as being an inlining problem, but instead look at it as being a lowering problem? If we have a static alloca which is huge and only used in a rare codepath, maybe we should be converting that to a dynamic alloca if legal for the target?

There are a few problems:

Figuring out the legality on top of the current IR representation of dynamic allocation and lifetimes would be painful. If a function has any dynamic allocation, it's hard to isolate it into proper scopes. And lifetime markers are just messy to reason about. I have some vague ideas about rewriting the way we represent dynamic allocation to work more like ‘llvm.call.preallocated.setup, and rewriting the way we represent lifetimes, but I'm not completely sure how it would work.
We'd probably have to teach targets to avoid the use of a fixed "base" pointer in functions with dynamic allocation, both for performance and to avoid weird interactions with inline asm. Probably feasible, but a substantial project.

Ping.

efriedma added inline comments.Jun 22 2020, 11:46 PM

llvm/lib/Analysis/InlineCost.cpp
838	Hhow is the handling for array allocation supposed to work if the alloca isn't in the entry block of the function? It looks like this code assumes that constant-propagation is enough to turn any dynamic alloca into a static alloca? Not really a problem with this change, exactly, but I want to make sure I'm understanding the surrounding code correctly.
llvm/lib/Transforms/Utils/InlineFunction.cpp
29 ↗	(On Diff #270555)	Unnecessary change?

mtrofin added inline comments.Jun 23 2020, 7:26 AM

llvm/include/llvm/Analysis/InlineCost.h
54	I think MaxSimplifiedDynamicAllocaToMove should be part of InlineParams, since it's a threshold.

aemerson marked an inline comment as done.Jun 23 2020, 11:18 AM

aemerson added inline comments.

llvm/lib/Analysis/InlineCost.cpp
838	I'm not entirely sure what you mean, but if an array allocation isn't in the entry block, and if constant prop can't simplify it to a static alloca, then it remains a dynamic alloca and the inliner gives up (see the end of this function).

In D81765#2090952, @jfb wrote:
One thing I'd like to make sure we don't break:
__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}
This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...

Can you check this?

llvm/include/llvm/Analysis/InlineCost.h
54	It probably needs to be a per-target thing, since we want different values for userspace and kernel.

mtrofin added inline comments.Jun 23 2020, 12:10 PM

llvm/lib/Analysis/InlineCost.cpp
845	I think that the decision here needs to be mindful of profiling information (if present). The hypothesis in the patch is that the call site may never be executed. What if we know (from profiling) it is always executed? Also, could you provide more insight into the scenario generating this - thanks!

efriedma added inline comments.Jun 23 2020, 12:57 PM

llvm/include/llvm/Analysis/InlineCost.h
54	Rather than add a threshold for a single dynamic alloca, it might make sense to just pay attention to the overall size of the stack of the inlined function. We currently have a threshold TotalAllocaSizeRecursiveCaller which applies to recursive inlining; we could apply a similar threshold to inlining in general. That would avoid potential weird interactions with other passes about whether the alloca is actually dynamic. It might make sense to let the user control that threshold; users running code on small stacks might care more about the stack size. But really, I'm not sure there's a great way to solve "how much stack memory should we speculatively allocate" if we can't see the whole callstack. But really, I'm fine with this patch as-is as a spot fix for "I accidentally allocated 2^34 bytes in dead code". We probably want to consider the general issue of stack usage more carefully.
llvm/lib/Analysis/InlineCost.cpp
838	My question is what happens if an array allocation isn't in the entry block, but we can simplify the size to a constant.

In D81765#2109656, @jfb wrote:
In D81765#2090952, @jfb wrote:
One thing I'd like to make sure we don't break:
__attribute__((always_inline))
char *stack_allocate(size_t size) {
  if (size < threshold)
    return alloca(size);
  return malloc(size);
}
This should always inline. Is it still the case? It turns out that we have Important Code which relies on this...
Can you check this?

Sure, I'll add this as a test.

llvm/include/llvm/Analysis/InlineCost.h
54	It probably needs to be a per-target thing, since we want different values for userspace and kernel. For this specific bug any reasonable value is going to work to catch a -1 size allocation. Like Eli says I feel generally controlling the stack allocation size is a much broader set of changes than this one.
llvm/lib/Analysis/InlineCost.cpp
838	That was the original motivation for this fix actually. If its simplified to a constant, then it can be converted to a static alloca and then later moved into the entry block. If that constant allocation is too large then we can get overflows.
845	What if we know (from profiling) it is always executed? IMO then that function is simply allocating a huge amount on the stack. I see that in a similar way to TotalAllocaSizeRecursiveCaller being triggered, in that case we don't want to inline either, no matter what profiling information says, right? Also, could you provide more insight into the scenario generating this - thanks! This is code generated as a result of the `-ftrivial-auto-var-init=pattern` option, which combined with user code that assumes that VLAs will not be speculatively created with -1 (2^32) size in the entry block. If you run the test case in this patch you will see the resulting alloca which is guaranteed to crash on any reasonable platform.

efriedma added inline comments.Jun 23 2020, 6:50 PM

llvm/lib/Analysis/InlineCost.cpp

838

Take the following testcase. On trunk, we inline into caller1 and caller3, but not caller2 and caller4. I think your patch stops inlining in caller1, but not caller3. That seems weird to me...

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

define void @caller1(i8 *%p1, i1 %b) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 -1)
  br label %exit

exit:
  ret void
}

define void @caller2(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 %len)
  br label %exit

exit:
  ret void
}

define void @caller3(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee(i8* %p1, i32 0, i32 300)
  br label %exit

exit:
  ret void
}

define void @callee(i8* %p1, i32 %l1, i32 %l2) {
entry:
  %ext = zext i32 %l2 to i64
  br label %foo
foo:
  %vla = alloca float, i64 %ext, align 16
  %call = call i1 @extern_call(float* nonnull %vla) nounwind 
  br i1 %call, label %foo, label %exit
  
exit:
  ret void
}

define void @caller4(i8 *%p1, i1 %b, i32 %len) {
entry:
  %cond = icmp eq i1 %b, true
  br i1 %cond, label %exit, label %split

split:
  call void @callee2(i8* %p1, i32 0)
  br label %exit

exit:
  ret void
}

define void @callee2(i8* %p1, i32 %l1) {
entry:
  br label %foo
foo:
  %vla = alloca float, i64 300, align 16
  %call = call i1 @extern_call(float* nonnull %vla) nounwind 
  br i1 %call, label %foo, label %exit
  
exit:
  ret void
}

declare i1 @extern_call(float*)

aemerson marked an inline comment as done.Jun 23 2020, 7:42 PM

aemerson added inline comments.

llvm/lib/Analysis/InlineCost.cpp
838	At first I thought that was because the 300 value in caller3 was below the threshold for preventing inlining, which it is. But there's actually a bug here where the patch isn't scaling the allocation size by the size of the type. That's why it still allows inlining even if you change it to 30000. I'll fix that.

mtrofin added inline comments.Jun 23 2020, 7:48 PM

llvm/lib/Analysis/InlineCost.cpp
845	If profiling indicates the callsite is hot, then we know that a) there's no runtime problem, and b) we would want to inline. Yes, that can then surface the problem because of the behavior in InlineFunction, my point is to illustrate that it's not clear cut we wouldn't want to inline.

jfb added inline comments.Jun 23 2020, 8:42 PM

llvm/lib/Analysis/InlineCost.cpp
845	I think profiling is important to take into account, but I think you're wrong on one point: say we have this function: void f(int a, int b) { if (a < 1000) g(); if (b < 1000) h(); } Imagine profiling shows both are hot. Imagine `g` and `h` have very large stacks. We'll now inline them, hoist them to the top of `f`, and in many cases fail to color the `alloca`s separately. What you said is: "profiling shows there's no runtime problem". That's only true without inlining! With inlining, if the optimizer falls on its face, there's a problem. The bug I describe isn't exactly the one @aemerson is fixing, but it's connected, and one @MadCoder and friends run into very frequently. If we back up: Amara is trying to fix one instance of a quite unfortunate bug. That won't fix the entire problem with LLVM inlining `alloca` and then making a mess of it. But I hope it helps. Let's focus on avoiding what it might regress, and as a follow-up I think we all agree there's much more to be done.

mtrofin added inline comments.Jun 23 2020, 9:21 PM

llvm/lib/Analysis/InlineCost.cpp
845	Yup - inlining would "surface the problem because of the behavior in InlineFunction" - I think we're very much in agreement! If this patch can avoid uninteded perf regressions, or offer a way to avoid them, sgtm! For due diligence, what's the experience today of folks that run into this - refactor code and mark noinline? (Asking to learn, not to minimize what's very likely quite a negative experience)

Add tests and fix not scaling the alloc size by the type size.

aemerson marked an inline comment as done.Jun 24 2020, 11:50 AM

aemerson added inline comments.

llvm/lib/Analysis/InlineCost.cpp
845	At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected. I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer.

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain an dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

llvm/lib/Analysis/InlineCost.cpp
845	At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected. I'm not sure that would catch production cases with profiling information - on the other hand, of course 64K seems large enough. Could you comment on my last question above "what's the experience today... etc" - thanks! I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer. Probably fine - @jfb had a comment on this probably being more of a per-target thing anyway
868	I'd add here a FIXME, detailing what the underlying problem is, and that this is a stop-gap measure which may potentially cause unintended perf regressions - this context would help with maintainability. For good measure, I think a FIXME in InlineFunction detailing the problem.

LGTM

This revision is now accepted and ready to land.Jun 24 2020, 12:28 PM

In D81765#2112281, @mtrofin wrote:

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain an dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

That's a poor description on my part, you can disregard it. If we inlined it but didn't hoist the alloca from the never-executed block, it would still cause the function to have a dynamic alloca present and perhaps negatively impact codegen as Eli alluded to earlier in this thread.

Could you comment on my last question above "what's the experience today... etc" - thanks!

Sorry I forgot to address that. The experience today is that this bug causes an out of stack fault due to trying to allocate 4GB of stack. The workaround is to either disable -fauto-var-init or to change the code such that the optimizer isn't able to SCCP the alloca size to the callee, and therefore prevent the dynamic alloca from being promoted to a static one. E.g. try to add an early exit somewhere so that the optimizer can prove that the alloc amount must always be > 0.

Also I just noticed one of the new tests is missing some CHECK lines, I'll add those before committing.

llvm/lib/Analysis/InlineCost.cpp
868	I don't really have a good example of how this could cause perf regressions. Replying to an earlier point: What if we know (from profiling) it is always executed? Correct me if I'm wrong, but how could a function which blows the stack be always executed? The threshold of 64k was intended to only catch cases which would clearly be crashing if executed. I could bump that up even further 1MB or something to be "safer" from a perf perspective.

Harbormaster completed remote builds in B61597: Diff 273114.Jun 24 2020, 2:07 PM

In D81765#2112502, @aemerson wrote:

In D81765#2112281, @mtrofin wrote:

Not sure I follow this in the patch description:
"Avoid inlining functions if this would result, as even if we didn't hoist the alloca it would remain an dynamic alloca in the caller body."

What would be different between: inlining and not hoisting the alloca, or not inlining?

That's a poor description on my part, you can disregard it. If we inlined it but didn't hoist the alloca from the never-executed block, it would still cause the function to have a dynamic alloca present and perhaps negatively impact codegen as Eli alluded to earlier in this thread.

Could you update it accordingly - thanks!

Could you comment on my last question above "what's the experience today... etc" - thanks!

Sorry I forgot to address that. The experience today is that this bug causes an out of stack fault due to trying to allocate 4GB of stack. The workaround is to either disable -fauto-var-init or to change the code such that the optimizer isn't able to SCCP the alloca size to the callee, and therefore prevent the dynamic alloca from being promoted to a static one. E.g. try to add an early exit somewhere so that the optimizer can prove that the alloc amount must always be > 0.

Also I just noticed one of the new tests is missing some CHECK lines, I'll add those before committing.

llvm/lib/Analysis/InlineCost.cpp
868	Sure, for 64K that won't happen. I'm more worried about how this code will evolve - unless a more complete fix makes its way in first. That's why I think it'd be valuable to capture in a FIXME the whole context of what this code does, including that, if that threshold is dropped, it could lead to unintentional perf regressions.

dexonsmith removed a subscriber: dexonsmith.Jun 24 2020, 3:19 PM

Add some CHECK lines, add FIXME comment, update description/commit message.

I'm going to commit this now assuming that all feedback has been addressed, let me know if there's anything else needed.

Herald added a subscriber: kristof.beyls. · View Herald TranscriptJun 24 2020, 5:25 PM

Closed by commit rG090c108d04e2: Don't inline dynamic allocas that simplify to huge static allocas. (authored by aemerson). · Explain WhyJun 24 2020, 5:57 PM

This revision was automatically updated to reflect the committed changes.

Harbormaster failed remote builds in B61647: Diff 273194!Jun 24 2020, 5:57 PM

Revision Contents

Path

Size

llvm/

include/

llvm/

Analysis/

InlineCost.h

3 lines

lib/

Analysis/

InlineCost.cpp

12 lines

test/

Transforms/

Inline/

dynamic-alloca-simplified-large.ll

175 lines

Diff 273202

llvm/include/llvm/Analysis/InlineCost.h

	Show First 20 Lines • Show All 43 Lines • ▼ Show 20 Lines
	const int InstrCost = 5;			const int InstrCost = 5;
	const int IndirectCallThreshold = 100;			const int IndirectCallThreshold = 100;
	const int CallPenalty = 25;			const int CallPenalty = 25;
	const int LastCallToStaticBonus = 15000;			const int LastCallToStaticBonus = 15000;
	const int ColdccPenalty = 2000;			const int ColdccPenalty = 2000;
	/// Do not inline functions which allocate this many bytes on the stack			/// Do not inline functions which allocate this many bytes on the stack
	/// when the caller is recursive.			/// when the caller is recursive.
	const unsigned TotalAllocaSizeRecursiveCaller = 1024;			const unsigned TotalAllocaSizeRecursiveCaller = 1024;
				/// Do not inline dynamic allocas that have been constant propagated to be
				/// static allocas above this amount in bytes.
				const uint64_t MaxSimplifiedDynamicAllocaToInline = 65536;
				jfbUnsubmitted Not Done Reply Inline Actions @MadCoder does this size limit make sense to you, or would you recommend something smaller? jfb: @MadCoder does this size limit make sense to you, or would you recommend something smaller?
				MadCoderUnsubmitted Not Done Reply Inline Actions 64k is wayyyy too large. a kernel stack is 16k. and we've seen that cause explosions in ioctl() kind of syscalls that have a large switch going in their various implementations where one of the functions has a large stack buffer and is a leaf and another path has no stack allocations but will go in a deep stack. the threshold for the kernel is likely something like 1 or 2k MadCoder: 64k is wayyyy too large. a kernel stack is 16k. and we've seen that cause explosions in ioctl…
				mtrofinUnsubmitted Not Done Reply Inline Actions Nit: if this stays 65536, should it be typed uint32_t?? unsigned's size may not always be 4 bytes mtrofin: Nit: if this stays 65536, should it be typed uint32_t?? unsigned's size may not always be 4…
				mtrofinUnsubmitted Done Reply Inline Actions I think MaxSimplifiedDynamicAllocaToMove should be part of InlineParams, since it's a threshold. mtrofin: I think MaxSimplifiedDynamicAllocaToMove should be part of InlineParams, since it's a threshold.
				jfbUnsubmitted Not Done Reply Inline Actions It probably needs to be a per-target thing, since we want different values for userspace and kernel. jfb: It probably needs to be a per-target thing, since we want different values for userspace and…
				efriedmaUnsubmitted Not Done Reply Inline Actions Rather than add a threshold for a single dynamic alloca, it might make sense to just pay attention to the overall size of the stack of the inlined function. We currently have a threshold TotalAllocaSizeRecursiveCaller which applies to recursive inlining; we could apply a similar threshold to inlining in general. That would avoid potential weird interactions with other passes about whether the alloca is actually dynamic. It might make sense to let the user control that threshold; users running code on small stacks might care more about the stack size. But really, I'm not sure there's a great way to solve "how much stack memory should we speculatively allocate" if we can't see the whole callstack. But really, I'm fine with this patch as-is as a spot fix for "I accidentally allocated 2^34 bytes in dead code". We probably want to consider the general issue of stack usage more carefully. efriedma: Rather than add a threshold for a single dynamic alloca, it might make sense to just pay…
				aemersonAuthorUnsubmitted Done Reply Inline Actions It probably needs to be a per-target thing, since we want different values for userspace and kernel. For this specific bug any reasonable value is going to work to catch a -1 size allocation. Like Eli says I feel generally controlling the stack allocation size is a much broader set of changes than this one. aemerson: > It probably needs to be a per-target thing, since we want different values for userspace and…
	} // namespace InlineConstants			} // namespace InlineConstants

	/// Represents the cost of inlining a function.			/// Represents the cost of inlining a function.
	///			///
	/// This supports special values for functions which should "always" or			/// This supports special values for functions which should "always" or
	/// "never" be inlined. Otherwise, the cost represents a unitless amount;			/// "never" be inlined. Otherwise, the cost represents a unitless amount;
	/// smaller values increase the likelihood of the function being inlined.			/// smaller values increase the likelihood of the function being inlined.
	///			///
	▲ Show 20 Lines • Show All 227 Lines • Show Last 20 Lines

llvm/lib/Analysis/InlineCost.cpp

Show First 20 Lines • Show All 829 Lines • ▼ Show 20 Lines	bool CallAnalyzer::accumulateGEPOffset(GEPOperator &GEP, APInt &Offset) {
}		}
return true;		return true;
}		}

/// Use TTI to check whether a GEP is free.		/// Use TTI to check whether a GEP is free.
///		///
/// Respects any simplified values known during the analysis of this callsite.		/// Respects any simplified values known during the analysis of this callsite.
bool CallAnalyzer::isGEPFree(GetElementPtrInst &GEP) {		bool CallAnalyzer::isGEPFree(GetElementPtrInst &GEP) {
SmallVector<Value *, 4> Operands;		SmallVector<Value *, 4> Operands;
		efriedmaUnsubmitted Not Done Reply Inline Actions Hhow is the handling for array allocation supposed to work if the alloca isn't in the entry block of the function? It looks like this code assumes that constant-propagation is enough to turn any dynamic alloca into a static alloca? Not really a problem with this change, exactly, but I want to make sure I'm understanding the surrounding code correctly. efriedma: Hhow is the handling for array allocation supposed to work if the alloca isn't in the entry…
		aemersonAuthorUnsubmitted Done Reply Inline Actions I'm not entirely sure what you mean, but if an array allocation isn't in the entry block, and if constant prop can't simplify it to a static alloca, then it remains a dynamic alloca and the inliner gives up (see the end of this function). aemerson: I'm not entirely sure what you mean, but if an array allocation isn't in the entry block, and…
		efriedmaUnsubmitted Not Done Reply Inline Actions My question is what happens if an array allocation isn't in the entry block, but we can simplify the size to a constant. efriedma: My question is what happens if an array allocation isn't in the entry block, but we can…
		aemersonAuthorUnsubmitted Done Reply Inline Actions That was the original motivation for this fix actually. If its simplified to a constant, then it can be converted to a static alloca and then later moved into the entry block. If that constant allocation is too large then we can get overflows. aemerson: That was the original motivation for this fix actually. If its simplified to a constant, then…
		efriedmaUnsubmitted Not Done Reply Inline Actions Take the following testcase. On trunk, we inline into caller1 and caller3, but not caller2 and caller4. I think your patch stops inlining in caller1, but not caller3. That seems weird to me... target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx10.15.0" define void @caller1(i8 %p1, i1 %b) { entry: %cond = icmp eq i1 %b, true br i1 %cond, label %exit, label %split split: call void @callee(i8 %p1, i32 0, i32 -1) br label %exit exit: ret void } define void @caller2(i8 %p1, i1 %b, i32 %len) { entry: %cond = icmp eq i1 %b, true br i1 %cond, label %exit, label %split split: call void @callee(i8 %p1, i32 0, i32 %len) br label %exit exit: ret void } define void @caller3(i8 %p1, i1 %b, i32 %len) { entry: %cond = icmp eq i1 %b, true br i1 %cond, label %exit, label %split split: call void @callee(i8 %p1, i32 0, i32 300) br label %exit exit: ret void } define void @callee(i8* %p1, i32 %l1, i32 %l2) { entry: %ext = zext i32 %l2 to i64 br label %foo foo: %vla = alloca float, i64 %ext, align 16 %call = call i1 @extern_call(float* nonnull %vla) nounwind br i1 %call, label %foo, label %exit exit: ret void } define void @caller4(i8 %p1, i1 %b, i32 %len) { entry: %cond = icmp eq i1 %b, true br i1 %cond, label %exit, label %split split: call void @callee2(i8 %p1, i32 0) br label %exit exit: ret void } define void @callee2(i8* %p1, i32 %l1) { entry: br label %foo foo: %vla = alloca float, i64 300, align 16 %call = call i1 @extern_call(float* nonnull %vla) nounwind br i1 %call, label %foo, label %exit exit: ret void } declare i1 @extern_call(float) efriedma:* Take the following testcase. On trunk, we inline into caller1 and caller3, but not caller2 and…
		aemersonAuthorUnsubmitted Done Reply Inline Actions At first I thought that was because the 300 value in caller3 was below the threshold for preventing inlining, which it is. But there's actually a bug here where the patch isn't scaling the allocation size by the size of the type. That's why it still allows inlining even if you change it to 30000. I'll fix that. aemerson: At first I thought that was because the 300 value in caller3 was below the threshold for…
Operands.push_back(GEP.getOperand(0));		Operands.push_back(GEP.getOperand(0));
for (User::op_iterator I = GEP.idx_begin(), E = GEP.idx_end(); I != E; ++I)		for (User::op_iterator I = GEP.idx_begin(), E = GEP.idx_end(); I != E; ++I)
if (Constant SimpleOp = SimplifiedValues.lookup(I))		if (Constant SimpleOp = SimplifiedValues.lookup(I))
Operands.push_back(SimpleOp);		Operands.push_back(SimpleOp);
else		else
Operands.push_back(*I);		Operands.push_back(*I);
return TargetTransformInfo::TCC_Free ==		return TargetTransformInfo::TCC_Free ==
		mtrofinUnsubmitted Not Done Reply Inline Actions I think that the decision here needs to be mindful of profiling information (if present). The hypothesis in the patch is that the call site may never be executed. What if we know (from profiling) it is always executed? Also, could you provide more insight into the scenario generating this - thanks! mtrofin: I think that the decision here needs to be mindful of profiling information (if present). The…
		aemersonAuthorUnsubmitted Done Reply Inline Actions What if we know (from profiling) it is always executed? IMO then that function is simply allocating a huge amount on the stack. I see that in a similar way to TotalAllocaSizeRecursiveCaller being triggered, in that case we don't want to inline either, no matter what profiling information says, right? Also, could you provide more insight into the scenario generating this - thanks! This is code generated as a result of the `-ftrivial-auto-var-init=pattern` option, which combined with user code that assumes that VLAs will not be speculatively created with -1 (2^32) size in the entry block. If you run the test case in this patch you will see the resulting alloca which is guaranteed to crash on any reasonable platform. aemerson: > What if we know (from profiling) it is always executed? IMO then that function is simply…
		mtrofinUnsubmitted Not Done Reply Inline Actions If profiling indicates the callsite is hot, then we know that a) there's no runtime problem, and b) we would want to inline. Yes, that can then surface the problem because of the behavior in InlineFunction, my point is to illustrate that it's not clear cut we wouldn't want to inline. mtrofin: If profiling indicates the callsite is hot, then we know that a) there's no runtime problem…
		jfbUnsubmitted Not Done Reply Inline Actions I think profiling is important to take into account, but I think you're wrong on one point: say we have this function: void f(int a, int b) { if (a < 1000) g(); if (b < 1000) h(); } Imagine profiling shows both are hot. Imagine `g` and `h` have very large stacks. We'll now inline them, hoist them to the top of `f`, and in many cases fail to color the `alloca`s separately. What you said is: "profiling shows there's no runtime problem". That's only true without inlining! With inlining, if the optimizer falls on its face, there's a problem. The bug I describe isn't exactly the one @aemerson is fixing, but it's connected, and one @MadCoder and friends run into very frequently. If we back up: Amara is trying to fix one instance of a quite unfortunate bug. That won't fix the entire problem with LLVM inlining `alloca` and then making a mess of it. But I hope it helps. Let's focus on avoiding what it might regress, and as a follow-up I think we all agree there's much more to be done. jfb: I think profiling is important to take into account, but I think you're wrong on one point: say…
		mtrofinUnsubmitted Not Done Reply Inline Actions Yup - inlining would "surface the problem because of the behavior in InlineFunction" - I think we're very much in agreement! If this patch can avoid uninteded perf regressions, or offer a way to avoid them, sgtm! For due diligence, what's the experience today of folks that run into this - refactor code and mark noinline? (Asking to learn, not to minimize what's very likely quite a negative experience) mtrofin: Yup - inlining would "surface the problem because of the behavior in InlineFunction" - I think…
		aemersonAuthorUnsubmitted Done Reply Inline Actions At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected. I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer. aemerson: At the current threshold of 64k (open to changing that but serves the purpose for this bug fix)…
		mtrofinUnsubmitted Not Done Reply Inline Actions At the current threshold of 64k (open to changing that but serves the purpose for this bug fix), this code never fires across a build of the LLVM test suite for arm64 at -Os + LTO, which is expected. I'm not sure that would catch production cases with profiling information - on the other hand, of course 64K seems large enough. Could you comment on my last question above "what's the experience today... etc" - thanks! I also tried to move the constant to InlineParams but I'm not familiar enough with the inliner to see how I could access InlineParams from within the CallAnalyzer. Probably fine - @jfb had a comment on this probably being more of a per-target thing anyway mtrofin: > At the current threshold of 64k (open to changing that but serves the purpose for this bug…
TTI.getUserCost(&GEP, Operands,		TTI.getUserCost(&GEP, Operands,
TargetTransformInfo::TCK_SizeAndLatency);		TargetTransformInfo::TCK_SizeAndLatency);
}		}

bool CallAnalyzer::visitAlloca(AllocaInst &I) {		bool CallAnalyzer::visitAlloca(AllocaInst &I) {
// Check whether inlining will turn a dynamic alloca into a static		// Check whether inlining will turn a dynamic alloca into a static
// alloca and handle that case.		// alloca and handle that case.
if (I.isArrayAllocation()) {		if (I.isArrayAllocation()) {
Constant *Size = SimplifiedValues.lookup(I.getArraySize());		Constant *Size = SimplifiedValues.lookup(I.getArraySize());
if (auto *AllocSize = dyn_cast_or_null<ConstantInt>(Size)) {		if (auto *AllocSize = dyn_cast_or_null<ConstantInt>(Size)) {
		// Sometimes a dynamic alloca could be converted into a static alloca
		// after this constant prop, and become a huge static alloca on an
		// unconditional CFG path. Avoid inlining if this is going to happen above
		// a threshold.
		// FIXME: If the threshold is removed or lowered too much, we could end up
		// being too pessimistic and prevent inlining non-problematic code. This
		// could result in unintended perf regressions. A better overall strategy
		// is needed to track stack usage during inlining.
Type *Ty = I.getAllocatedType();		Type *Ty = I.getAllocatedType();
AllocatedSize = SaturatingMultiplyAdd(		AllocatedSize = SaturatingMultiplyAdd(
AllocSize->getLimitedValue(), DL.getTypeAllocSize(Ty).getFixedSize(),		AllocSize->getLimitedValue(), DL.getTypeAllocSize(Ty).getFixedSize(),
AllocatedSize);		AllocatedSize);
		if (AllocatedSize > InlineConstants::MaxSimplifiedDynamicAllocaToInline) {
		mtrofinUnsubmitted Not Done Reply Inline Actions I'd add here a FIXME, detailing what the underlying problem is, and that this is a stop-gap measure which may potentially cause unintended perf regressions - this context would help with maintainability. For good measure, I think a FIXME in InlineFunction detailing the problem. mtrofin: I'd add here a FIXME, detailing what the underlying problem is, and that this is a stop-gap…
		aemersonAuthorUnsubmitted Done Reply Inline Actions I don't really have a good example of how this could cause perf regressions. Replying to an earlier point: What if we know (from profiling) it is always executed? Correct me if I'm wrong, but how could a function which blows the stack be always executed? The threshold of 64k was intended to only catch cases which would clearly be crashing if executed. I could bump that up even further 1MB or something to be "safer" from a perf perspective. aemerson: I don't really have a good example of how this could cause perf regressions. Replying to an…
		mtrofinUnsubmitted Not Done Reply Inline Actions Sure, for 64K that won't happen. I'm more worried about how this code will evolve - unless a more complete fix makes its way in first. That's why I think it'd be valuable to capture in a FIXME the whole context of what this code does, including that, if that threshold is dropped, it could lead to unintentional perf regressions. mtrofin: Sure, for 64K that won't happen. I'm more worried about how this code will evolve - unless a…
		HasDynamicAlloca = true;
		return false;
		}
return Base::visitAlloca(I);		return Base::visitAlloca(I);
}		}
}		}

// Accumulate the allocated size.		// Accumulate the allocated size.
if (I.isStaticAlloca()) {		if (I.isStaticAlloca()) {
Type *Ty = I.getAllocatedType();		Type *Ty = I.getAllocatedType();
AllocatedSize =		AllocatedSize =
▲ Show 20 Lines • Show All 1,701 Lines • Show Last 20 Lines

llvm/test/Transforms/Inline/dynamic-alloca-simplified-large.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -inline < %s -S -o - \| FileCheck %s
				target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-apple-macosx10.15.0"

				define void @caller1(i8 *%p1, i1 %b) {
				; CHECK-LABEL: @caller1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[COND:%.]] = icmp eq i1 [[B:%.]], true
				; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.]], label [[SPLIT:%.]]
				; CHECK: split:
				; CHECK-NEXT: call void @callee(i8* [[P1:%.*]], i32 0, i32 -1)
				; CHECK-NEXT: br label [[EXIT]]
				; CHECK: exit:
				; CHECK-NEXT: ret void
				;
				entry:
				%cond = icmp eq i1 %b, true
				br i1 %cond, label %exit, label %split

				split:
				; This path may be generated from CS splitting and never taken at runtime.
				call void @callee(i8* %p1, i32 0, i32 -1)
				br label %exit

				exit:
				ret void
				}

				define void @callee(i8* %p1, i32 %l1, i32 %l2) {
				entry:
				%ext = zext i32 %l2 to i64
				%vla = alloca float, i64 %ext, align 16
				call void @extern_call(float* nonnull %vla) #3
				ret void
				}


				define void @caller2_below_threshold(i8 *%p1, i1 %b) {
				; CHECK-LABEL: @caller2_below_threshold(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[VLA_I:%.*]] = alloca float, i64 15000, align 16
				; CHECK-NEXT: [[COND:%.]] = icmp eq i1 [[B:%.]], true
				; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.]], label [[SPLIT:%.]]
				; CHECK: split:
				; CHECK-NEXT: [[SAVEDSTACK:%.]] = call i8 @llvm.stacksave()
				; CHECK-NEXT: [[TMP0:%.]] = bitcast float [[VLA_I]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.start.p0i8(i64 60000, i8* [[TMP0]])
				; CHECK-NEXT: call void @extern_call(float* nonnull [[VLA_I]]) #2
				; CHECK-NEXT: [[TMP1:%.]] = bitcast float [[VLA_I]] to i8*
				; CHECK-NEXT: call void @llvm.lifetime.end.p0i8(i64 60000, i8* [[TMP1]])
				; CHECK-NEXT: call void @llvm.stackrestore(i8* [[SAVEDSTACK]])
				; CHECK-NEXT: br label [[EXIT]]
				; CHECK: exit:
				; CHECK-NEXT: ret void
				;
				entry:
				%cond = icmp eq i1 %b, true
				br i1 %cond, label %exit, label %split

				split:
				call void @callee(i8* %p1, i32 0, i32 15000)
				br label %exit

				exit:
				ret void
				}

				define void @callee2_not_in_entry(i8* %p1, i32 %l1, i32 %l2) {
				entry:
				%ext = zext i32 %l2 to i64
				%c = icmp eq i32 %l1, 42
				br i1 %c, label %bb2, label %bb3
				bb2:
				%vla = alloca float, i64 %ext, align 16
				call void @extern_call(float* nonnull %vla) #3
				ret void
				bb3:
				ret void
				}

				define void @caller3_alloca_not_in_entry(i8 *%p1, i1 %b) {
				; CHECK-LABEL: @caller3_alloca_not_in_entry(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[COND:%.]] = icmp eq i1 [[B:%.]], true
				; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.]], label [[SPLIT:%.]]
				; CHECK: split:
				; CHECK-NEXT: br label [[EXIT]]
				; CHECK: exit:
				; CHECK-NEXT: ret void
				;
				entry:
				%cond = icmp eq i1 %b, true
				br i1 %cond, label %exit, label %split

				split:
				call void @callee2_not_in_entry(i8* %p1, i32 0, i32 -1)
				br label %exit

				exit:
				ret void
				}

				define void @caller4_over_threshold(i8 *%p1, i1 %b, i32 %len) {
				; CHECK-LABEL: @caller4_over_threshold(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[COND:%.]] = icmp eq i1 [[B:%.]], true
				; CHECK-NEXT: br i1 [[COND]], label [[EXIT:%.]], label [[SPLIT:%.]]
				; CHECK: split:
				; CHECK-NEXT: call void @callee(i8* [[P1:%.*]], i32 0, i32 16500)
				; CHECK-NEXT: br label [[EXIT]]
				; CHECK: exit:
				; CHECK-NEXT: ret void
				;
				entry:
				%cond = icmp eq i1 %b, true
				br i1 %cond, label %exit, label %split

				split:
				call void @callee(i8* %p1, i32 0, i32 16500)
				br label %exit

				exit:
				ret void
				}

				declare noalias i8* @malloc(i64)
				define i8* @stack_allocate(i32 %size) #2 {
				entry:
				%cmp = icmp ult i32 %size, 100
				%conv = zext i32 %size to i64
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%0 = alloca i8, i64 %conv, align 8
				br label %return

				if.end: ; preds = %entry
				%call = tail call i8* @malloc(i64 %conv) #3
				br label %return

				return: ; preds = %if.end, %if.then
				%retval.0 = phi i8* [ %0, %if.then ], [ %call, %if.end ]
				ret i8* %retval.0
				}

				define i8* @test_stack_allocate_always(i32 %size) {
				; CHECK-LABEL: @test_stack_allocate_always(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SAVEDSTACK:%.]] = call i8 @llvm.stacksave()
				; CHECK-NEXT: [[CMP_I:%.]] = icmp ult i32 [[SIZE:%.]], 100
				; CHECK-NEXT: [[CONV_I:%.*]] = zext i32 [[SIZE]] to i64
				; CHECK-NEXT: br i1 [[CMP_I]], label [[IF_THEN_I:%.]], label [[IF_END_I:%.]]
				; CHECK: if.then.i:
				; CHECK-NEXT: [[TMP0:%.*]] = alloca i8, i64 [[CONV_I]], align 8
				; CHECK-NEXT: br label [[STACK_ALLOCATE_EXIT:%.*]]
				; CHECK: if.end.i:
				; CHECK-NEXT: [[CALL_I:%.]] = tail call i8 @malloc(i64 [[CONV_I]]) #2
				; CHECK-NEXT: br label [[STACK_ALLOCATE_EXIT]]
				; CHECK: stack_allocate.exit:
				; CHECK-NEXT: [[RETVAL_0_I:%.]] = phi i8 [ [[TMP0]], [[IF_THEN_I]] ], [ [[CALL_I]], [[IF_END_I]] ]
				; CHECK-NEXT: call void @llvm.stackrestore(i8* [[SAVEDSTACK]])
				; CHECK-NEXT: ret i8* [[RETVAL_0_I]]
				;
				entry:
				%call = tail call i8* @stack_allocate(i32 %size)
				ret i8* %call
				}

				declare void @extern_call(float*)

				attributes #1 = { argmemonly nounwind willreturn writeonly }
				attributes #2 = { alwaysinline }
				attributes #3 = { nounwind }

This is an archive of the discontinued LLVM Phabricator instance.

Don't inline dynamic allocas that simplify to huge static allocas.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 273202

llvm/include/llvm/Analysis/InlineCost.h

llvm/lib/Analysis/InlineCost.cpp

llvm/test/Transforms/Inline/dynamic-alloca-simplified-large.ll

Don't inline dynamic allocas that simplify to huge static allocas.
ClosedPublic