This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Inlining should honor nobuiltin attributes
ClosedPublic

Authored by tejohnson on Feb 6 2020, 1:36 PM.

Details

Summary

Final patch in series to fix inlining between functions with different
nobuiltin attributes/options, which was specifically an issue in LTO.
See discussion on D61634 for background.

The prior patch in this series (D67923) enabled per-Function TLI
construction that identified the nobuiltin attributes.

Here I have allowed inlining to proceed if the callee's nobuiltins are a
subset of the caller's nobuiltins, but not in the reverse case, which
should be conservatively correct.

Diff Detail

Event Timeline

tejohnson created this revision.Feb 6 2020, 1:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2020, 1:36 PM

Apart from my other comment LGTM

llvm/lib/Transforms/IPO/PartialInlining.cpp
393

I'm having a hard time convincing myself that the lifetime requirements are correct here.
Passing a local variable GetTLI by address in return statement looks fishy.

It's similar to GetTTI so is seems correct, it's just hard to tell by looking at the code.

Same above and below.

nhaehnle removed a subscriber: nhaehnle.Feb 7 2020, 3:18 AM
tejohnson marked an inline comment as done.Feb 7 2020, 6:49 AM
tejohnson added inline comments.
llvm/lib/Transforms/IPO/PartialInlining.cpp
393

What's being returned is the bool result of the run() call, not the PartialInlinerImpl object, which doesn't survive past this function and therefore the GetTLI scope.

gchatelet accepted this revision.Feb 7 2020, 7:13 AM
gchatelet marked an inline comment as done.

Let's wait a bit for other reviewers to comment.

llvm/lib/Transforms/IPO/PartialInlining.cpp
393

Ha right, I got confused by the formatting and missed the run() on next line.

This revision is now accepted and ready to land.Feb 7 2020, 7:13 AM

@davidxl David can you take a look at it from the inliner side and let me know if it looks ok?

davidxl added inline comments.Feb 27 2020, 10:03 AM
llvm/include/llvm/Analysis/TargetLibraryInfo.h
267

This may be bad for performance -- as the inline instance will be optimized differently.

llvm/test/Transforms/Inline/X86/inline-no-builtin-compatible.ll
6 ↗(On Diff #242994)

Is this test x86 specific?

23 ↗(On Diff #242994)

why not directly check-not call?

tejohnson marked 3 inline comments as done.Feb 27 2020, 10:13 AM
tejohnson added inline comments.
llvm/include/llvm/Analysis/TargetLibraryInfo.h
267

Note that it won't be worse than head, which doesn't restrict the inlines based on nobuiltin attributes at all. We could also just disallow inlining completely between callers/callees with different nobuiltin attributes. But I was concerned that this would degrade performance too much by disallowing inlining in too many cases.

llvm/test/Transforms/Inline/X86/inline-no-builtin-compatible.ll
6 ↗(On Diff #242994)

Actually not, I can move to parent directory.

23 ↗(On Diff #242994)

Yeah that would be better, will change.

Address comments

davidxl added inline comments.Feb 27 2020, 10:20 AM
llvm/include/llvm/Analysis/TargetLibraryInfo.h
267

Perhaps add an additional parameter to the interface to allow superset behavior. Then in the inlineCost.cpp, add an internal option to specify whether strict attribute matching is required -- the default can be the current behavior -- allow inlining into superset.

tejohnson marked an inline comment as done.Feb 27 2020, 9:51 PM
tejohnson updated this revision to Diff 247165.Feb 27 2020, 9:53 PM

Add internal option to control superset check and test it

davidxl accepted this revision.Feb 27 2020, 9:57 PM

lgtm

gchatelet added inline comments.Feb 28 2020, 1:11 AM
llvm/include/llvm/Analysis/TargetLibraryInfo.h
267

Note that it won't be worse than head, which doesn't restrict the inlines based on nobuiltin attributes at all. We could also just disallow inlining completely between callers/callees with different nobuiltin attributes. But I was concerned that this would degrade performance too much by disallowing inlining in too many cases.

I agree disallowing inlining completely when nobuiltin differ would prevent inlining of basic memory functions entirely (memset, memcpy, etc..)

This revision was automatically updated to reflect the committed changes.

I was discussing this patch with @arsenm after investigating why the calls to libc functions weren't being inlined on the GPU. What was the motivation for the original change? I figured it would make no difference if we inlined a function that doesn't allow built-ins into one that does. This is problematic because it prevents us from inlining any function from the LLVM C library under LTO. I'd like a way to at least work around this in the short term, but I'd like some context for why this was introduced.

Herald added a project: Restricted Project. · View Herald TranscriptJul 17 2023, 3:29 PM
Herald added subscribers: hoy, wlei, ormris and 3 others. · View Herald Transcript

I was discussing this patch with @arsenm after investigating why the calls to libc functions weren't being inlined on the GPU. What was the motivation for the original change? I figured it would make no difference if we inlined a function that doesn't allow built-ins into one that does. This is problematic because it prevents us from inlining any function from the LLVM C library under LTO. I'd like a way to at least work around this in the short term, but I'd like some context for why this was introduced.

@gchatelet to add more context as this was related to work he was doing for custom implementations of libc memcpy functions.

Context is in D61634. See specifically the discussion from https://reviews.llvm.org/D61634#1502201 to https://reviews.llvm.org/D61634#1512020 which discusses the inliner changes for the new attribute. IIRC the issue is that allowing such inlining would lose the no-builtin attributes, which would defeat their purpose.

I was discussing this patch with @arsenm after investigating why the calls to libc functions weren't being inlined on the GPU. What was the motivation for the original change? I figured it would make no difference if we inlined a function that doesn't allow built-ins into one that does. This is problematic because it prevents us from inlining any function from the LLVM C library under LTO. I'd like a way to at least work around this in the short term, but I'd like some context for why this was introduced.

@gchatelet to add more context as this was related to work he was doing for custom implementations of libc memcpy functions.

Context is in D61634. See specifically the discussion from https://reviews.llvm.org/D61634#1502201 to https://reviews.llvm.org/D61634#1512020 which discusses the inliner changes for the new attribute. IIRC the issue is that allowing such inlining would lose the no-builtin attributes, which would defeat their purpose.

Losing the ability to inline any function implemented in an LTO library compiled with -ffreestanding seems like a very bad tradeoff. I was talking with @arsenm about this and the attribute seems undocumented and sparsely tested. What was the specific failure case that this was introduced to solve? We lose that attribute, but is that a bad thing? If we inline a function that cannot call builtins into a function that can, what is the issue with calling builtins at that point?

I was discussing this patch with @arsenm after investigating why the calls to libc functions weren't being inlined on the GPU. What was the motivation for the original change? I figured it would make no difference if we inlined a function that doesn't allow built-ins into one that does. This is problematic because it prevents us from inlining any function from the LLVM C library under LTO. I'd like a way to at least work around this in the short term, but I'd like some context for why this was introduced.

@gchatelet to add more context as this was related to work he was doing for custom implementations of libc memcpy functions.

Context is in D61634. See specifically the discussion from https://reviews.llvm.org/D61634#1502201 to https://reviews.llvm.org/D61634#1512020 which discusses the inliner changes for the new attribute. IIRC the issue is that allowing such inlining would lose the no-builtin attributes, which would defeat their purpose.

Losing the ability to inline any function implemented in an LTO library compiled with -ffreestanding seems like a very bad tradeoff. I was talking with @arsenm about this and the attribute seems undocumented and sparsely tested. What was the specific failure case that this was introduced to solve? We lose that attribute, but is that a bad thing? If we inline a function that cannot call builtins into a function that can, what is the issue with calling builtins at that point?

Adding @sivachandra for the LLVM libc part.

The gist of the issue is described in the RFC.

A few considerations first:

  • LLVM libc is not written in assembly because we want compiler support for optimizations, sanitizers and fuzzers.
  • Compiler is able to recognize C/C++ constructs and turn them into library calls unless we use the dedicated -fno-builtin-* compiler flags.
  • -ffreestanding implies -fno-builtins (i.e., all builtins).

Now let's consider the case of a foo function calling memcpy where we allow inlining of functions with different attributes.
foo is compiled without any of -ffreestanding/-fno-builtin-*.
During LTO, the compiler decides to inline all of the code of memcpy within foo, discarding memcpy's -fno-builtin-memcpy flag (stored as a function attribute).
Deep down memcpy's implementation the compiler recognizes a copy loop and turns it into the @llvm.memcpy IR intrinsic.
The backend decides that this intrinsic is best implemented by calling libc's memcpy and inserts a call.
Now memcpy is implemented by a call to itself and will eventually stack overflow or infinite loop.
This happened in production.

Now for memcpy in particular we have introduced __builtin_memcpy_inline (doc) which allows us to convey the semantics of "fixed size operations that always generate code". They are guaranteed to never "call to the libc" (godbolt). Unfortunately those intrinsics are not supported in GCC so this would be clang only. Also they don't yet cover memcmp and bcmp.

Today only memcpy relies completely on __builtin_memcpy_inline (when compiled with clang) so technically this function could be compiled without -ffreestanding nor -fno-builtin and thus inlined during LTO.

Now it is very desirable that the libc functions can be inlined through LTO, we may be able to reach this state when we have strong guarantees that the compiler will not generate libc calls from within the libc function themselves. We're there for memcpy which really is the most fundamental libc function. But we're not there for most of the libc.

Note: for the reasons described above we had to implement the no-builtin flags as function attributes so they can stick to code even during LTO.

Adding @sivachandra for the LLVM libc part.

The gist of the issue is described in the RFC.

A few considerations first:

  • LLVM libc is not written in assembly because we want compiler support for optimizations, sanitizers and fuzzers.
  • Compiler is able to recognize C/C++ constructs and turn them into library calls unless we use the dedicated -fno-builtin-* compiler flags.
  • -ffreestanding implies -fno-builtins (i.e., all builtins).

Now let's consider the case of a foo function calling memcpy where we allow inlining of functions with different attributes.
foo is compiled without any of -ffreestanding/-fno-builtin-*.
During LTO, the compiler decides to inline all of the code of memcpy within foo, discarding memcpy's -fno-builtin-memcpy flag (stored as a function attribute).
Deep down memcpy's implementation the compiler recognizes a copy loop and turns it into the @llvm.memcpy IR intrinsic.
The backend decides that this intrinsic is best implemented by calling libc's memcpy and inserts a call.
Now memcpy is implemented by a call to itself and will eventually stack overflow or infinite loop.
This happened in production.

Now for memcpy in particular we have introduced __builtin_memcpy_inline (doc) which allows us to convey the semantics of "fixed size operations that always generate code". They are guaranteed to never "call to the libc" (godbolt). Unfortunately those intrinsics are not supported in GCC so this would be clang only. Also they don't yet cover memcmp and bcmp.

Today only memcpy relies completely on __builtin_memcpy_inline (when compiled with clang) so technically this function could be compiled without -ffreestanding nor -fno-builtin and thus inlined during LTO.

Now it is very desirable that the libc functions can be inlined through LTO, we may be able to reach this state when we have strong guarantees that the compiler will not generate libc calls from within the libc function themselves. We're there for memcpy which really is the most fundamental libc function. But we're not there for most of the libc.

Note: for the reasons described above we had to implement the no-builtin flags as function attributes so they can stick to code even during LTO.

Thanks for the background. Was there a reason that we could not simply merge the nobuiltin attribute into the caller? By the time we get to this point it's likely that we've already run optimizations on the TUs individually so I'd guess that we wouldn't lose any potential optimizations more severe than not being able to inline the function in the first place.

The backend decides that this intrinsic is best implemented by calling libc's memcpy and inserts a call.

I looked into this and it's just a straight SelectionDAG bug. DAG.getMemcpy tries expanding inline depending on target size thresholds and custom instructions, but ultimately unconditionally emits an illegal libcall. It needs to consider the function availability. I've already "fixed" this particular issue in 3c848194f28decca41b7362f9dd35d4939797724. This is an IR pass to work around SelectionDAG limitations, the DAG cannot directly emit a loop. Really the DAG builder code should emit a hard error if the memcpy runtime call isn't available. There is a small gap where unhandled variable sized memcpys could be introduced between PreISelIntrinsicLowering and isel, but this is just a SelectionDAG implementation issue. That would only be an issue for a hypothetical pass which were to introduce a variable sized memcpy in the late codegen pipeline, which I can't see a need for. GlobalISel doesn't have the same limitation, and we could always come up with a post-isel expanded pseudo hack to deal with it.

I think part of the fundamental problem is we've jammed two different concepts into a confusing set of overlapping control mechanisms. We currently have this set of attributes:

`builtin`
`nobuiltin`
"no-builtins"
"no-builtin-<func>"

I'm unclear on exactly the semantics of any of these is supposed to be. "no-builtins" isn't in the LangRef, and the wording on nobuiltin is oddly call site specific. With this confusion, I think the implementation is just buggy. What is the difference between nobuiltin and "no-builtins" supposed to be? The nobuiltin LangRef merely states it may be placed on declarations and definitions, but doesn't state what that means in that case.

There are 2 different concepts here: what calls the compiler is allowed to introduce, and which the compiler recognizes as having known semantics. There are traces of evidence that they're supposed to be separate but this wasn't fully implemented. We separately have "RuntimeLibcalls" and "TargetLibraryInfo", which have some overlap in the set of functions. Both are unfortunately hardcoded sets the target has to opt-out of manually. TargetLibraryInfo tries to respect "no-builtins", but no equivalent mechanism is used for "RuntimeLibcalls". As such, I do think the PreISelIntrinsicLowering case should be using some kind of runtime libcall information, though not necessarily TargetLibraryInfo.

Now for memcpy in particular we have introduced __builtin_memcpy_inline (doc) which allows us to convey the semantics of "fixed size operations that always generate code". They are guaranteed to never "call to the libc" (godbolt). Unfortunately those intrinsics are not supported in GCC so this would be clang only. Also they don't yet cover memcmp and bcmp.

I do not believe memcpy inline should be necessary. It should always be correct to emit an llvm.memcpy call anywhere, and the backend can always expand it. It needs to consider runtime call target availability, which is just not considered currently as a bug.

efriedma added a subscriber: efriedma.EditedJul 23 2023, 8:09 PM

I'm unclear on exactly the semantics of any of these is supposed to be. "no-builtins" isn't in the LangRef, and the wording on nobuiltin is oddly call site specific. With this confusion, I think the implementation is just buggy. What is the difference between nobuiltin and "no-builtins" supposed to be? The nobuiltin LangRef merely states it may be placed on declarations and definitions, but doesn't state what that means in that case.

Roughly speaking, nobuiltin exists to suppress libcall recognition on a specific call, and "no-builtins" suppresses libcall recognition in function definitions. Sort of similar, but not exactly the same. (I think the primary case where we need "nobuiltin" C++ operator new, where a new expression and an explicit call to operator new have different optimization rules.)


I looked into this and it's just a straight SelectionDAG bug. DAG.getMemcpy tries expanding inline depending on target size thresholds and custom instructions, but ultimately unconditionally emits an illegal libcall

memcpy in particular has its own set of considerations. Historically, we required the user to provide a symbol named "memcpy", no matter what flags they passed. This has turned out to be unsatisfactory in certain scenarios, so we've worked around it. The primary form of workaround is that we don't form memcpys; if the input IR doesn't contain any reference to llvm.memcpy, we don't introduce such a reference. This makes optimizations slightly worse, but basically avoids the weird issues without significantly pessimizing code in most cases. (memcpy formation is pretty niche in most cases.)

Technically speaking, a codepath was recently added to expand memcpy. But it generates extremely bad code, to the point where it's not actually usable unless you don't care at all about performance. So practically speaking, nothing has changed here.


but no equivalent mechanism is used for "RuntimeLibcalls"

Historically, we assumed that all the functions in RuntimeLibcalls were available even in no-builtins mode, because we don't have alternative implementations. This has always had various holes, but it was a good enough approximation to allow building system libraries like libc.


Was there a reason that we could not simply merge the nobuiltin attribute into the caller?

The primary issue with LTO'ing parts of libc into a program is that we're basically crossing a line: once we start inlining calls into the program, "libc" is no longer an abstract interface, so we have to turn off libcall recognition for the whole module. I don't think merging the nobuiltin markings when you inline a function is sufficient.


This discussion is going in so many different directions it's hard for me to address everything...

Was there a reason that we could not simply merge the nobuiltin attribute into the caller?

The primary issue with LTO'ing parts of libc into a program is that we're basically crossing a line: once we start inlining calls into the program, "libc" is no longer an abstract interface, so we have to turn off libcall recognition for the whole module. I don't think merging the nobuiltin markings when you inline a function is sufficient.

That's completely satisfactory at least for my case, as the GPU has no hosted environment we can make assumptions about not emitting such calls in the backend. There's a similar problem with this that I attempted to solve in https://reviews.llvm.org/D154364 but I'm currently rethinking. The problem there is that we prevent explicit internalization of recognized libcalls during LTO because the backend might emit a call to it later. I think at the end of the day we need a list of libcall functions that each backend can emit, rather than just assume each backend roughly emits everything x64 does. But I'm not familiar enough with the machinery to know where such a function could live. I was thinking that we could maybe has a pass that decorates libcalls with special attributes if we know from the target library info that it is not emitted by the backend, but I don't know how feasible that is in a generic IR pass before touching the backend, not overly familiar with that machinery.

As it stands, not being able to inline even something simple like isalnum on the GPU is not great, so I'd appreciate some help in figuring out some kind of workaround even if it's just for the GPU. I feel like this is a bit of an edge case since up until now no one has really considered being able to optimize out libcalls.

Was there a reason that we could not simply merge the nobuiltin attribute into the caller?

The primary issue with LTO'ing parts of libc into a program is that we're basically crossing a line: once we start inlining calls into the program, "libc" is no longer an abstract interface, so we have to turn off libcall recognition for the whole module. I don't think merging the nobuiltin markings when you inline a function is sufficient.

That's completely satisfactory at least for my case, as the GPU has no hosted environment we can make assumptions about not emitting such calls in the backend. There's a similar problem with this that I attempted to solve in https://reviews.llvm.org/D154364 but I'm currently rethinking. The problem there is that we prevent explicit internalization of recognized libcalls during LTO because the backend might emit a call to it later. I think at the end of the day we need a list of libcall functions that each backend can emit, rather than just assume each backend roughly emits everything x64 does. But I'm not familiar enough with the machinery to know where such a function could live. I was thinking that we could maybe has a pass that decorates libcalls with special attributes if we know from the target library info that it is not emitted by the backend, but I don't know how feasible that is in a generic IR pass before touching the backend, not overly familiar with that machinery.

As it stands, not being able to inline even something simple like isalnum on the GPU is not great, so I'd appreciate some help in figuring out some kind of workaround even if it's just for the GPU. I feel like this is a bit of an edge case since up until now no one has really considered being able to optimize out libcalls.

Given the breadth of the issue it probably deserves an RFC on discourse.