This fixes cases where an invoke is emitted, even though the called LLVM function is marked nounwind, because CodeGenModule::ConstructAttributeList failed to add the attribute to the attribute list. LLVM optimization passes turn such invokes into calls and optimize away the exception-handling code, but it's better to avoid emitting that code in the front end when the called function is known not to raise an exception.
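For illustration, a minimal example of the kind of code affected (not taken from the patch's tests), assuming a callee that is already known nounwind from its declaration:

```cpp
// Because callee() is known not to throw, bar() can be emitted with a plain
// call instead of an invoke plus a landing pad that later passes would have
// to clean up.
void callee() noexcept;   // lowered with the nounwind attribute

void bar() noexcept {
  callee();               // expected: 'call', not 'invoke' + landingpad
}
```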
In case it wasn't clear, the calls to mayThrow() in the test cases are needed to prevent TryMarkNoThrow from annotating the functions with nounwind, which would cause a lot of churn.
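For context, a hypothetical shape of such a test function (not the actual test source):

```cpp
// The call to mayThrow(), a function with no visible body and no noexcept,
// keeps TryMarkNoThrow from inferring nounwind for func() itself, which
// avoids churn in the existing checks.
void mayThrow();   // may throw as far as the front end knows

void func() {
  mayThrow();      // prevents func from being annotated nounwind
}
```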
Looks good to me.
clang/test/CodeGenObjCXX/os_log.mm:15-16
This comment can simply read "Check that the os_log_helper is marked nounwind."
Hi @ahatanak
We recently hit an issue of inconsistent codegen related to this optimization. In one build, the Clang frontend generates different LLVM IR for the same function, which originally comes from a single header file. It turned out this optimization gives different results depending on function definition order, which is inherently unstable.
See these two repro programs:
p1.cpp: https://godbolt.org/z/bavTYEG1x
void foo() {}; void bar() noexcept {foo();};
p2.cpp: https://godbolt.org/z/zfsnzPrE6
void foo(); void bar() noexcept {foo();}; void foo(){};
Note that the codegen for bar differs between the two. In p2.cpp, the callee (foo)'s definition comes after the caller (bar), so foo is not yet known to be nounwind when bar is emitted, and Clang still generates the invoke and the exception-handling code.
This inconsistency affected AutoFDO: one piece of our work assigns consecutive numeric IDs to the basic blocks of the CFG, and the unstable CFGs cause the BB IDs to mismatch, so a lot of samples are lost.
We would like to hear your feedback. We're wondering whether the FE can handle this consistently, or whether we should just leave it to the BE. Thank you in advance!
To be clear, there's no miscompile, correct?
(Also, can the backend safely optimize an invoke to a linkonce_odr function that's nounwind to a call? I thought it couldn't, in case the function is de-refined to a version that's not nounwind. But the frontend can do it since it has access to the source and knows it can't be de-refined in that way?)
In any case, let's say the backend can do this optimization.
I wonder if this is just a single example, where there could be various other (header-related) peepholes that cause similar problems for stable output. IIRC, the usual Clang approach is to make as-close-to-optimal IR up front, but maybe in some situations it's desirable to delay optimizations to improve stability. Another application where that could be useful is caching.
Maybe the high level principle deserves a broader discussion on the forums. Do we want IRGen to prefer stable IR, or optimized IR? Should there be a -cc1 flag to decide (which AutoFDO could set)?
@rjmccall, any thoughts?
(Also, can the backend safely optimize an invoke to a linkonce_odr function that's nounwind to a call? I thought it couldn't, in case the function is de-refined to a version that's not nounwind. But the frontend can do it since it has access to the source and knows it can't be de-refined in that way?)
Can you please elaborate on what de-refining does? The backend does have the ability to optimize a nounwind invoke and its landing pad into a single call instruction.
In any case, let's say the backend can do this optimization.
I wonder if this is just a single example, where there could be various other (header-related) peepholes that cause similar problems for stable output. IIRC, the usual Clang approach is to make as-close-to-optimal IR up front, but maybe in some situations it's desirable to delay optimizations to improve stability. Another application where that could be useful is caching.
Maybe the high level principle deserves a broader discussion on the forums. Do we want IRGen to prefer stable IR, or optimized IR? Should there be a -cc1 flag to decide (which AutoFDO could set)?
A flag to allow for stable IR generation would be nice, but I guess in general we do not want to lose optimization opportunities that are only available to the front end just to favor AutoFDO. The current case sounds to me like a very specific case that the backend can also handle, and so far it's the only case we have seen affecting IR stability, so I'm inclined to just defer it to the backend. WDYT?
I wonder if this is just a single example, where there could be various other (header-related) peepholes that cause similar problems for stable output. IIRC, the usual Clang approach is to make as-close-to-optimal IR up front, but maybe in some situations it's desirable to delay optimizations to improve stability. Another application where that could be useful is caching.
I think this nounwind propagation is a classic IPA problem, where you need a proper per-function summary first and then propagate it through the call graph to get the final per-function attribute (like the Attributor does). The frontend is not the right place to do this kind of IPA/IPO.
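For illustration only, here is a rough sketch of the kind of order-independent bottom-up propagation being described. This is not the Attributor, just a toy fixed-point pass over hypothetical per-function summaries:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct FunctionSummary {
  bool MayThrowLocally = false;        // summary of the function body itself
  std::vector<std::string> Callees;    // direct callees by name
};

// Returns the set of functions that can safely be marked nounwind.
std::set<std::string>
inferNounwind(const std::map<std::string, FunctionSummary> &Summaries) {
  std::set<std::string> MayThrow;
  // Seed with functions whose own body may throw.
  for (const auto &[Name, S] : Summaries)
    if (S.MayThrowLocally)
      MayThrow.insert(Name);

  // Propagate through the call graph until a fixed point: a caller of a
  // may-throw (or unknown) function may itself throw. Because this iterates
  // to a fixed point, the result does not depend on definition order.
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (const auto &[Name, S] : Summaries) {
      if (MayThrow.count(Name))
        continue;
      for (const std::string &Callee : S.Callees) {
        if (!Summaries.count(Callee) || MayThrow.count(Callee)) {
          MayThrow.insert(Name);
          Changed = true;
          break;
        }
      }
    }
  }

  std::set<std::string> Nounwind;
  for (const auto &[Name, S] : Summaries)
    if (!MayThrow.count(Name))
      Nounwind.insert(Name);
  return Nounwind;
}
```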
Do we want IRGen to prefer stable IR, or optimized IR? Should there be a -cc1 flag to decide (which AutoFDO could set)?
Unstable IR is a side effect of trying to do IPA in the frontend, which is naturally going to be half-complete.
I'm not sure I follow why the invoke -> call optimization cannot be done in the mid-end. If possible, I think this should be deferred to the mid-end.
Oh, de-refining is pretty nifty / evil. This patch has background:
https://reviews.llvm.org/D18634
Since 2016, the optimizer is not allowed to do IPA on functions that can be de-refined (such as linkonce_odr functions).
Here's a hypothetical problematic scenario for the optimizer:
- original IR for B has a throw somewhere
- function A invokes function B
- in this TU, B is optimized, its exception paths are removed, and it gets marked nounwind
- function A leverages the nounwind to turn the invoke into a call
- function B is de-refined at link/load time: linker chooses a *different* function B which still has a throw
- "evil magic" happens (see the discussions around the patch linked above)
- a crash is introduced
At first blush, it sounds like this could only be a problem if the code has UB in it. However, read the above patch (and follow-ups, and related discussion) for a variety of examples of non-UB cases where IPA on de-refineable functions introduces crashes. I don't know for sure this could be a problem for nounwind specifically, but in general the LLVM optimizer doesn't look at attributes of de-refineable functions.
(Note that if you're doing LTO (like AutoFDO), this usually won't block optimization, since at LTO time there are very few de-refineable functions (most linkonce_odr functions are internalized, and not exported as weak). So if we added a -cc1 flag to prefer "stable IR" over "frontend peepholes", it would make sense for -flto to imply it.)
On the other hand, the frontend knows the token sequence from the source language. It knows whether function B is inherently nounwind based on its ODR token sequence; in which case, it's safe to use the attribute for an IPA peephole.
BTW, I'm not personally against removing this peephole from the frontend (even without a flag), or limiting it somehow to cases where it doesn't make IR output unstable. I like the idea of stable IRGen output.
Nevertheless, it feels like removing IPA-based peepholes from Clang in the name of stable IRGen output is a change in direction, which might deserve a discussion in the forums rather than in a particular patch review.
clang marks the called function foo in p1.cpp as nounwind here: https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.cpp#L1284
clang can also mark a function declaration as nounwind based on the information in the source code, for example, when it is annotated with __attribute__((pure)).
I haven't read everything discussed in https://reviews.llvm.org/D18634 yet, but it seems like it's safe to do this optimization when the called function is linkonce_odr. If clang or llvm's optimization determines one version of the function doesn't throw, then other versions of the same function can't throw either.
But it looks like clang doesn't do the right thing when foo is weak. clang emits a call instead of an invoke when it compiles the following code:
int foo() __attribute__((weak, pure)); int bar() noexcept { return foo();};
There are *some* properties we can still assume about linkonce_odr functions despite them being replaceable at link time. The high-level language guarantee we're starting from is that the source semantics of all versions of the function are identical. The version of the function we're looking at has been transformed from the original source — it is, after all, now LLVM IR, not C/C++ — but it has presumably faithfully preserved the source semantics. We can therefore rely on any properties of the semantics that are required to be preserved by transformation, which includes things like "does it terminate", "what value does it return", "what side effects does it perform", and so on. What we can't rely on are properties of the implementation that are not required to be preserved by transformation, like whether or not it uses a certain argument — transformations are permitted to change that.
The output-stability argument is an interesting one. The critical thing here is to avoid instability on the same source. When the source is different, I mean, it'd be nice to make a best effort at stability, but even putting optimization aside, things like header processing order or template instantiation order are necessarily going to affect things like order in the functions lists. That's going to affect output, at the very least in terms of object file order, but also in that we can't realistically promise that function processing order in the optimization will *never* have any impact. Our interprocedural passes generally try to work in call-dependency order, but that's not a perfect tree, and function order inevitably comes into it.
With all that said, I don't feel strongly that we need to preserve this frontend optimization if it's causing real problems.
Thanks for the detailed explanation about de-refining!
I feel a bit confused about linkonce_odr. From the LLVM IR reference I see the definition of
linkonce_odr, weak_odr
Some languages allow differing globals to be merged, such as two functions with different semantics. Other languages, such as C++, ensure that only equivalent globals are ever merged (the "one definition rule" — "ODR"). Such languages can use the linkonce_odr and weak_odr linkage types to indicate that the global will only be merged with equivalent globals. These linkage types are otherwise the same as their non-odr versions.
It sounds to me that at link time only equivalent symbols can replace each other. Then de-refining some of those equivalent symbols should not affect their semantics as far as nothrow is concerned? Just as @rjmccall pointed out, the C++ language guarantee we're starting from is that the source semantics of all versions of the function are identical.
That said, the LLVM optimizer does not strictly subsume the front end because of how it fails to handle linkonce_odr functions, as in https://reviews.llvm.org/D18634. I'm wondering how common the linkonce_odr linkage is for C++. In @wlei's example, none of the functions is linkonce_odr. Is there a particular source-level annotation that specifies functions to be linkonce_odr?
Discussing a path to stable IR gen in general in the forum would be great. In the meantime I'm appealing to remove this specific peephole to unblock AutoFDO, if nobody objects.
In C++, you get linkonce_odr all over the place. It's basically all functions that are defined in C++ headers that are available for inlining.
- any function marked inline
- any function in a class/struct whose declaration is its definition (approximately all templated code)
A few exceptions:
- If a function is explicitly instantiated (e.g., member functions of T<int> when you write template class T<int>;), it gets weak_odr, which IIRC cannot be de-refined?
- If a function has local linkage (like free functions with static inline), it gets internal, which cannot be de-refined.
- If a function is marked inline inside an extern "C" block, it gets available_externally. This can also be de-refined (but without ODR, you wouldn't be tempted to optimize based on its attributes anyway).
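To make that mapping concrete, here is a rough sketch of the linkage I'd expect Clang to assign to each case when the functions are emitted (worth double-checking against actual output):

```cpp
// header.h
inline int marked_inline() { return 1; }          // linkonce_odr

struct S {
  int defined_in_class() { return 2; }            // linkonce_odr
};

template <class T> struct Tmpl {
  int member() { return 3; }                      // linkonce_odr when implicitly instantiated
};
// template struct Tmpl<int>;                     // explicit instantiation -> weak_odr

static inline int local_helper() { return 4; }    // internal (local linkage)
```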
It sounds to me that at link time only equivalent symbols can replace each other. Then de-refining some of those equivalent symbols should not affect their semantics as far as nothrow is concerned? Just as @rjmccall pointed out, the C++ language guarantee we're starting from is that the source semantics of all versions of the function are identical.
The rule is subtly different. Only symbols that are source-equivalent can replace each other. But they aren't necessarily equivalent to the function you see, which may have been refined by optimization.
Here's a concrete example. Say we have function maybe_nounwind that is not nounwind at the source level, and a catch_all function that wraps it.
// Defined in header.
extern std::atomic<int> Global;

// LLVM: linkonce_odr
inline int maybe_nounwind(int In) {
  int Read1 = Global;
  int Read2 = Global;
  if (Read1 != Read2)
    throw 0;
  return /* Big, non-inlineable computation on In */;
}

// Defined in source.
// LLVM: nounwind
int catch_all(int In) {
  try {
    return maybe_nounwind(In);
  } catch (...) {
    return -1;
  }
}
There's no UB here, since comparing two atomic loads is allowed. In rare cases, an unoptimized maybe_nounwind could throw, if another thread is changing the value of Global between the two loads.
But the optimizer will probably CSE the two atomic loads since it's allowed to assume that both loads happen at the same time. This refines maybe_nounwind. It'll turn into IR equivalent to:
// Defined in header. Then optimized.
// LLVM: linkonce_odr nounwind readnone
inline int maybe_nounwind(int In) {
  return /* Big, non-inlineable computation on In */;
}

// Defined in source. Then optimized.
// LLVM: nounwind
int catch_all(int In) {
  try {
    return maybe_nounwind(In);
  } catch (...) {
    return -1;
  }
}
It's important that catch_all is NOT optimized based on maybe_nounwind's new nounwind attribute. At link time, it's possible for the linker to choose an unoptimized copy of maybe_nounwind. Just in case it does, catch_all needs to keep its try/catch block, since unoptimized maybe_nounwind can throw. Similarly, catch_all should not be marked readnone, even though the refined/optimized maybe_nounwind is readnone, since a de-refined copy reads from memory.
- At IRGen time, you know the LLVM attributes have not been adjusted after the optimizer refined the function's behaviour. It should be safe to have IPA peepholes, as long as IRGen's other peepholes don't refine behaviour and add attributes based on that.
- In the optimizer, if you're looking at a de-refineable function, then you don't know which attributes come directly from the source and which were implied by optimizer refinements. You can't trust you'll get the same function attributes at runtime.
On the other hand, the frontend knows the token sequence from the source language. It knows whether function B is inherently nounwind based on its ODR token sequence; in which case, it's safe to use the attribute for an IPA peephole.
Thanks for the detailed explanation again! As you pointed out previously, linkonce_odr is something the front end can optimize. I'm wondering why the front end can be confident that the linker would not replace the current definition with something else.
The frontend has generated unrefined IR with all side effects from the must-be-ODR-equivalent source still present. It's not until an optimizer gets at it that side effects can be refined away. (Unless the IRGen peepholes are powerful enough to refine away side effects, but I don't believe IRGen does that.)
Since the IR from IRGen is unrefined (still has all side effects present in the source), whatever the linker/loader chooses cannot gain "extra" side effects through de-refinement.
Hmm. I see what you're saying, but it's an interesting question how it applies here. In principle, the optimizer should not be changing the observable semantics of functions, which certainly includes things like whether the function throws. Maybe the optimizer can only figure out that a function throws in one TU, but if it "figures that out" and then a function with supposedly the same semantics actually does throw — not just retains the static ability to throw on a path that happens not to be taken dynamically, but actually throws at runtime — then arguably something has gone badly wrong. As I recall, the de-refinement discussion was originally about properties that are *not* invariant to optimization in this way, things like whether the function uses one of its arguments. Those properties are not generally considered to be part of the function's externally-observable semantics.
Of course, that's making a lot of assumptions about both what transformations are legal and to what extent they can be observed. All bets are off the second you have a single transformation that's observable in code. For example, we have a C++ optimization that promotes scoped heap allocations to the stack; that can definitely change whether exceptions are thrown, and then you can handle that exception and change return values, trigger extra side effects, and so on. I don't think anyone wants to argue that we shouldn't do that optimization. Even more simply, fast-math optimization can certainly change return values; and of course *anything* can change semantics under SEH.
Even if we just need to assume that that's always going to be possible in LLVM — that there will always be optimizations in play that can arbitrarily change observable semantics — maybe we can at least be a little more principled about them? It's still true that the vast majority of transformations in LLVM cannot trigger arbitrary changes to source semantics, at least when SEH isn't in use. Most transformations that change semantics are pretty narrow in practice — they don't touch most functions — so instead of conservatively assuming that *any* function might have been altered, it's probably profitable to track that on specific functions. That would at least eliminate this artificial boundary between the frontend and the optimizer: the optimizer would have the information it would need to do this analysis on unaltered functions.
As far as I know, the optimizer IPO pass that infers function attributes (i.e., InferFunctionAttrsPass) is placed at the very beginning of the optimization pipeline. Does this mean that the side effects computed for linkonce_odr functions there can be trusted by the rest of the pipeline?
I believe in my example, it's kind of the reverse. Only one TU *remembers* that the function can throw; the other one "forgets" because it has optimized its variant not to throw.
Maybe it's useful to note that, while maybe_nounwind has no UB, whether it throws or not depends on thread timing, and is generally non-reproducible (run it twice, you can get different results). In the TU that forgets, the optimizer is choosing to assume that the two adjacent atomic loads happen so quickly that no store happens in between; choosing the thread timing where there's no store to contend with. This is a valid refinement of the original source semantics -- optimizers are allowed to CSE adjacent atomic loads.
As I recall, the de-refinement discussion was originally about properties that are *not* invariant to optimization in this way, things like whether the function uses one of its arguments. Those properties are not generally considered to be part of the function's externally-observable semantics.
The example described in the referenced de-refinement commit is where a function that writes to memory is refined to readnone. I think my nounwind example above is analogous.
Here's the original from https://reviews.llvm.org/D18634:
For instance, FunctionAttrs cannot assume a comdat function is
actually readnone even if it does not have any loads or stores in
it; since there may have been loads and stores in the "original
function" that were refined out in the currently visible variant, and
at the link step the linker may in fact choose an implementation with
a load or a store. As an example, consider a function that does two
atomic loads from the same memory location, and writes to memory only
if the two values are not equal. The optimizer is allowed to refine
this function by first CSE'ing the two loads, and then folding the
comparison to always report that the two values are equal. Such a
refined variant will look like it is readonly. However, the
unoptimized version of the function can still write to memory (since
the two loads can result in different values), and selecting the
unoptimized version at link time will retroactively invalidate
transforms we may have done under the assumption that the function
does not write to memory.
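For concreteness, a rough sketch of the device described in that quote (not code from the commit itself): two atomic loads from the same location, and a store only when they differ. After the optimizer CSEs the loads, the refined copy looks readonly, but an unoptimized copy chosen at link time can still write.

```cpp
#include <atomic>

extern std::atomic<int> Global;
extern int Sink;

inline void maybe_writes() {        // linkonce_odr when defined in a header
  int Read1 = Global.load();
  int Read2 = Global.load();
  if (Read1 != Read2)               // folded away once the loads are CSE'd
    Sink = 1;                       // the refined copy appears not to write to memory
}
```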
So your argument is that it would not be possible to recognize that we're doing such an optimization and mark the function as having had a possible semantics change?
Depends what you mean by "trusted". It assumes the attributes accurately describe the function it sees. The properties promised there will apply if/when the code is inlined. But, since the commit in 2016, it doesn't trust that they fully describe the source semantics, so IPA ignores them when the function is not inlined.
Note that the optimizer doesn't know if its input IR has already been optimized. Is this the first optimizer that has run on the IR, or could side effects have been refined away already? E.g., if the optimization pipeline in question runs at LTO time, the compile-time optimization pipeline has already run.
I suspect it would be error-prone to do that precisely. I'd bet there are a variety of hard-to-reason about examples. Originally, when @sanjoy was first describing this problem (to me and others), his examples all had UB in the original code (e.g., reading and writing to globals in different threads). Eventually he invented the adjacent-atomic-load device, described above, which does not rely on UB in the original code. I just assume there are more devices out there that we don't know about or understand.
Maybe it would be useful to do it imprecisely? E.g., have all transformation passes mark all functions as changed (or drop a "pristine" attribute)? Then at least you know whether it has come directly from IRGen.
Not sure if it's valuable enough though. I don't think the regressions were as bad as we expected. It seems okay for the optimizer to delay propagating attributes from de-refineable functions until you have the export list for the link (and non-exported symbols can be internalized).
Sorry... this seems to be pretty off topic, in a way, although maybe not many people know the de-refinement ins and outs so maybe this is useful.
My original points:
- Moving this to the middle end isn't easy (currently, it doesn't have the info it needs, although John might have a proposal for providing it)
- Dropping frontend peepholes in favour of stable IR output seems novel and might deserve a forum discussion (I'd be in favour, personally)
I just assume there are more devices out there that we don't know about or understand.
I don't totally understand the broader discussion, but malloc(4) == nullptr is another gadget. This is optimized to false by LLVM even though at runtime it can be false or true depending on the state of the heap.
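A sketch of that gadget as I read the comment above: defined in a header, this helper would be linkonce_odr; the optimizer is allowed to assume the small allocation succeeds (and to elide the malloc/free pair), folding the comparison to false and refining away the out-of-memory path that an unoptimized copy still has at runtime.

```cpp
#include <cstdlib>

inline bool alloc_failed() {
  void *P = std::malloc(4);
  bool Failed = (P == nullptr);   // may be folded to 'false' after optimization
  std::free(P);
  return Failed;
}
```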
I'm wondering if we can come up with a way to tell the optimizer about that, e.g., through a new module flag. When it comes to LTO, the selection of linkonce_odr symbols should already have been done, and the optimizer may be able to recompute the attributes based on the pre-LTO attributes, or at least we could restrict IPO to one module only, which should still do a better job than the FE does?
I don't think there's much point in passing anything to LTO. There are very few linkonce_odr symbols in LTO, since LTO has the advantage of an export list from the link. Symbols not on the export list are internalized (they're given local linkage).
That sounds to me like an opportunity to get broader IPO done precisely in the prelink optimizer, as long as we find a way to tell it the incoming IR has source fidelity. What do you think about the idea of introducing a module flag? Maybe it's worth discussing in the forum as a follow-up to introducing a -cc1 flag for stable IR gen.
I'm not sure I'm following.
The prelink optimizer will already be internalizing (i.e., NOT exporting) these symbols. That should be enough. AFAICT, it's non-LTO pipelines that might have headroom after this is reverted.
I'm also not sure what the module flag would be for. If "this module has source fidelity", it won't work, because the gadgets I'm aware of are implemented in function passes (probably -instcombine?). A function pass isn't allowed to touch module state. Were you thinking of a different module flag? (But, I repeat, I think LTO pipelines have nothing to worry about anyway.)
By prelink I meant the optimizer run by Clang. The one run by the linker is usually called the postlink optimizer. As you pointed out, the postlink optimizer is unlikely to see many linkonce_odr functions because of the internalization done right before it. But the prelink optimizer, which is basically a non-LTO optimizer, will still see them.
We actually didn't see the expected wins with AutoFDO ThinLTO after disabling this specific FE peephole, which I'm guessing might be due to the lack of such a peephole in the prelink optimizer.
I'm also not sure what the module flag would be for. If "this module has source fidelity", it won't work, because the gadgets I'm aware of are implemented in function passes (probably -instcombine?). A function pass isn't allowed to touch module state. Were you thinking of a different module flag? (But, I repeat, I think LTO pipelines have nothing to worry about anyway.)
One of the common places where an invoke is promoted to a call is in SimplifyCFG (https://github.com/llvm/llvm-project/blob/207854b07dd9bd0d79add49bc5af17f1aabc752f/llvm/lib/Transforms/Utils/Local.cpp#L2498), which is based on function attributes inferred by InferFunctionAttrsPass. That pass is the first optimization pass in the pipeline, so as long as we can get that pass right, the downstream optimizations should be safe. The module flag I was thinking of would inform the InferFunctionAttrsPass module pass that the IR it's seeing is non-refined.