Since r359743, we remove inalloca from functions that don't need
it. This change optimizes the affected allocas by turning them
into static allocas.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 31319 Build 31318: arc lint + arc unit
Event Timeline
Nice!
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2147 | I would suggest adding ->stripPointerCasts() to the argument here. In LTO scenarios with C++ templates, it's highly likely that two different TUs will end up computing different struct types that are structurally equivalent. When they are linked together, pointer bitcasts may be added to make the types work out. | |
2155 | This will skip the lifetime insertion. Do you think we should do it anyway? I guess it could hypothetically matter if you have a massive single basic block function that does 20 inalloca calls, then stack usage could get out of hand. Maybe just skip the hoisting. | |
2159–2162 | There's a shortcut for this: IRBuilder<>::getInt64() can make ConstantInts. | |
2168 | It's unfortunately very likely that getFirstInsertionPt may return an end iterator if BB is a catchswitch BB. I would just continue the loop in these cases. A missing lifetime end should pessimize the code, not lead to miscompiles. You should be able to construct a test case for this by putting an inalloca call inside a try / catch. | |
llvm/test/Transforms/GlobalOpt/inalloca.ll | ||
67 | This is a bit artificial, since we don't support generating code for a function that uses landingpad instructions with __CxxFrameHandler3 personalities. I think it's worth testing catchswitch anyway, but if you want to test landingpad too, add another test that uses __gxx_personality_v0. |
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2155 | Do we need to check whether the alloca is in the same basic block as the call? If there's control flow between the alloca and the call, the placement of the lifetime intrinsics might be wrong, or the alloca might still be used by some other inalloca call, or the alloca might not be deallocated along the path where the call doesn't execute. It's easy to construct a C++ testcase where an exception might be thrown between an alloca and the corresponding inalloca call. Not sure if it's possible for clang or LLVM optimizations to produce an alloca used by multiple different inalloca calls, depending on control flow. | |
2168 | There might not be an insertion point in a block with a catchswitch. |
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2155 | I've been operating under the assumption that stack coloring can tolerate lifetime markers that don't form regions, so this is a best effort to not create excessively large stack frames for functions with long sequences of inalloca calls. There are two side exit cases that are interesting:
I think normal exits are only possible with gnu statement expressions, which allow you to set up some arguments to a call, and then return, goto, or break out of the call setup. Clang's generated IR will do the stackrestore on a normal side exit, I believe, using the normal cleanup mechanism. We could try to find the associated stacksave / stackrestore calls and use them to insert the lifetime markers, but that requires a lot of analysis. So, I think normal exits are uninteresting. I had thought exceptional exits wouldn't be a problem, but now I'm worried about them. When the alloca is dynamic, I believe what happens is that the unwinder resets the stack pointer to what it was at the end of the prologue. This interacts badly with allocas, and MSVC does something to handle this, but I think we never implemented it. However, if we don't place our lifetime end markers along exceptional exits in this transform, I think there's the possibility that we won't do the stack coloring, and we could have runaway stack growth. Hm. We could place lifetime end markers along every unwind edge reachable on the path from the alloca to the call... but we'd have to worry about those uncommon normal side exits, then. |
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2155 | This skips the hoisting (that happens in AI->moveBefore(...) later on). This is the case where the alloca is already in the entry block. In that case, I think there is no gain from moving it, and you risk breaking code (by moving it after its use). | |
2159–2162 | Thanks! | |
2168 | I updated the test case so that it hits this case and changed the code to:
This should handle the catchswitch case by effectively inserting lifetime.end after the catchpads it dispatches to. | |
llvm/test/Transforms/GlobalOpt/inalloca.ll | ||
67 | I cobbled this together, but realized afterwards that it doesn't make much sense. Since this is 32-bit Windows, I rewrote it to use catchswitch instead. |
An example of what I'm talking about with multiple calls:
struct C { C() noexcept; C(const C&) noexcept; ~C() noexcept; }; void f(C) noexcept, f2(C) noexcept; void g(bool b, bool b2) { if (b2) { b ? f(C()) : f2(C()); } }
With "clang -O2 --target=i686-pc-windows-msvc -emit-llvm", there's one alloca used by multiple calls; the alloca can't be hoisted until both inalloca attributes are stripped.
I think all my comments are addressed, but @efriedma made a good point about the lifetime marker placement along exceptional exits from inalloca construction. I think we probably shouldn't worry about it and just go with this, but I'm biased, since we don't use EH. =S
Right, that test case. It was even in the design doc. If only we'd had token at the time. :(
So, we'd have to analyze the uses of the alloca to prove this transform is valid. But, this is kind of nasty since you can take an alloca, bitcast it, gep forwards, store that, reload it, gep back, and then use that as an argument to an inalloca call, and that should be valid IR.
It's actually important to optimize the case where the inalloca parameter escapes via a call and there is some control flow, since typically the reason we're emitting inalloca in the first place is because we have to call a nontrivial constructor. So, we'd want to be able to say that it's not valid to pass an inalloca pointer to a call, return it back, and then use it as an argument pack. I don't know if we can make that semantic change. We have things like the IR outliner that wouldn't know about that, although they wouldn't intentionally try to obscure other uses of a captured alloca.
I think Eli raises a good point, and I'm inclined to just say "if the alloca (before hoisting) is not in the same BB as the call, don't hoist it". That avoids changing the stack behavior of the code. It also means we miss out on an optimization, but it's an optimization that we're not performing today anyway.
For the multiple call case, we could maybe try to do some sort of control-flow based analysis, instead of trying to analyze uses of the pointer. Essentially, take advantage of the rule that "the argument allocation must have been the most recent stack allocation that is still live". But that's probably difficult to implement.
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2155 | The truly awful part about trying to emit lifetime markers for the exception side-exit case is that there isn't any good indication in the IR where, exactly, we would need to place the lifetime.end marker. The lifetime actually ends at some location in the middle of the exception handler, after the temporary's destructors run. Maybe clang should emit lifetime markers in that case. |
llvm/lib/Transforms/IPO/GlobalOpt.cpp | ||
---|---|---|
2152 | Maybe worth expanding a little on why control flow matters, based on the review discussion. | |
2167 | Do we expect that clang will, at some point, start emitting the relevant lifetime markers itself? It currently doesn't, as far as I can tell, but it seems like it might be useful for other reasons. |
The control flow based idea could be stated as, is there any other call to inalloca reachable from this allocation that is not killed by another stack allocation? But, I guess it would have to track stacksave + stackrestore levels, and those are hard to analyze too.
In person over here, @inglorion and I felt that maybe this would be the best:
- Teach clang to emit lifetime markers for inalloca packs (this should be easy)
- Teach instcombine or another general cleanup pass to turn an inalloca alloca static if there are no inalloca calls anywhere in the current function. Don't try to insert lifetime markers, just assume the frontend did it if it cares.
Doing the whole-function analysis ignoring uses is much simpler, and it catches cases where the inalloca call site gets inlined, but the inalloca alloca doesn't get SROA'd.
Teach instcombine or another general cleanup pass to turn an inalloca alloca static if there are no inalloca calls anywhere in the current function
Given an arbitrary alloca in a function with no inalloca calls, hoisting the alloca to the entry block requires proving that the allocation's lifetime doesn't overlap itself. For example, we have to make sure we don't miscompile for (int i = 0; i < n; i++) foo(alloca(1));. So you have to track stacksave + stackrestore levels anyway.
That said, a general alloca hoisting transform might be useful in other contexts, and analyzing the whole function at once is probably the most efficient way to handle it.
Thanks for helping me think about this and alternative approaches. I'll be withdrawing this for now to get it out of the review queue. If I end up implementing a new approach, I'll put it up as a new diff (with a link to this one for the comments).
I would suggest adding ->stripPointerCasts() to the argument here. In LTO scenarios with C++ templates, it's highly likely that two different TUs will end up computing different struct types that are structurally equivalent. When they are linked together, pointer bitcasts may be added to make the types work out.