This is an archive of the discontinued LLVM Phabricator instance.

[GlobalOpt] Remove preallocated calls when possible
ClosedPublic

Authored by aeubanks on Jun 1 2020, 1:42 PM.

Details

Summary

When possible (e.g. internal linkage), strip preallocated attribute off
parameters/arguments.
This requires removing the "preallocated" operand bundle from the call
site, replacing @llvm.call.preallocated.arg() with an alloca and a
bitcast to i8*, and removing the @llvm.call.preallocated.setup().

This sort of transformation may need to be moved to somewhere more
accessible to accomodate similar transformations like inlining in the
future.

Diff Detail

Event Timeline

aeubanks created this revision.Jun 1 2020, 1:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 1 2020, 1:42 PM
aeubanks marked an inline comment as done.Jun 1 2020, 1:44 PM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2288

I couldn't figure out how to remove the preallocated operand bundle any better than this, suggestions greatly appreciated

aeubanks updated this revision to Diff 267726.Jun 1 2020, 2:10 PM

Remove obsolete comment

efriedma added inline comments.Jun 1 2020, 2:18 PM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2281

Maybe assert any non-callbase use is a BlockAddress

2288

Looked briefly; I don't think there's any other way.

2304

Is the extractProfTotalWeight really necessary? I would have thought copying a call using CallInst::Create() would copy that as well.

2311

I think it's possible to have multiple llvm.call.preallocated.arg() calls with the same index?

If you're creating dynamic allocas, you probably want to llvm.stacksave/stackrestore in appropriate places.

2418

Do you also need to check for musttail calls?

aeubanks updated this revision to Diff 267753.Jun 1 2020, 3:42 PM

Don't remove preallocated when there are any musttail callers

aeubanks marked 4 inline comments as done.Jun 1 2020, 3:49 PM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2281

Done in newly added hasMustTailCallers().

2304

Did a little test, yes it's required. Do you think it makes sense to fix this globally?

2311

I think it's possible to have multiple llvm.call.preallocated.arg() calls with the same index?

Hmm, that's true. I guess for each index that is used, create an alloca right after the llvm.call.preallocated.setup() and replace the corresponding llvm.call.preallocated.arg with the result of that. That way the alloca will dominate all its uses.

If you're creating dynamic allocas, you probably want to llvm.stacksave/stackrestore in appropriate places.

Oh I though llvm automatically handled dynamic allocas, I'll look more into this.

2418

Good point, done

efriedma added inline comments.Jun 1 2020, 5:38 PM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2304

Maybe worth doing a brief survey of other callers. I expect usually, you'd want to preserve almost all most metadata.

2311

If you don't do anything, the "automatic" handling for dynamic allocas is that they remain live until the function returns, similar to alloca() in C. That's probably not what you want here; you want the memory to be deallocated after the call.

aeubanks updated this revision to Diff 268046.Jun 2 2020, 8:24 PM

Move more comprehensive tests to new file
Handle multiple @llvm.call.preallocated.arg() calls with same arg index
Add @llvm.stack{save,restore}

ping :)
I've gotten rid of the metadata cloning here. Originally I had seen the profile metadata copying somewhere else and I mostly copied that code, but I looked back and it was actually copying it from a different instruction.

efriedma added inline comments.Jun 8 2020, 4:26 PM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

"This doesn't handle invoke" is rough...

In the normal destination, you can stick a stackrestore as the first instruction (breaking the critical edge if necessary). In the unwind destination, I'm not sure. For Itanium-style unwinding, you could just stick it immediately after the landingpad, but I'm not sure what you can do on Windows; you can't stackrestore in a funclet.

I guess ultimately, we really want to use a static alloca (in the entry block) in most cases, but it's not safe in all cases, as we've discussed before.

At the very least, this probably should bail out somehow, not crash.

aeubanks updated this revision to Diff 270550.Jun 12 2020, 5:09 PM

Bail out if any invoke callers

aeubanks marked an inline comment as done.Jun 12 2020, 5:21 PM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

Bailed out for any invokes.

I've been trying to understand how inalloca handles this and still don't quite understand.

void bar2() {
 for (int i = 0; i < 1000000; ++i) {
  try {
   foo(A(), 5);
  } catch (int) {}
 }
}

becomes

define dso_local void @"?bar2@@YAXXZ"() local_unnamed_addr #0 personality i8* bitcast (i32 (...)* @__CxxFrameHandler3 to i8*) {
entry:
  %agg.tmp.ensured = alloca %struct.A, align 8
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.inc
  ret void

for.body:                                         ; preds = %for.inc, %entry
  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %inalloca.save = call i8* @llvm.stacksave()
  %argmem = alloca inalloca <{ %struct.A*, %struct.A, i32 }>, align 4
  %0 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 1
  %call = call x86_thiscallcc %struct.A* @"??0A@@QAE@XZ"(%struct.A* nonnull %0) #2
  %1 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 0
  store %struct.A* %agg.tmp.ensured, %struct.A** %1, align 4
  %2 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 2
  store i32 5, i32* %2, align 4, !tbaa !3
  %call1 = invoke %struct.A* @"?foo@@YA?AUA@@U1@H@Z"(<{ %struct.A*, %struct.A, i32 }>* inalloca nonnull %argmem)
          to label %invoke.cont unwind label %catch.dispatch

catch.dispatch:                                   ; preds = %for.body
  %3 = catchswitch within none [label %catch] unwind to caller

catch:                                            ; preds = %catch.dispatch
  %4 = catchpad within %3 [%rtti.TypeDescriptor2* @"??_R0H@8", i32 0, i8* null]
  catchret from %4 to label %for.inc

for.inc:                                          ; preds = %invoke.cont, %catch
  %inc = add nuw nsw i32 %i.05, 1
  %exitcond = icmp eq i32 %inc, 1000000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body

invoke.cont:                                      ; preds = %for.body
  call void @llvm.stackrestore(i8* %inalloca.save)
  call x86_thiscallcc void @"??1A@@QAE@XZ"(%struct.A* nonnull %agg.tmp.ensured) #2
  br label %for.inc
}

The @llvm.stackrestore() only gets called in the non-exceptional case, but I verified that this doesn't leak stack memory when foo() always throws. I'm not sure how that happens.

efriedma accepted this revision.Jun 16 2020, 5:44 PM

LGTM

llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

You're not testing what you want to test; the problem isn't when the function with the inalloca argument itself throws. The behavior is obvious in that case: the alloca gets consumed by the call. The issue is what happens if an exception gets thrown before that. Expanding out your example:

struct A { A() { throw 1; } ~A(); };
void foo(A a, int z); // Never gets called
void bar() {
 for (int i = 0; i < 1000000; ++i) {
  try {
   foo(A(), 5);
  } catch (int) {}
 }
}

I'm pretty sure with inalloca, the memory just leaks. I guess preallocated has the same problem,

Not sure how we solve it with preallocated. I guess in the backend, we need to apply some sort of adjustment after a catchret. Not sure how we compute the size of that adjustment, though. I guess we need to note on each catchret/cleanupret which allocations die at that point.

This revision is now accepted and ready to land.Jun 16 2020, 5:44 PM
aeubanks marked an inline comment as done.Jun 16 2020, 8:23 PM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

I realized that at some point and I've been trying out different variations of A() and foo() both randomly throwing, plus nested calls to foo(), but it seems like there's never a stack leak with inalloca with @llvm.stackrestores removed. The @llvm.stackrestore only appears in the non-exceptional path, so it can't possibly help prevent stack leaks due to exceptions? And in the non-exceptional case the stack will get cleaned up as expected. So I'm not sure that the stack restores are necessary for inalloca?

I am still trying to understand how stack cleanups work in an exception handler though.

void bar() {
        for (int i = 0; i < 1000000; ++i) {
                try {
                        foo(foo(A(), A(), 5), A(), 6);
                } catch (int) {
                }
        }
}
efriedma added inline comments.Jun 17 2020, 2:16 AM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

Try the following for some interesting results at -O0.

#include <cstdio>
#define INLINE __attribute((always_inline))
//#define INLINE __attribute((noinline))
struct A {
__attribute((optnone)) A(int) {z[1]=10;}
__attribute((optnone)) A() { z[1]=5; throw 1; }
__attribute((optnone)) ~A() { printf("%d\n", z[1]); }
int z[10000];
};
__attribute((optnone)) void foo(int z, A a) { a.z[0] = 100; }
INLINE static int inner() { try {foo(1, A());}catch(int){} return 3; }
void bar() {
 for (int i = 0; i < 10; ++i) {
   foo(inner(), A(1));
 }
}
int main() { bar(); }
aeubanks marked an inline comment as done.Jun 18 2020, 12:17 AM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

Interesting...
noinline is fine (with and without removing the stack restores), but always_inline produces incorrect results. Keeping the stack restores, the printfs print a random value (consistent within a run), and removing the outer stack restore makes it crash after the first iteration.

So this is a bug with the current implementation of inalloca? I can't tell if it's specifically an inalloca/exception/inlining issue (the LLVM IR looks fine at a quick glance) or a deeper exception handling issue.

This reminds me of https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html which Reid has brought up to me before, where you can apparently jump out while in the middle of constructing arguments.

This revision was automatically updated to reflect the committed changes.
aeubanks marked an inline comment as not done.
aeubanks marked an inline comment as done.Jun 18 2020, 11:15 AM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

I tried putting the outer stackrestore in the exception path instead of the non-exceptional path and that makes it produce the right output without stack leaks.

Filed https://bugs.llvm.org/show_bug.cgi?id=46386.

efriedma added inline comments.Jun 18 2020, 11:32 AM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

One, it looks like llvm::CloneBasicBlock doesn't recognize inalloca allocations as dynamic in some cases. That's pretty clearly a bug, and I think fixing that might be enough to deal with my earlier testcase. But really, it shouldn't matter: the frontend should wrap inalloca allocations with appropriate stacksave/stackrestore operations anyway.

Two, it looks like clang has some general issues with the interaction between dynamic allocation and exceptions. Take the following testcase:

int z = 10000;
__attribute((optnone)) int bar(int*z) { int a = *z; throw a;}
int main() {
    for (int i = 0; i < 1000; ++i) {
    try {
        int a[z];
        bar(a);
        __builtin_trap();
    } catch (...) {
    }
    }
}

This currently overflows the stack on Linux because there isn't an appropriate stackrestore anywhere.

Three, I suspect that unwinding on 32-bit Windows is freeing dynamic allocations in cases where it shouldn't, and this is hiding some inalloca issues. Don't have a good testcase for this; this is just what I suspect from not seeing stack overflows in cases where the IR indicates there should be stack overflows.

aeubanks marked an inline comment as done.Jun 22 2020, 1:38 PM
aeubanks added inline comments.
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2333

Fixing CloneBasicBlock doesn't seem to fix the example you gave.

Your example makes more sense as to why the stackrestore is necessary, thanks.

(also I really do appreciate you taking the time to look into things and explain things to me)

hans added inline comments.Jul 7 2020, 5:55 AM
llvm/lib/Transforms/IPO/GlobalOpt.cpp
2418

Actually, I think the old code is also missing that check :-) Sent D83300