This is an archive of the discontinued LLVM Phabricator instance.

[coroutines] Promote cleanup.dest.slot-like allocas to registers to avoid storing them in the coroutine frame
AbandonedPublic

Authored by GorNishanov on Aug 23 2017, 10:52 PM.

Details

Reviewers
majnemer
Summary

Allocas that follow the pattern of cleanup.dest.slot, namely, they are
integer allocas that don't escape and only store constants. We don't want
them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

For example in this coroutine %slot will be promoted to register even in -O0

define i8* @happy.case() "coroutine.presplit"="1" {
entry:
  %slot = alloca i32
  %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
  %size = call i32 @llvm.coro.size.i32()
  %alloc = call i8* @malloc(i32 %size)
  %hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)

  store i32 1, i32* %slot

  %tok = call i8 @llvm.coro.suspend(token none, i1 false)
  switch i8 %tok, label %suspend [i8 0, label %resume
                                i8 1, label %cleanup]
resume:
  store i32 2, i32* %slot
  br label %cleanup

cleanup:
  %mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
  call void @free(i8* %mem)
  %x = load i32, i32* %slot
  call void @print.i32(i32 %x)
  br label %suspend
suspend:
  call i1 @llvm.coro.end(i8* %hdl, i1 0)
  ret i8* %hdl
}

This fixes PR34289

Diff Detail

Event Timeline

GorNishanov created this revision.Aug 23 2017, 10:52 PM
eric_niebler added inline comments.Aug 25 2017, 1:40 PM
lib/Transforms/Coroutines/CoroFrame.cpp
789 ↗(On Diff #112504)

Knowing nothing about how this code works, it seems strange to me that you are ignoring the last instruction in the basic block, but the comment says you're looking at all instructions.

GorNishanov marked an inline comment as done.Aug 25 2017, 1:47 PM
GorNishanov added inline comments.
lib/Transforms/Coroutines/CoroFrame.cpp
789 ↗(On Diff #112504)

Block should always end in a terminator instruction. AllocaInst is not a terminator, hence, no need to check for it. Here is another place with a similar loop (from where it was shamelessly stolen from :-))

C:\GitHub\llvm\lib\Transforms\Utils\Mem2Reg.cpp:
   39      // Find allocas that are safe to promote, by looking at all instructions in
   40      // the entry node
   41:     for (BasicBlock::iterator I = BB.begin(), E = --BB.end(); I != E; ++I)
   42        if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) // Is it an alloca?
   43          if (isAllocaPromotable(AI))

I'd love to see this approved. It's a bad bug.

majnemer edited edge metadata.Aug 30 2017, 3:37 PM

How are integer allocas special?

GorNishanov marked an inline comment as done.Sep 1 2017, 9:30 AM

How are integer allocas special?

clang synthesizes integer allocas for flags and cleanup states to implement dispatching of cleanup code to decide whether something is fully constructed and need destruction and to decide where to go after common cleanup sequence is executed. In O2, those synthetic variables are gone. In O0, they stay as variables and coroutine frame building code will put them into the coroutine frame, that may lead to accessing them for cleanup purposes in the initiation function after coroutine frame is already destroyed due to coroutine never suspending.

Potentially this code can be generalized to deal with arbitrarily typed allocas with this pattern, but, I wanted to keep it only for integers since that is the scenario (clang synthesized flags and destination slot variables) I wanted to address.

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

hfinkel added a subscriber: hfinkel.Sep 1 2017, 3:27 PM

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I can run the promotion of alloca in CoroEarly before any optimization passes could interfere. Essentially, cleaning up after the frontend "heavy handed" :-) cleanup code codegen.

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

Most of the data in the coroutine folds neatly into two categories:

  • Data that can be used after resume (all of the variables, temporaries and parameter copies in the user authored body.
  • Data used only during coroutine creation when it is first invoked, such as parameters and coroutine return value, such as generator<int>, task<void>, etc.

Cleanup destination slot breaks this pattern. It is used both before coroutine state is created, during coroutine body execution and after coroutine state is destroyed.

I was thinking for awhile and was not able to come up with a simple algorithm to deal with it apart from the one offered in this patch.

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

Most of the data in the coroutine folds neatly into two categories:

  • Data that can be used after resume (all of the variables, temporaries and parameter copies in the user authored body.
  • Data used only during coroutine creation when it is first invoked, such as parameters and coroutine return value, such as generator<int>, task<void>, etc.

Cleanup destination slot breaks this pattern. It is used both before coroutine state is created, during coroutine body execution and after coroutine state is destroyed.

I was thinking for awhile and was not able to come up with a simple algorithm to deal with it apart from the one offered in this patch.

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

GorNishanov added a comment.EditedSep 4 2017, 10:43 AM

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

The problem this patch is fixing only occurs in -O0, at any other optimization levels, mem2reg pass during SROA eliminates all of the cleanup related allocas (and more).
I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot), so I am looking for a pattern used in those, namely, those are integer allocas that have only constants stored in them.

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

The problem this patch is fixing only occurs in -O0, at any other optimization levels, mem2reg pass during SROA eliminates all of the cleanup related allocas (and more).
I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

so I am looking for a pattern used in those, namely, those are integer allocas that have only constants stored in them.

I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

Because as Gor writes at the top:

We don't want them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

This bug manifests as a codegen use-after-free in -O0.

I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

Because as Gor writes at the top:

We don't want them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

This bug manifests as a codegen use-after-free in -O0.

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

For cleanup, clang creates a pattern that current coroutine frame building algorithm cannot handle. To handle it, we will probably need to convert allocas to SSA and see def/use chains, but, once converted to SSA, the problematic pattern disappears.
Another approach is to prove that for a particular alloca store-load pairs never cross a suspend point, if that is the case, we can duplicate the alloca (so that every coroutine part has its own copy) since the value is never carried over from one suspend/resume to another. However, this algorithm seems complicated and have to be hand built, whereas, we can reuse well-tested alloca promotion algorithm to eliminate the problematic pattern.

Probably moving the promotion from CoroSplit to CoroEarly which rans immediately after front-end will help to make sure that later optimization passes will not interfere with cleanup.dest alloca promotion.

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

For cleanup, clang creates a pattern that current coroutine frame building algorithm cannot handle. To handle it, we will probably need to convert allocas to SSA and see def/use chains, but, once converted to SSA, the problematic pattern disappears.
Another approach is to prove that for a particular alloca store-load pairs never cross a suspend point, if that is the case, we can duplicate the alloca (so that every coroutine part has its own copy) since the value is never carried over from one suspend/resume to another. However, this algorithm seems complicated and have to be hand built, whereas, we can reuse well-tested alloca promotion algorithm to eliminate the problematic pattern.

Probably moving the promotion from CoroSplit to CoroEarly which rans immediately after front-end will help to make sure that later optimization passes will not interfere with cleanup.dest alloca promotion.

I'm missing something here. So the problem is this code, right:

%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
call void @free(i8* %mem)
%x = load i32, i32* %slot

If we have an alloca, %slot in this case, that is promoted to live in the coroutine frame, why don't you just extract it again if it's needed after the call to coro.free?

Currently, this gets transformed into:

define internal fastcc void @happy.case.destroy(%happy.case.Frame* %FramePtr) {
entry.destroy:
  %vFrame = bitcast %happy.case.Frame* %FramePtr to i8*
  %slot.reload.addr = getelementptr inbounds %happy.case.Frame, %happy.case.Frame* %FramePtr, i32 0, i32 4
  call void @free(i8* %vFrame)
  %x = load i32, i32* %slot.reload.addr
  call void @print.i32(i32 %x)
  ret void
}

How about, for all uses of alloca after the original call to @llvm.coro.free, you should create a new alloca in the foo.destroy function, copy the value from the frame into that new alloca, and then use that alloca in the later code? Like this:

define internal fastcc void @happy.case.destroy(%happy.case.Frame* %FramePtr) {
entry.destroy:
  %slot = alloca i32
  %vFrame = bitcast %happy.case.Frame* %FramePtr to i8*
  %slot.reload.addr = getelementptr inbounds %happy.case.Frame, %happy.case.Frame* %FramePtr, i32 0, i32 4
  call void @llvm.memcpy.p0i32.p0i32.i32(i32* %slot, i32* %slot.reload.addr, i32 4, i32 4, i1 false)
  call void @free(i8* %vFrame)
  %x = load i32, i32* %slot
  call void @print.i32(i32 %x)
  ret void
}

How about, for all uses of alloca after the original call to @llvm.coro.free, you should create a new alloca in the foo.destroy function, copy the value from the frame into that new alloca, and then use that alloca in the later code? Like this:

I see where you were going. I'll prototype and see how it looks. Thank you for your feedback.

Moved the promotion to CoroEarly to make sure no optimization passes can intervene.

@hfinkel, I looked at your suggestion, but, I think it is difficult to make it work in the general case and the transformation should be done before coroutine is split.
Consider a case of a fire and forget coroutine (no final suspend point, so it cleans itself when it runs to completion.
Now, if we keep destination slot in the coroutine frame and try to patch up the load, similar to your suggestion, in the initial function, there is no safe place to place the load, since the coroutine frame can be destroyed by a different thread that resumed the coroutine and it managed to run to the completion before the initiating function finished. The approach in this patch is simple, uses existing algorithm and takes care of this case by making sure that cleanup slot is never stored in memory.

Added the test case

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

coro-early pass is run as early pass during "cleanup after a frontend" phase of the optimizer. Thus, arbitrarly inlined code cannot happen. I also put out for review an alternative fix that eliminates cleanup.dest.slot for coroutines in the CodeGenFunction::FinishFunction: https://reviews.llvm.org/D39768

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

coro-early pass is run as early pass during "cleanup after a frontend" phase of the optimizer. Thus, arbitrarly inlined code cannot happen.

Hmm. After digging through source code for a while, it looks like this does precede the always-inliner, chiefly by virtue of being added to the function pass manager which seems to be run before the module passes. Still, this really feels like a hack that's going to come back to bite us. What if there's some weird control-flow feature elsewhere in the function that prevents mem2reg from working completely? I assume the direct bug you're trying to fix is that the frame is destroyed before your last use of the cleanup-dest alloca; is there a strong argument why this problem is specific to just the cleanup-dest alloca? Are we going to end up tightly integrating coroutine emission (which, with all due respect, is going to be very lightly tested in terms of the language features it interacts with) with of the cleanup and cleanup-emission logic across IRGen? I'm sorry if some of these points have already been hashed out elsewhere; if you want to just refer me to that discussion, please do.

If this really is special to cleanup dests, and we have a compelling argument that mem2reg will always succeed, then I think we can accept this patch. The other option is to find an IR-generaton pattern that avoids needing to access the cleanup-dest alloca after the destruction of the frame at all.