This is an archive of the discontinued LLVM Phabricator instance.

[Coroutine] Check indirect uses of alloca when checking lifetime info
ClosedPublic

Authored by lxfind on Feb 17 2021, 5:34 PM.

Details

Summary

In the existing logic, we look at the lifetime.start marker of each alloca and check all uses of the alloca, to see if any pair of the lifetime marker and a use of the alloca crosses a suspension point.
This approach is unfortunately incorrect. A use of an alloca does not need to be a direct use; it can be an indirect use through an alias.
Only checking direct uses can miss cases where indirect uses cross a suspension point.
This can be demonstrated in the newly added test case 007.
In the test case, both x and y are used directly only prior to the suspend, but they are captured into an alias, merged through a PHINode (so they couldn't be materialized), and used after CoroSuspend.
If we only check whether the lifetime starts cross suspension points against direct uses, we will put the allocas on the stack, and then capture their addresses in the frame.
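
For illustration, a minimal C++ analogue of that pattern (the actual test 007 is written directly in IR; the task type and names below are invented just so the snippet is self-contained):

#include <coroutine>

// Minimal return type so the example compiles; not the type used in the test.
struct task {
  struct promise_type {
    task get_return_object() { return {}; }
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_value(int) {}
    void unhandled_exception() {}
  };
};

// x and y are used directly only before the co_await, but their addresses are
// merged into one pointer (a PHINode after mem2reg) and dereferenced after the
// suspend point, so both allocas must end up on the coroutine frame.
task f(bool cond) {
  int x = 1;
  int y = 2;
  int *p;
  if (cond)
    p = &x;
  else
    p = &y;
  co_await std::suspend_always{};
  co_return *p; // indirect use of x or y across the suspension point
}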

Instead of fixing it in D96441 and D96566, this patch takes a different approach which I think is better.
We still check the lifetime info in the same way as before, but with two differences:

  1. The collection of lifetime.start markers is moved into AllocaUseVisitor to keep the logic in one place.
  2. When looking at lifetime.start and use pairs, we check not only the direct uses as before, but all uses collected by AllocaUseVisitor, which include all indirect uses through aliases. This makes the analysis more accurate without throwing away the lifetime optimization.

Diff Detail

Event Timeline

lxfind created this revision. Feb 17 2021, 5:34 PM
lxfind requested review of this revision. Feb 17 2021, 5:34 PM
Herald added a project: Restricted Project. · View Herald Transcript · Feb 17 2021, 5:34 PM
ChuanqiXu added inline comments. Feb 17 2021, 6:37 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
1012

usees -> uses

1025

Should we add else return true;? Since Users already contains the alias of the alloca, the code above already checks whether lifetime start info and the indirect uses cross suspend points.

lxfind updated this revision to Diff 324495. Feb 17 2021, 7:06 PM

address comments

In general, we cannot ever assume that we can see all uses of the alloca, whether direct or indirect; we can do our best when we can reason about particular uses, but we have to have some ability to recognize that the address may have escaped and therefore basically any point within the lifetime of the alloca might be a use. It doesn't seem like this gets us to that point yet.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
1010

precise

In general, we cannot ever assume that we can see all uses of the alloca, whether direct or indirect; we can do our best when we can reason about particular uses, but we have to have some ability to recognize that the address may have escaped and therefore basically any point within the lifetime of the alloca might be a use. It doesn't seem like this gets us to that point yet.

Maybe we need to introduce AA or MemorySSA to this pass to do that. From my point of view, the current solution (via AllocaUseVisitor) can serve as a workaround. We could make the analysis in AllocaUseVisitor as conservative as we can until we introduce MemorySSA or AA.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
1028

Maybe we need to move the check of PI.isEscaped() to the front. Below is an example I found:

%a = alloca ...
%b = alloca ...
llvm.lifetime.start(%a)
call @f(%a, %b) ; %a may alias %b, and `Users` doesn't contain `%b`
suspend point
; use of %b ; so the memory access may not be right

In general, we cannot ever assume that we can see all uses of the alloca, whether direct or indirect; we can do our best when we can reason about particular uses, but we have to have some ability to recognize that the address may have escaped and therefore basically any point within the lifetime of the alloca might be a use. It doesn't seem like this gets us to that point yet.

Maybe we need to introduce AA or MemorySSA to this pass to do that. From my point of view, the current solution (via AllocaUseVisitor) can serve as a workaround. We could make the analysis in AllocaUseVisitor as conservative as we can until we introduce MemorySSA or AA.

We don't need more information to do that; we need to use the information we have *correctly*. When you're analyzing uses of the alloca, you need to deal with uses that you don't fully understand by just giving up on the analysis and assuming that the object is used for its entire lifetime (as limited by lifetime intrinsics, if possible, but the entire function if not).
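
To illustrate the rule being described, here is a minimal sketch, not code from this patch: the helper name is hypothetical, while the LLVM types and calls used are existing APIs. It classifies an alloca's direct users and gives up on anything it does not understand.

#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Returns true if every direct user of AI was understood; false means "give up"
// and conservatively treat the alloca as used for its entire lifetime.
static bool allUsesUnderstood(AllocaInst &AI) {
  for (User *U : AI.users()) {
    if (isa<LoadInst>(U))
      continue; // plain load from the alloca: understood
    if (auto *SI = dyn_cast<StoreInst>(U)) {
      if (SI->getValueOperand() == &AI)
        return false; // the address itself is stored somewhere: it escapes
      continue;       // plain store into the alloca: understood
    }
    if (auto *II = dyn_cast<IntrinsicInst>(U))
      if (II->isLifetimeStartOrEnd())
        continue; // lifetime markers: understood
    // Bitcasts, GEPs, PHIs, calls, etc. would need recursive handling; for the
    // purpose of this sketch we simply bail out, which is the safe answer.
    return false;
  }
  return true;
}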

When you're analyzing uses of the alloca, you need to deal with uses that you don't fully understand by just giving up on the analysis and assuming that the object is used for its entire lifetime (as limited by lifetime intrinsics, if possible, but the entire function if not).

Yes, this is the problem that the AllocaUseVisitor is trying to solve. Maybe the implementation details have bugs, but I think our direction is the same.

We don't need more information to do that; we need to use the information we have *correctly*. When you're analyzing uses of the alloca, you need to deal with uses that you don't fully understand by just giving up on the analysis and assuming that the object is used for its entire lifetime (as limited by lifetime intrinsics, if possible, but the entire function if not).

The fundamental problem (and what makes this so hard) is that we need to be very accurate to make the program correct. Being conservative is insufficient for correctness. This is due to the fact that we often access data after the frame is destroyed, and anything accessed after the frame is destroyed cannot be on the frame. So for those data we need to be able to accurately tell if they should be on the stack or if the developer has a bug.

Below is some IR that troubled me for a while (some irrelevant lines removed to simplify it). The alloca in question is %__coro_gro. Its lifetime starts early, and it is used after coro.end.
Since it's used after coro.end, at which point the coroutine frame is already destroyed,
__coro_gro cannot be put on the frame and has to be put on the stack.
However, technically it lives across suspension points and so should be put on the frame.
The current implementation happens to work and puts it on the stack because the suspension checker stops propagating when it sees coro.end. So any use of data after coro.end is considered not to cross a suspension point to anything before coro.suspend.
Also, I am not sure if we can prove that _ZN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEE17get_return_objectEv never captures %__coro_gro. Maybe a function cannot capture an argument that's sret? Not sure.

; Function Attrs: sanitize_address uwtable
define hidden i64 @_Z3foov() #3 personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
entry:
  %__coro_gro = alloca %"struct.folly::expected_detail::PromiseReturn", align 8
  %__promise = alloca %"struct.folly::expected_detail::Promise", align 8
  %0 = bitcast %"struct.folly::expected_detail::Promise"* %__promise to i8*
  %1 = call token @llvm.coro.id(i32 16, i8* nonnull %0, i8* bitcast (i64 ()* @_Z3foov to i8*), i8* null)
  %2 = call i1 @llvm.coro.alloc(token %1)
  br i1 %2, label %coro.alloc, label %invoke.cont21

coro.alloc:                                       ; preds = %entry
  %3 = tail call i64 @llvm.coro.size.i64()
  %call = call i8* @_Znwm(i64 %3)
  br label %invoke.cont21

invoke.cont21:                                    ; preds = %coro.alloc, %entry
  %4 = phi i8* [ null, %entry ], [ %call, %coro.alloc ]
  %5 = call i8* @llvm.coro.begin(token %1, i8* %4) #13
  %6 = bitcast %"struct.folly::expected_detail::PromiseReturn"* %__coro_gro to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %6) #6
  call fastcc void @_ZN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEE17get_return_objectEv(%"struct.folly::expected_detail::PromiseReturn"* nonnull sret %__coro_gro, %"struct.folly::expected_detail::Promise"* nonnull %__promise) #6
  %call26 = call fastcc zeroext i1 @_ZNK5folly15expected_detail9AwaitableIiN12_GLOBAL__N_13ErrEE11await_readyEv(%"struct.folly::expected_detail::Awaitable"* nonnull %tmpcast) #6
  br i1 %call26, label %cont37, label %await.suspend

await.suspend:                                    ; preds = %invoke.cont21
  %10 = call token @llvm.coro.save(i8* null)
  %11 = bitcast %"class.std::experimental::coroutines_v1::coroutine_handle.21"* %retval.i109 to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %11)
  call fastcc void @_ZNSt12experimental13coroutines_v116coroutine_handleIN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEEEEC2Ev(%"class.std::experimental::coroutines_v1::coroutine_handle.21"* nonnull %retval.i109) #6
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %11)
  invoke fastcc void @_ZN5folly15expected_detail9AwaitableIiN12_GLOBAL__N_13ErrEE13await_suspendIiEEvNSt12experimental13coroutines_v116coroutine_handleINS0_7PromiseIT_S3_EEEE(%"struct.folly::expected_detail::Awaitable"* nonnull %tmpcast, i8* %5)
          to label %invoke.cont35 unwind label %lpad

invoke.cont35:                                    ; preds = %await.suspend
  %12 = call i8 @llvm.coro.suspend(token %10, i1 false)
  switch i8 %12, label %coro.ret [
    i8 0, label %cont37
    i8 1, label %cleanup49
  ]

...

cleanup83:                                        ; preds = %invoke.cont54, %cleanup49, %cleanup49.thread
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %0) #6
  %24 = call i8* @llvm.coro.free(token %1, i8* %5)
  %25 = icmp eq i8* %24, null
  br i1 %25, label %coro.ret, label %coro.free

coro.free:                                        ; preds = %cleanup83
  %26 = tail call i64 @llvm.coro.size.i64()
  call void @_ZdlPvm(i8* nonnull %24, i64 %26) #6
  br label %coro.ret

coro.ret:                                         ; preds = %coro.free, %cleanup83, %invoke.cont35
  %27 = call i1 @llvm.coro.end(i8* null, i1 false) #13
  %call95 = invoke fastcc i64 @_ZNR5folly15expected_detail13PromiseReturnIiN12_GLOBAL__N_13ErrEEcvNS_8ExpectedIiS3_EEEv(%"struct.folly::expected_detail::PromiseReturn"* nonnull %__coro_gro)
          to label %cleanup.action unwind label %lpad93
...
}

We don't need more information to do that; we need to use the information we have *correctly*. When you're analyzing uses of the alloca, you need to deal with uses that you don't fully understand by just giving up on the analysis and assuming that the object is used for its entire lifetime (as limited by lifetime intrinsics, if possible, but the entire function if not).

The fundamental problem (and what makes this so hard) is that we need to be very accurate to make the program correct. Being conservative is insufficient for correctness. This is due to the fact that we often access data after the frame is destroyed, and anything accessed after the frame is destroyed cannot be on the frame. So for those data we need to be able to accurately tell if they should be on the stack or if the developer has a bug.

Below is some IR that troubled me for a while (some irrelevant lines removed to simplify it). The alloca in question is %__coro_gro. Its lifetime starts early, and it is used after coro.end.
Since it's used after coro.end, at which point the coroutine frame is already destroyed,
__coro_gro cannot be put on the frame and has to be put on the stack.
However, technically it lives across suspension points and so should be put on the frame.
The current implementation happens to work and puts it on the stack because the suspension checker stops propagating when it sees coro.end. So any use of data after coro.end is considered not to cross a suspension point to anything before coro.suspend.
Also, I am not sure if we can prove that _ZN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEE17get_return_objectEv never captures %__coro_gro. Maybe a function cannot capture an argument that's sret? Not sure.

; Function Attrs: sanitize_address uwtable
define hidden i64 @_Z3foov() #3 personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*) {
entry:
  %__coro_gro = alloca %"struct.folly::expected_detail::PromiseReturn", align 8
  %__promise = alloca %"struct.folly::expected_detail::Promise", align 8
  %0 = bitcast %"struct.folly::expected_detail::Promise"* %__promise to i8*
  %1 = call token @llvm.coro.id(i32 16, i8* nonnull %0, i8* bitcast (i64 ()* @_Z3foov to i8*), i8* null)
  %2 = call i1 @llvm.coro.alloc(token %1)
  br i1 %2, label %coro.alloc, label %invoke.cont21

coro.alloc:                                       ; preds = %entry
  %3 = tail call i64 @llvm.coro.size.i64()
  %call = call i8* @_Znwm(i64 %3)
  br label %invoke.cont21

invoke.cont21:                                    ; preds = %coro.alloc, %entry
  %4 = phi i8* [ null, %entry ], [ %call, %coro.alloc ]
  %5 = call i8* @llvm.coro.begin(token %1, i8* %4) #13
  %6 = bitcast %"struct.folly::expected_detail::PromiseReturn"* %__coro_gro to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %6) #6
  call fastcc void @_ZN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEE17get_return_objectEv(%"struct.folly::expected_detail::PromiseReturn"* nonnull sret %__coro_gro, %"struct.folly::expected_detail::Promise"* nonnull %__promise) #6
  %call26 = call fastcc zeroext i1 @_ZNK5folly15expected_detail9AwaitableIiN12_GLOBAL__N_13ErrEE11await_readyEv(%"struct.folly::expected_detail::Awaitable"* nonnull %tmpcast) #6
  br i1 %call26, label %cont37, label %await.suspend

await.suspend:                                    ; preds = %invoke.cont21
  %10 = call token @llvm.coro.save(i8* null)
  %11 = bitcast %"class.std::experimental::coroutines_v1::coroutine_handle.21"* %retval.i109 to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %11)
  call fastcc void @_ZNSt12experimental13coroutines_v116coroutine_handleIN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEEEEC2Ev(%"class.std::experimental::coroutines_v1::coroutine_handle.21"* nonnull %retval.i109) #6
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %11)
  invoke fastcc void @_ZN5folly15expected_detail9AwaitableIiN12_GLOBAL__N_13ErrEE13await_suspendIiEEvNSt12experimental13coroutines_v116coroutine_handleINS0_7PromiseIT_S3_EEEE(%"struct.folly::expected_detail::Awaitable"* nonnull %tmpcast, i8* %5)
          to label %invoke.cont35 unwind label %lpad

invoke.cont35:                                    ; preds = %await.suspend
  %12 = call i8 @llvm.coro.suspend(token %10, i1 false)
  switch i8 %12, label %coro.ret [
    i8 0, label %cont37
    i8 1, label %cleanup49
  ]

...

cleanup83:                                        ; preds = %invoke.cont54, %cleanup49, %cleanup49.thread
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %0) #6
  %24 = call i8* @llvm.coro.free(token %1, i8* %5)
  %25 = icmp eq i8* %24, null
  br i1 %25, label %coro.ret, label %coro.free

coro.free:                                        ; preds = %cleanup83
  %26 = tail call i64 @llvm.coro.size.i64()
  call void @_ZdlPvm(i8* nonnull %24, i64 %26) #6
  br label %coro.ret

coro.ret:                                         ; preds = %coro.free, %cleanup83, %invoke.cont35
  %27 = call i1 @llvm.coro.end(i8* null, i1 false) #13
  %call95 = invoke fastcc i64 @_ZNR5folly15expected_detail13PromiseReturnIiN12_GLOBAL__N_13ErrEEcvNS_8ExpectedIiS3_EEEv(%"struct.folly::expected_detail::PromiseReturn"* nonnull %__coro_gro)
          to label %cleanup.action unwind label %lpad93
...
}

It seems really rare to me that a program would try to access a variable after coro.end. I can't imagine the corresponding C++ code right now. Does this program call @_ZNR5folly15expected_detail13PromiseReturnIiN12_GLOBAL__N_13ErrEEcvNS_8ExpectedIiS3_EEEv in the destructor of promise_type? If that is the case, maybe the problem is related to the destruction order of the coroutine frame and the promise_type. Not sure.

The current implementation happens to work and puts it on the stack because the suspension checker stops propagating when it sees coro.end. So any use of data after coro.end is considered not to cross a suspension point to anything before coro.suspend.

If that's the solution, we should add a verification phase after we build the coroutine frame to check whether there is any use of a frame variable after coro.end.

Also, I am not sure if we can prove that _ZN5folly15expected_detail7PromiseIiN12_GLOBAL__N_13ErrEE17get_return_objectEv never captures %__coro_gro. Maybe a function cannot capture an argument that's sret?

I think we could prove that only if we can inline get_return_object into foo. I am also not sure whether a function can capture an sret argument. But I think case-by-case fixes may not be a good direction.

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

IIRC, coroutine developers have already complained about the design of coroutines many times, e.g. in D96928. But just as @lxfind says, it is impossible to redesign the mechanism in a short time. I agree that we should discuss how to restructure coroutine lowering to solve these problems fundamentally. Before we have a clear solution, I think it is OK to fix bugs with patches like this one and to add a verification phase to tell the user what's going wrong.

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

A special alloca for coroutines will certainly help but it's not sufficient. A special alloca and a set of special rules around data allocation in coroutines can help resolve the escape-before-coro.begin issue (though I haven't seen any new bugs around that).
There are other problems that need to be solved. I need some more time to think about how to put them all together.
I think we are doing OK identifying variables that need to be on the heap (a conservative approach will do just that).
But it gets complex when we also need to be able to identify a set of variables that MUST be put on the stack. These are the variables that are being used after a coroutine is suspended (i.e. coro.suspend switch goes to the default destination), at which point the coroutine frame may have been destroyed.
In the example I had above, the coroutine return object is allocated and returned in the very end. Since it's the return object, it has to be on the stack. But the compiler isn't powerful enough to tell that for two reasons:

  1. It appears to live across coro.suspend, but it is really only used on the suspend path, not on any other path.
  2. It's constructed by calling a function with an sret struct pointer, which suggests that the pointer may be captured, further complicating the problem.

I think a proper solution might require introducing a new IR instruction for coroutine suspension, instead of relying on intrinsics. I reached a similar conclusion in another bug that involves LICM and coroutines.

To make this patch fully correct, we need to first check PI.isEscaped() before relying on lifetime information.
Then for the return object case to work, we have two short-term solutions:

  1. We can treat sret as nocapture. This isn't 100% accurate: the only case where C++ can capture an sret pointer is when the constructor captures "this". But even if that happens, for the "this" to really escape, the object needs to be used again. So I think we should be safe there.
  2. We can special-case the return object variable and make sure it's always on the stack. This should also work, I think.

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

Actually, this gives me an idea: perhaps we can emit metadata on the allocas to teach CoroSplit which ones must stay on the stack.

lxfind updated this revision to Diff 326221. Feb 24 2021, 3:29 PM

Added detailed documentation on next steps. While I need to continue iterating on them, this patch incrementally improves the existing code by considering indirect uses, so it should be a pure win.

ChuanqiXu accepted this revision. Feb 24 2021, 6:01 PM

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

A special alloca for coroutines will certainly help but it's not sufficient. A special alloca and a set of special rules around data allocation in coroutines can help resolve the escape-before-coro.begin issue (though I haven't seen any new bugs around that).
There are other problems that need to be solved. I need some more time to think about how to put them all together.
I think we are doing OK identifying variables that need to be on the heap (a conservative approach will do just that).
But it gets complex when we also need to be able to identify a set of variables that MUST be put on the stack. These are the variables that are being used after a coroutine is suspended (i.e. coro.suspend switch goes to the default destination), at which point the coroutine frame may have been destroyed.
In the example I had above, the coroutine return object is allocated and returned in the very end. Since it's the return object, it has to be on the stack. But the compiler isn't powerful enough to tell that for two reasons:

  1. It appears to live across coro.suspend, but it is really only used on the suspend path, not on any other path.
  2. It's constructed by calling a function with an sret struct pointer, which suggests that the pointer may be captured, further complicating the problem.

I think a proper solution might require introducing a new IR instruction for coroutine suspension, instead of relying on intrinsics. I reached a similar conclusion in another bug that involves LICM and coroutines.

To make this patch fully correct, we need to first check PI.isEscaped() before relying on lifetime information.
Then for the return object case to work, we have two short-term solutions:

  1. We can treat sret as nocapture. This isn't 100% accurate: the only case where C++ can capture an sret pointer is when the constructor captures "this". But even if that happens, for the "this" to really escape, the object needs to be used again. So I think we should be safe there.
  2. We can special-case the return object variable and make sure it's always on the stack. This should also work, I think.

Introducing a new IR instruction and other special rules may be a long-term goal. And for the issue introduced by get_return_object, it looks better to emit special metadata for the alloca of get_return_object in the Clang frontend.

this patch incrementally improves the existing code by considering indirect uses, so it should be a pure win

Yes, this Patch LGTM.

This revision is now accepted and ready to land. Feb 24 2021, 6:01 PM

Is there any way we can simplify this problem for ourselves by expecting IR to be emitted differently within coroutines? Because I think the true fundamental problem here is that LLVM's representation of stack allocations is not very good, and we're stuck trying to do an analysis that the representation isn't suited for at all. There's no real chance that LLVM in general is going to move away from alloca, but if we can recognize when we're emitting one of these allocations that needs to overlap coro.begin or coro.end and just do it with a special intrinsic that we treat differently in lowering, that might substantially reduce the difficulty of the problem. If necessary, we can even change the rules for inlining into unsplit coroutines.

A special alloca for coroutines will certainly help but it's not sufficient. A special alloca and a set of special rules around data allocation in coroutines can help resolve the escape-before-coro.begin issue (though I haven't seen any new bugs around that).

Well, even if we haven't seen any new bugs, a more robust solution would be welcome; it's hard to feel confident in what's happening there.

There are other problems that need to be solved. I need some more time to think about how to put them all together.
I think we are doing OK identifying variables that need to be on the heap (a conservative approach will do just that).
But it gets complex when we also need to be able to identify a set of variables that MUST be put on the stack. These are the variables that are being used after a coroutine is suspended (i.e. coro.suspend switch goes to the default destination), at which point the coroutine frame may have been destroyed.

Yeah, I agree with this basic statement of the problem. If we can figure out a way to reliably recognize the allocations that have to be on the stack but are really just meant to be local to various special computations which aren't fully part of the coroutine, that would make this much easier.

I'm not sure I understand what this allocation is that's only being used after the coroutine frame is destroyed. Where is it coming from? If it's actually just a local allocation in a function that we're inlining into the epilogue (a destructor?), can we reasonably delay that inlining to only happen after splitting? I think in any case we'll need to limit early inlining in those sections if we want to have stronger structural rules about allocation for them; either that or we'll need to be able to introduce our stronger rules during inlining.

In the example I had above, the coroutine return object is allocated and returned in the very end. Since it's the return object, it has to be on the stack. But the compiler isn't powerful enough to tell that for two reasons:

  1. It appears to live across coro.suspend, but it is really only used on the suspend path, not on any other path.
  2. It's constructed by calling a function with an sret struct pointer, which suggests that the pointer may be captured, further complicating the problem.

I don't quite understand what you're saying here. If we're returning the return object indirectly (this is the return object from the ramp function, i.e. the coroutine value itself, right?), we should be constructing it in-place, and there's no local allocation for it. Are we constructing it into a temporary and then copying that (with memcpy? or a copy constructor?) for the actual return?

I think a proper solution might require introducing a new IR instruction for coroutine suspension, instead of relying on intrinsics. I reached a similar conclusion in another bug that involves LICM and coroutines.

Hmm. I can certainly believe that some of the code-motion passes need to understand that they can't always move things around suspensions. One problem with actually changing to an instruction, though, is that coroutine suspension actually looks quite different in the different lowerings.

To make this patch fully correct, we need to first check PI.isEscaped() before relying on lifetime information.

Yeah, this is the main thing I was trying to point out.

Then for the return object case to work, we have two short-term solutions:

  1. We can treat sret as nocapture. This isn't 100% accurate: the only case where C++ can capture an sret pointer is when the constructor captures "this". But even if that happens, for the "this" to really escape, the object needs to be used again. So I think we should be safe there.

Once it's constructed and has possibly escaped, basically any side-effectful thing might indirectly access it. Usually functions don't have a problem with captures of this for the return value because constructing the return value is the last thing that happens in the function, except for a few destructors. That might be legitimately different in C++-style coroutines.
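
As a contrived C++ illustration of the one escape route mentioned above (invented names; nothing here is from the patch):

struct Widget;
Widget *last_constructed = nullptr; // hypothetical global that captures `this`

struct Widget {
  long payload[4] = {};                 // large enough to be returned via sret
  Widget() { last_constructed = this; } // `this` is the caller's sret slot
};

// With guaranteed copy elision the object is constructed directly into the
// caller's return slot, so last_constructed now aliases the caller's storage:
// the sret pointer has effectively escaped.
Widget make_widget() { return Widget(); }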

  1. We can special-case the return object variable and make sure it's always on the stack. This should also work, I think.

Yes, if Clang is emitting this return object variable as part of emitting the return from the ramp function, it should be easy to force it to use a different pattern, or at least mark it with some metadata.

Well, even if we haven't seen any new bugs, a more robust solution would be welcome; it's hard to feel confident in what's happening there.

I agree, although I am leaning towards a different approach than specializing allocations. I am thinking about introducing another ramp function whose sole job is to allocate the frame, copy the parameters, and then jump to the real ramp function. This will prevent any random instructions from getting moved before coro.begin, because now we have a function boundary.

I'm not sure I understand what this allocation is that's only being used after the coroutine frame is destroyed. Where is it coming from? If it's actually just a local allocation in a function that we're inlining into the epilogue (a destructor?), can we reasonably delay that inlining to only happen after splitting? I think in any case we'll need to limit early inlining in those sections if we want to have stronger structural rules about allocation for them; either that or we'll need to be able to introduce our stronger rules during inlining.

It's not due to inlining. There are two cases here. One is the return object case (which I will talk a bit more about below). The other is when a call to await_suspend returns a handle and we need to do symmetric transfer by resuming that returned handle. In a case like this, the current coroutine (the one that is being suspended) may have been destroyed already. So none of the instructions in between the call to await_suspend and the resume of the handle should touch the coroutine frame. This used to be quite complicated to deal with, but I patched it up in the front end a few months ago so that the lifetime of any temporaries used in that range is as short as possible, so that a decent lifetime/usage analysis should be able to decide that they need to be on the stack. However, I still worry that if there is any function call there that appears to escape (especially when objects are constructed through sret), it will be broken.
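
For reference, a minimal sketch of the symmetric-transfer shape being described (a hypothetical awaiter, not from the patch):

#include <coroutine>

// await_suspend hands back another coroutine's handle; between the suspend of
// the current coroutine and the resume of `next`, the current frame may already
// have been destroyed, so nothing materialized in that window may live on it.
struct transfer_to {
  std::coroutine_handle<> next;
  bool await_ready() const noexcept { return false; }
  std::coroutine_handle<>
  await_suspend(std::coroutine_handle<>) const noexcept {
    return next; // symmetric transfer: the caller resumes `next` immediately
  }
  void await_resume() const noexcept {}
};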

I don't quite understand what you're saying here. If we're returning the return object indirectly (this is the return object from the ramp function, i.e. the coroutine value itself, right?), we should be constructing it in-place, and there's no local allocation for it. Are we constructing it into a temporary and then copying that (with memcpy? or a copy constructor?) for the actual return?

The return object is obtained by calling the get_return_object() function on the promise. And since the return object is a struct, we first emit an alloca for the return object and then call the get_return_object() function with an sret parameter. In that case, the return object isn't actually captured, but the compiler doesn't know that, which makes it tricky.
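
A hypothetical reduction of that shape (invented names, standing in for folly's PromiseReturn/Expected):

#include <coroutine>

struct Result {
  struct promise_type {
    Result get_return_object(); // defined out of line, so the body is opaque
    std::suspend_never initial_suspend() { return {}; }
    std::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  promise_type *promise = nullptr; // back-pointer, like folly's PromiseReturn
  long payload[3] = {};            // large enough that the object is returned
                                   // indirectly (sret) rather than in registers
};

// The ramp function only sees a call that writes through an sret pointer into
// its local alloca (the __coro_gro-style temporary); without inlining it must
// assume that pointer may have been captured, even though it is not.
Result Result::promise_type::get_return_object() { return Result{this}; }

Result coro() { co_return; }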

Hmm. I can certainly believe that some of the code-motion passes need to understand that they can't always move things around suspensions. One problem with actually changing to an instruction, though, is that coroutine suspension actually looks quite different in the different lowerings.

It's possible that we may diverge more and more between these different lowerings.

Well, even if we haven't seen any new bugs, a more robust solution would be welcome; it's hard to feel confident in what's happening there.

I agree, although I am leaning towards a different approach than specializing allocations. I am thinking about introducing another ramp function whose sole job is to allocate the frame, copy the parameters, and then jump to the real ramp function. This will prevent any random instructions from getting moved before coro.begin, because now we have a function boundary.

That seems like a very good approach. And if you really want to avoid the call overhead, you can always force-inline the coroutine ramp after lowering.

I'm not sure I understand what this allocation is that's only being used after the coroutine frame is destroyed. Where is it coming from? If it's actually just a local allocation in a function that we're inlining into the epilogue (a destructor?), can we reasonably delay that inlining to only happen after splitting? I think in any case we'll need to limit early inlining in those sections if we want to have stronger structural rules about allocation for them; either that or we'll need to be able to introduce our stronger rules during inlining.

It's not due to inlining. There are two cases here. One is the return object case (which I will talk a bit more about below). The other is when a call to await_suspend returns a handle and we need to do symmetric transfer by resuming that returned handle. In a case like this, the current coroutine (the one that is being suspended) may have been destroyed already. So none of the instructions in between the call to await_suspend and the resume of the handle should touch the coroutine frame. This used to be quite complicated to deal with, but I patched it up in the front end a few months ago so that the lifetime of any temporaries used in that range is as short as possible, so that a decent lifetime/usage analysis should be able to decide that they need to be on the stack. However, I still worry that if there is any function call there that appears to escape (especially when objects are constructed through sret), it will be broken.

Ah, I see that I've failed to know about yet another feature of C++ coroutines.

I don't quite understand what you're saying here. If we're returning the return object indirectly (this is the return object from the ramp function, i.e. the coroutine value itself, right?), we should be constructing it in-place, and there's no local allocation for it. Are we constructing it into a temporary and then copying that (with memcpy? or a copy constructor?) for the actual return?

The return object is obtained by calling the get_return_object() function on the promise. And since the return object is a struct, we first emit an alloca for the return object and then call the get_return_object() function with an sret parameter. In that case, the return object isn't actually captured, but the compiler doesn't know that, which makes it tricky.

Should the return object variable actually be NRVO'ed? It certainly sounds like it could be.

Hmm. I can certainly believe that some of the code-motion passes need to understand that they can't always move things around suspensions. One problem with actually changing to an instruction, though, is that coroutine suspension actually looks quite different in the different lowerings.

It's possible that we may diverge more and more between these different lowerings.

It's possible, but if we're going to start adding "harder" IR support, we should make an effort to see if we can find common ground first.

ChuanqiXu added a comment. Edited Feb 24 2021, 10:49 PM

Well, even if we haven't seen any new bugs, a more robust solution would be welcome; it's hard to feel confident in what's happening there.

I agree, although I am leaning towards a different approach than specializing allocations. I am thinking about introducing another ramp function whose sole job is to allocate the frame, copy the parameters, and then jump to the real ramp function. This will prevent any random instructions from getting moved before coro.begin, because now we have a function boundary.

This idea seems to mean emitting two functions (the new ramp and the original one, whose argument is the FramePtr) in the frontend. If that is the approach, it could achieve the goal I tried to discuss on llvm-dev before: normalizing the ramp function. So I think it would be a good idea too.

Should the return object variable actually be NRVO'ed? It certainly sounds like it could be.

IIRC, the return object would be RVO'ed if its size exceeds some threshold.