This is an archive of the discontinued LLVM Phabricator instance.

[Coroutine][Clang] Force emit lifetime intrinsics for Coroutines
ClosedPublic

Authored by lxfind on Mar 23 2021, 4:53 PM.

Details

Summary

tl;dr Correct implementation of Corouintes requires having lifetime intrinsics available.

Coroutine functions are functions that can be suspended and resumed latter. To do so, data that need to stay alive after suspension must be put on the heap (i.e. the coroutine frame).
The optimizer is responsible for analyzing each AllocaInst and figure out whether it should be put on the stack or the frame.
In most cases, for data that we are unable to accurately analyze lifetime, we can just conservatively put them on the heap.
Unfortunately, there exists a few cases where certain data MUST be put on the stack, not on the heap. Without lifetime intrinsics, we are unable to correctly analyze those data's lifetime.

To dig into more details, there exists cases where at certain code points, the current coroutine frame may have already been destroyed. Hence no frame access would be allowed beyond that point.
The following is a common code pattern called "Symmetric Transfer" in coroutine:

auto tmp = await_suspend();
__builtin_coro_resume(tmp.address());
return;

In the above code example, await_suspend() returns a new coroutine handle, which we will obtain the address and then resume that coroutine. This essentially "transfered" from the current coroutine to a different coroutine.
During the call to await_suspend(), the current coroutine may be destroyed, which should be fine because we are not accessing any data afterwards.
However when LLVM is emitting IR for the above code, it needs to emit an AllocaInst for tmp. It will then call the address function on tmp. address function is a member function of coroutine, and there is no way for the LLVM optimizer to know that it does not capture the tmp pointer. So when the optimizer looks at it, it has to conservatively assume that tmp may escape and hence put it on the heap. Furthermore, in some cases address call would be inlined, which will generate a bunch of store/load instructions that move the tmp pointer around. Those stores will also make the compiler to think that tmp might escape.
A repro of crash can be found here: https://godbolt.org/z/KvPY66
To summarize, it's really difficult for the mid-end to figure out that the tmp data is short-lived.
I made some attempt in D98638, but it appears to be way too complex and is basically doing the same thing as inserting lifetime intrinsics in coroutines.

Also, for reference, we already force emitting lifetime intrinsics in O0 for AlwaysInliner: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Passes/PassBuilder.cpp#L1893

I need to fix a few tests. But sending this out early for feedback.

Diff Detail

Event Timeline

lxfind created this revision.Mar 23 2021, 4:53 PM
lxfind requested review of this revision.Mar 23 2021, 4:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2021, 4:53 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
lxfind edited the summary of this revision. (Show Details)Mar 23 2021, 4:59 PM
lxfind added reviewers: ChuanqiXu, junparser, rjmccall.

I have no objection to trying to always emit lifetime intrinsics in coroutines since it has a less-trivial runtime cost. I am skeptical that it's reasonable to do this for *correctness*, however; I don't think the frontend unconditionally emits lifetime intrinsics. But since I think this fine to do regardless, I have no objection to the patch.

I think you just set ShouldEmitLifetimeMarkers correctly in the first place instead of adding this as an extra condition to every place that considers it, however.

I am skeptical that it's reasonable to do this for *correctness*, however; I don't think the frontend unconditionally emits lifetime intrinsics.

Sorry, I re-read this after posting, and it's not exactly clear what I was saying. There are a lot of situations where Clang doesn't emit lifetime intrinsics for every alloca it emits, or emits unnecessarily weak bounds. Certain LLVM transforms can also introduce allocas that don't have corresponding lifetime intrinsics. So I think it's problematic to consider it a correctness condition that we're emitting optimally-tight lifetimes.

ChuanqiXu added a comment.EditedMar 23 2021, 7:58 PM

Only one problem I had for emitting lifetime markers even at O0 is that would action make allocas to be optimized even at O0? If so, I wonder if it confuses programmers since they may find some variables disappear surprisingly. Or there would be no optimization since every function would be marked with optnone attribute. I am not sure about this.

If I understand this problem correctly, this patch could fix problems for the return value of symmetric transfer and the gro that we discussed in D98638. Then D98638 may be unneeded. I prefer the implementation in this patch.

clang/lib/CodeGen/CGDecl.cpp
1318 ↗(On Diff #332826)

Can we sure frontend would always call this API to emit lifetime start? I mean the frontend may call EmitIntrinsic or create lifetime.start intrinsic directly whether by IRBuilder::CreateXXX or Instrinsic::Create(...). I worry about if this would incur changes out of design.

Then if we add check in EmitLifetimeStart, why not we add check in EmitLfietimeEnd?

I think you just set ShouldEmitLifetimeMarkers correctly in the first place instead of adding this as an extra condition to every place that considers it, however.

This was set when a CodeGenFunction is constructed, at that point it doesn't yet know if this function is a coroutine.
I could turn ShouldEmitLifetimeMarkers to non-const, and then modify it once it realizes it's a coroutine though, if that's better than the current approach.

Sorry, I re-read this after posting, and it's not exactly clear what I was saying. There are a lot of situations where Clang doesn't emit lifetime intrinsics for every alloca it emits, or emits unnecessarily weak bounds. Certain LLVM transforms can also introduce allocas that don't have corresponding lifetime intrinsics. So I think it's problematic to consider it a correctness condition that we're emitting optimally-tight lifetimes.

I tend to agree. Relying on lifetime for correctness seems fragile.
I wonder if there is a better way to inform optimizer that a "variable" is really a temporary value that should die at the end of an expression?
For instance, whenever we do something simple like:

foo().bar();
co_await ...

If we compile it under -O0 without lifetime intrinsics, the return value of foo() will always be put on the coroutine frame, unless the compiler knows in advance that bar() does not capture.
This becomes a problem if this code appears at a location where the current coroutine frame may be destroyed (but the code itself isn't wrong, it simply doesn't access the frame).
The case for symmetric transfer is exactly this situation.

An alternative to solve the problem for the case of symmetric transfer, is to change the design of symmetric transfer. For example, if we let await_suspend to return void* instead of coroutine_handle, we won't have this problem in the first place, because we no longer need to call address(). Maybe @lewissbaker can comment on the viability of that.

Only one problem I had for emitting lifetime markers even at O0 is that would action make allocas to be optimized even at O0? If so, I wonder if it confuses programmers since they may find some variables disappear surprisingly. Or there would be no optimization since every function would be marked with optnone attribute. I am not sure about this.

It will only cause variables to be put on the stack instead of on the frame, which shouldn't affect developer's view?

If I understand this problem correctly, this patch could fix problems for the return value of symmetric transfer and the gro that we discussed in D98638. Then D98638 may be unneeded. I prefer the implementation in this patch.

I doubt it can fix the gro problem. I will need to double check on that latter.

lxfind added inline comments.Mar 23 2021, 10:58 PM
clang/lib/CodeGen/CGDecl.cpp
1318 ↗(On Diff #332826)

I searched in the codebase, and we always call this API to emit lifetime start in the front-end.
Also, for coroutine to behave correctly, we really only need SD_FullExpression to be able to emit it. Other cases are less critical.

Usually when it emits a LifetimeStart instruction, it will store it somewhere, and latter check on it to decide whether it needs to emit a lifetime end. That's when there is no checks needed for lifetime end.

Only one problem I had for emitting lifetime markers even at O0 is that would action make allocas to be optimized even at O0? If so, I wonder if it confuses programmers since they may find some variables disappear surprisingly. Or there would be no optimization since every function would be marked with optnone attribute. I am not sure about this.

It will only cause variables to be put on the stack instead of on the frame, which shouldn't affect developer's view?

Yes, I am just worry about the variable marked with lifetime intrinsic would be optimized by other passes. But functions would get attribute optnone in O0, my worries may be redundant. Then it is Ok to me to emit lifetime intrinsics all the time.

Is it feasible to outline the initial segment that you don't want to be part of the coroutine, and then have coroutine splitting force that outlined function to be inlined into the ramp function? IIUC, you were saying that the splitting patch was difficult, but maybe thinking about it as outlining simplifies things. I know we had some nasty representational problems with the async lowering that we solved with outlining and force-inlining.

I think you just set ShouldEmitLifetimeMarkers correctly in the first place instead of adding this as an extra condition to every place that considers it, however.

This was set when a CodeGenFunction is constructed, at that point it doesn't yet know if this function is a coroutine.
I could turn ShouldEmitLifetimeMarkers to non-const, and then modify it once it realizes it's a coroutine though, if that's better than the current approach.

That would be fine.

Is it feasible to outline the initial segment that you don't want to be part of the coroutine, and then have coroutine splitting force that outlined function to be inlined into the ramp function? IIUC, you were saying that the splitting patch was difficult, but maybe thinking about it as outlining simplifies things. I know we had some nasty representational problems with the async lowering that we solved with outlining and force-inlining.

That's a good idea. I will think about it. Thanks!

lxfind updated this revision to Diff 333338.Mar 25 2021, 10:22 AM

Address comments, and fix all tests

This revision is now accepted and ready to land.Mar 25 2021, 1:39 PM
This revision was landed with ongoing or failed builds.Mar 25 2021, 1:46 PM
This revision was automatically updated to reflect the committed changes.