Page MenuHomePhabricator

[RFC][Coroutine] Force stack allocation after await_suspend() call
AbandonedPublic

Authored by lxfind on Mar 15 2021, 10:36 AM.

Details

Summary

One of the challenges with the alloca analysis in CoroSplit is that in a few cases we need to make sure the allocas must be put on the stack, not on the frame.
One of the cases is symmetric transfer. Symmetric transfer is a newly introduced feature in C++ coroutines that allows for immediate transfer to a different coroutine when the current coroutine is suspended.
The await_suspend() call will return a coroutine handle type, and when that happens, the compiler should generate code to resume the returned handle. Like this:

coroutine_handle tmp = awaiter.await_suspend();
__builtin_coro_resume(tmp.address());

It's very common that after the call to await_suspend(), the current coroutine frame is already destroyed, which means we should not be accessing the coroutine frame from there.
And we shouldn't because we we use here is a temporary variable which will be short-lived. However in a debug build when we don't have lifetime intrinsics, it's very hard for the compiler to determine that tmp doesn't escape. This bug can be reproduced in this example: https://godbolt.org/z/KvPY66
It results in a TSAN failure because we are accessing the heap after it's destroyed.

There are two specific challenges here:

  1. If the address() function call is not inlined (this should be the default case with -O0), we will have a function call that takes tmp as a pointer. The compiler does not know that the address call will not capture. This will lead to tmp being put on the frame. We could potentially special handle the address function in either front-end or CoroSplit, but both are fragile (we will need to do some name pattern matching).
  2. If the address() function call is inlined (in some versions of libc++, address seems to have "always_inline" attribute), we will end up with a series of store/load instructions. For a naive analysis, a store of the pointer will also be treated as escape. To solve that problem, I introduced D91305, which tries to match this specific store/load pattern and be able to deal with it. It looks very hacky.

To solve this problem once for all, and provide a framework for solving similar problems in the future, this patch introduces 2 new intrinsics to mark a region where all data accessed must be put on the stack.
In the case of symmetric transfer, in order to be able to insert code during front-end codegen right after the await_suspend call, we need to split the Suspend subnode CoroutineSuspendExpr at await_suspend call, as the new AwaitSuspendCall subnode.
Then we create a OpaqueValueExpr to wrap around AwaitSuspendCall, and use it to continue build the rest of the Suspend subnode. OpaqueValueExpr is necessary because we don't want to emit the await_suspend call twice. OpaqueValueExpr serves as a stopper in codegen.
If there is no symmetric transfer, the new nodes will be nullptr.
After this patch, now right after the await_suspend() call, we will see a llvm.coro.forcestack.begin() intrinsic, and then right before coro.suspend(), we will see a llvm.coro.forcestack.end() intrinsic.
CoroSplit will then be able to use this information to decide whether some data must be put on the stack.
We are also able to remove the code that tries to match the special store/load instruction sequence.

Diff Detail

Unit TestsFailed

TimeTest
180 msx64 windows > Clang.CodeGenCoroutines::coro-symmetric-transfer-01.cpp
Script: -- : 'RUN: at line 1'; c:\ws\w16c2-2\llvm-project\premerge-checks\build\bin\clang.exe -cc1 -internal-isystem c:\ws\w16c2-2\llvm-project\premerge-checks\build\lib\clang\13.0.0\include -nostdsysteminc -fcoroutines-ts -std=c++14 -O0 -emit-llvm C:\ws\w16c2-2\llvm-project\premerge-checks\clang\test\CodeGenCoroutines\coro-symmetric-transfer-01.cpp -ast-dump | c:\ws\w16c2-2\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16c2-2\llvm-project\premerge-checks\clang\test\CodeGenCoroutines\coro-symmetric-transfer-01.cpp

Event Timeline

lxfind created this revision.Mar 15 2021, 10:36 AM
lxfind requested review of this revision.Mar 15 2021, 10:36 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 15 2021, 10:36 AM

It looks like there are two things this patch wants to do:

  1. Don't put the temporary generated by symmetric-transfer on the coroutine frame.
  2. Offer a mechanism to force some values (it is easy to extend Alloca to Value) to put in the stack instead of the coroutine frame.

I am a little confused about the first problem. Would it cause the program to crash? (e.g., we access the fields of coroutine frame after the frame gets destroyed). Or it just wastes some storage?
And I want to ask about the change of the AST nodes and SemaCoroutine. Can we know if a CoroutineSuspendExpr stands for a symmetric-transfer? If yes, it seems we can only do changes in CodeGen part.

Then I agree to introduce new intrinsic to hint the middle end to put some values on the stack. And the design of @llvm.coro.forcestack.begin() and @llvm.coro.forcestack.end() is a little strange to me. It says they mark a region where only data from the local stack can be accessed. But it looks error-prone since it is hard for the front-end to decide whether all the access of the region should be put on the stack. I think we could introduce only one intrinisic @llvm.coro.forcestack(Value* v), we can use the argument to mark the value need to be put on the stack.

And about the problem you mentioned in D96922: "The lifetime of %coro.gro" starts early and %coro.gro" would be used after coro.end (Possibly the destructor?) which would cause the program to access destroyed coroutine frame". It looks like the mechanism could solve this problem by a call to @llvm.coro.forcestack(%coro.gro).

clang/include/clang/AST/ExprCXX.h
4695

It looks strange for the change of CoroutineSuspendExpr at the first glance. It is easy to understand the coroutine suspend expression is consists of three parts: Ready, Suspend and resume. It is written in the language documentation. And the new added AwaitSuspendCall is confusing.

It looks like there are two things this patch wants to do:

  1. Don't put the temporary generated by symmetric-transfer on the coroutine frame.
  2. Offer a mechanism to force some values (it is easy to extend Alloca to Value) to put in the stack instead of the coroutine frame.

I am a little confused about the first problem. Would it cause the program to crash? (e.g., we access the fields of coroutine frame after the frame gets destroyed). Or it just wastes some storage?
And I want to ask about the change of the AST nodes and SemaCoroutine. Can we know if a CoroutineSuspendExpr stands for a symmetric-transfer? If yes, it seems we can only do changes in CodeGen part.

It will result in a crash, because we will be accessing memory that's already freed. If you run:

bin/clang -fcoroutines-ts -std=c++14 -stdlib=libc++ ../clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp -o - -emit-llvm -S -Xclang -disable-llvm-passes

You can see that in the final.suspend basic block, there are IRs like this:

  %call19 = call i8* @_ZN13detached_task12promise_type13final_awaiter13await_suspendENSt12experimental13coroutines_v116coroutine_handleIS0_EE(%"struct.detached_task::promise_type::final_awaiter"* nonnull dereferenceable(1) %ref.tm
p10, i8* %22) #2
  %coerce.dive20 = getelementptr inbounds %"struct.std::experimental::coroutines_v1::coroutine_handle.0", %"struct.std::experimental::coroutines_v1::coroutine_handle.0"* %coerce, i32 0, i32 0
  store i8* %call19, i8** %coerce.dive20, align 8
  %call21 = call i8* @_ZNKSt12experimental13coroutines_v116coroutine_handleIvE7addressEv(%"struct.std::experimental::coroutines_v1::coroutine_handle.0"* nonnull dereferenceable(8) %coerce) #2
  call void @llvm.coro.resume(i8* %call21)

The temporary variable %coerce will be put on the frame because it's used by the call to address function and LLVM thinks it may escape. But the call to await_suspend() (the first line) in reality could destroy the current coroutine frame. Hence after the call to await_suspend, it will be accessing the frame, leading to memory corruption.

Then I agree to introduce new intrinsic to hint the middle end to put some values on the stack. And the design of @llvm.coro.forcestack.begin() and @llvm.coro.forcestack.end() is a little strange to me. It says they mark a region where only data from the local stack can be accessed. But it looks error-prone since it is hard for the front-end to decide whether all the access of the region should be put on the stack. I think we could introduce only one intrinisic @llvm.coro.forcestack(Value* v), we can use the argument to mark the value need to be put on the stack.

This is a good idea. Let me play with it. Thanks!

And about the problem you mentioned in D96922: "The lifetime of %coro.gro" starts early and %coro.gro" would be used after coro.end (Possibly the destructor?) which would cause the program to access destroyed coroutine frame". It looks like the mechanism could solve this problem by a call to @llvm.coro.forcestack(%coro.gro).

lxfind added inline comments.Mar 16 2021, 12:18 PM
clang/include/clang/AST/ExprCXX.h
4695

I agree. But this seems to be the only way to break up Suspend at the point of await_suspend call so that we can insert instructions during CodeGen. Open to ideas though.

lxfind edited the summary of this revision. (Show Details)Mar 16 2021, 5:36 PM

I am a little confused about the first problem. Would it cause the program to crash? (e.g., we access the fields of coroutine frame after the frame gets destroyed). Or it just wastes some storage?

This is a repro of the crash (in TSAN mode): https://godbolt.org/z/KvPY66

lxfind added inline comments.Mar 16 2021, 5:50 PM
clang/include/clang/AST/ExprCXX.h
4695

One potential way to make this more clear is to rename these two nodes as: Suspend and Transfer.

ChuanqiXu added a comment.EditedMar 16 2021, 8:32 PM

I am a little confused about the first problem. Would it cause the program to crash? (e.g., we access the fields of coroutine frame after the frame gets destroyed). Or it just wastes some storage?

This is a repro of the crash (in TSAN mode): https://godbolt.org/z/KvPY66

Oh I got it. The program would crash since handle is destroyed in final_awaiter::await_suspend explicitly:

std::coroutine_handle<> await_suspend(std::coroutine_handle<promise_type> h) noexcept {
       h.destroy();
       return std::noop_coroutine();
}

And the normal symmetric transfer wouldn't destroy the handle (although it depends on the implementation of await_suspend).
So the problem met in this patch is program maybe crash with symmetric transfer with destroy coroutine handle explicitly (in the final awaiter normally) instead of normal symmetric transfer.

The explicitly destruction for the coroutine handle in the await_suspend of final awaiter is a normal pattern to enable the Coro-elide optimization. There is a discuss before in cafe-dev: http://clang-developers.42468.n3.nabble.com/Miscompilation-heap-use-after-free-in-C-coroutines-td4070320.html. It looks like it is a subsequent problem.

Here what I want to say is we shouldn't handle all the symmetric transfer from the above analysis. And we shouldn't change the ASTNodes and Sema part. We need to solve about the above pattern. It is not easy to give a solution since user could implement symmetric transfer in final awaiter without destroying the handle, which is more common.

My unfinished idea is to emit an intrinsic called @llvm.coro.finalize before we emit the promise_type::final_suspend. Then the @llvm.coro.finalize marks the end of the lifetime for current coroutine frame. And all the analysis in CoroFrame shouldn't consider use after @llvm.coro.finalize (We could emit warning for some cases). But this idea is also problematic, it makes the semantics of coroutine intrinsic more chaos. Just image that how a newbie feels when he see @llvm.coro.end, @llvm.coro.destroy and @llvm.coro.finalize. And we can't use @llvm.coro.end @llvm.coro.destroy since they have other semantics (llvm.coro.destroy means deletion and llvm.coro.end would be used to split the coroutine). Also, the idea of @llvm.coro.finalize seems available to solve the problem about%gro mentioned above.

It seems to be a workaround to use @llvm.coro.forcestack(%result_of_final_await_suspend) . Since I wondering if there are other corner cases as the %gro. My opinion about '@llvm.coro.forcestack' is that we could use it as a patch if we find any holes that is hard to handle immediately. But we also need to find a solution to solve problems more fundamentally.

Here what I want to say is we shouldn't handle all the symmetric transfer from the above analysis. And we shouldn't change the ASTNodes and Sema part. We need to solve about the above pattern. It is not easy to give a solution since user could implement symmetric transfer in final awaiter without destroying the handle, which is more common.

Just to clarify, in case there are any confusions around this. This patch would work no matter whether the coroutine frame is destroyed or not during await_suspend(). It simply makes sure that the temporary handle returned by await_suspend will be put in the stack instead of heap, and it will always be safe to do so, no matter what happens.
Whether or not the current coroutine frame would be destroyed completely depend on the implementation of await_suspend. So we cannot predict or know in advance. Therefore, the temporary handle returned by await_suspend must be put on the stack. I don't really see any other solutions other than this.

It seems to be a workaround to use @llvm.coro.forcestack(%result_of_final_await_suspend) . Since I wondering if there are other corner cases as the %gro. My opinion about '@llvm.coro.forcestack' is that we could use it as a patch if we find any holes that is hard to handle immediately. But we also need to find a solution to solve problems more fundamentally.

Yes as I mentioned in the description, there are really only two cases, one is after await_suspend call, and one is gro. gro is easy to handle and I will likely send a separate patch latter. But this problem with await_suspend is particularly challenging to solve.

What do you think is the fundamental problem, though?

lxfind added a comment.EditedMar 16 2021, 9:24 PM

Well, I guess another potential solution is to force emitting lifetime intrinsics for this part of coroutine in the front-end.
Like this:

diff --git a/clang/lib/CodeGen/CGDecl.cpp b/clang/lib/CodeGen/CGDecl.cpp
index 243d93a8c165..ef76e8dcb7c9 100644
--- a/clang/lib/CodeGen/CGDecl.cpp
+++ b/clang/lib/CodeGen/CGDecl.cpp
@@ -1317,7 +1317,7 @@ void CodeGenFunction::EmitAutoVarDecl(const VarDecl &D) {
 /// otherwise
 llvm::Value *CodeGenFunction::EmitLifetimeStart(uint64_t Size,
                                                 llvm::Value *Addr) {
-  if (!ShouldEmitLifetimeMarkers)
+  if (!ShouldEmitLifetimeMarkers && !isCoroutine())
     return nullptr;
 
   assert(Addr->getType()->getPointerAddressSpace() ==
diff --git a/clang/lib/CodeGen/CGExpr.cpp b/clang/lib/CodeGen/CGExpr.cpp
index 18f1468dcb86..2e6e6808db7f 100644
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -535,7 +535,7 @@ EmitMaterializeTemporaryExpr(const MaterializeTemporaryExpr *M) {
       break;
 
     case SD_FullExpression: {
-      if (!ShouldEmitLifetimeMarkers)
+      if (!ShouldEmitLifetimeMarkers && !isCoroutine())
         break;
 
       // Avoid creating a conditional cleanup just to hold an llvm.lifetime.end

What do you think is the fundamental problem, though?

It is hard to give a formal description for the problem. Let me try to explain it.
What I want to say here is about rules that decide whether a value should be put on the coroutine frame.
Initially, we put values on the frame for whose uses are crossing suspend points with their definition.
Then, we put values on the frame for whose uses are crossing suspend points with their definition and uses are not escaped.
In this patch, we want to put values on the frame for whose uses are crossing suspend points with their definition and uses are not escaped but except the result of symmetric transfer and %gro.
Then we need to answer the question: how can we prove that the result of symmetric transfer and %gro are the only exceptions from the above rules. Or how can we know the list of exceptions wouldn't get longer and longer in the future?

Then go back to the example in the summary. From my point of view, the key problem is that our escape analysis isn't powerful enough. I don't ask us to do excellent escape analysis. It may beyond our abilities. I just want to say how can we know the result of symmetric transfer and %gro are the only exceptions.

Whether or not the current coroutine frame would be destroyed completely depend on the implementation of await_suspend. So we cannot predict or know in advance. Therefore, the temporary handle returned by await_suspend must be put on the stack. I don't really see any other solutions other than this.

OK. Although the main stream implementation of await_suspend only destroy the coroutine handle in the final awaiter, the compiler can't assume the normal await_suspend won't destroy it. So I agree to guard the result of the await_suspend to make it put on the stack. At least, it would reduce the size of the coroutine frame.

Then if we want to put the result of the await_suspend in the stack, I think we can do it under CodeGen part only. It should be easy to judge the return type of await_suspend and create a call to llvm.coro.forcestack to the return value of await_suspend.

Well, I guess another potential solution is to force emitting lifetime intrinsics for this part of coroutine in the front-end.

I am not sure if this is a good idea. May it break the guide principle in LLVM? This need to be reviewed by others.

Then we need to answer the question: how can we prove that the result of symmetric transfer and %gro are the only exceptions from the above rules. Or how can we know the list of exceptions wouldn't get longer and longer in the future?

Then go back to the example in the summary. From my point of view, the key problem is that our escape analysis isn't powerful enough. I don't ask us to do excellent escape analysis. It may beyond our abilities. I just want to say how can we know the result of symmetric transfer and %gro are the only exceptions.

That's a fair point. I agree that we have no guarantee these are the only two cases.
It does seem to me that coroutine implementation somewhat relies on proper lifetime markers so that data are being put correctly, which may be the fundamental problem we are trying to solve.

Whether or not the current coroutine frame would be destroyed completely depend on the implementation of await_suspend. So we cannot predict or know in advance. Therefore, the temporary handle returned by await_suspend must be put on the stack. I don't really see any other solutions other than this.

OK. Although the main stream implementation of await_suspend only destroy the coroutine handle in the final awaiter, the compiler can't assume the normal await_suspend won't destroy it. So I agree to guard the result of the await_suspend to make it put on the stack. At least, it would reduce the size of the coroutine frame.

Then if we want to put the result of the await_suspend in the stack, I think we can do it under CodeGen part only. It should be easy to judge the return type of await_suspend and create a call to llvm.coro.forcestack to the return value of await_suspend.

Well, I guess another potential solution is to force emitting lifetime intrinsics for this part of coroutine in the front-end.

I am not sure if this is a good idea. May it break the guide principle in LLVM? This need to be reviewed by others.

Then if we want to put the result of the await_suspend in the stack, I think we can do it under CodeGen part only. It should be easy to judge the return type of await_suspend and create a call to llvm.coro.forcestack to the return value of await_suspend.

We probably could, but it would be very very tedious.
During CodeGen, we only have the AST that's calling __builtin_coro_resume, which we will call Emit as a whole.
So we need to manually match the AST 2 levels down to find the await_suspend call, get its name, and then walk through the emitted IR to find a call with the same name, and then find the tmp that's used to store the return value of the call, and then emit llvm.coro.forcestack.

That's a fair point. I agree that we have no guarantee these are the only two cases.
It does seem to me that coroutine implementation somewhat relies on proper lifetime markers so that data are being put correctly, which may be the fundamental problem we are trying to solve.

It is hard to prove it. This topic need more discuss and more folks get involved. But it is really valuable. I can't remember how many patches we had to judge whether values should be put on the coroutine frame. I am OK to emit lifetime markers even at O0. But I think you need to ask for other's opinion.

We probably could, but it would be very very tedious.
During CodeGen, we only have the AST that's calling __builtin_coro_resume, which we will call Emit as a whole.
So we need to manually match the AST 2 levels down to find the await_suspend call, get its name, and then walk through the emitted IR to find a call with the same name, and then find the tmp that's used to store the return value of the call, and then emit llvm.coro.forcestack.

Can't we did as inline comments?

clang/lib/CodeGen/CGCoroutine.cpp
221

can we rewrite it into:

else if (SuspendRet != nullptr && SuspendRet->getType()->isClassType()) {
     // generate:
     // llvm.coro.forcestack(SuspendRet)
}
ChuanqiXu added inline comments.Mar 16 2021, 11:03 PM
clang/lib/CodeGen/CGCoroutine.cpp
221

Sorry I find we can't did it directly. As you said, we need to traverse down SuspendRet. And I still think we should did it only at CodeGen part since it looks not so hard. I guess we could make it in above 10~15 lines of codes.

Can't we did as inline comments?

No, because it would have already been too late. SuspendExpr returns the result of __builtin_coro_resume(awaiter.await_suspend().address()), which is different from the result of awaiter.await_suspend().
We need to be able to control the placement of awaiter.await_suspend(), which is why I had to break up the AST at that boundary.

lxfind added inline comments.Mar 16 2021, 11:09 PM
clang/lib/CodeGen/CGCoroutine.cpp
221

Traversing down AST isn't the hard part. The hard part is to search the emitted IR, and look for the temporary alloca used to store the returned handle.

ChuanqiXu added inline comments.Mar 16 2021, 11:17 PM
clang/lib/CodeGen/CGCoroutine.cpp
221

Yes, I get your point. If we want to traverse the emitted IR, we could only search for the use-chain backward, which is also very odd. Let's see if there is other ways to modify the ASTNodes to make it more naturally.

bruno added a comment.Mar 17 2021, 9:08 PM

Hi Xun, great to see more improvements in this area.

clang/lib/CodeGen/CGCoroutine.cpp
221

I'm curious whether did you consider annotating instructions with some new custom metadata instead of using intrinsics? If so, what would be the tradeoff? For example, if you could conditionally attach metadata some "begin" metadata here:

auto *SaveCall = Builder.CreateCall(CoroSave, {NullPtr});

and "end" metadata here:

auto *SuspendResult = Builder.CreateCall(CoroSuspend, {SaveCall, Builder.getInt1(IsFinalSuspend)});

clang/test/CodeGenCoroutines/coro-symmetric-transfer-01.cpp
53

Nice tests. The codegen should live in a different file from the AST dump one, you can put the later in test/clang/SemaCXX or tes/clang/AST.

llvm/include/llvm/IR/Intrinsics.td
1308

This change seems unrelated to this patch.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
2083

collectForceStacks is only called once from a function that already traverses all instructions, can you take advantage of that to collect llvm::Intrinsic::coro_forcestack_begin/end?

2085

Do such intrinsics never get removed? What happens when this hits a backend?

Well, I guess another potential solution is to force emitting lifetime intrinsics for this part of coroutine in the front-end.
Like this:

diff --git a/clang/lib/CodeGen/CGDecl.cpp b/clang/lib/CodeGen/CGDecl.cpp
index 243d93a8c165..ef76e8dcb7c9 100644
--- a/clang/lib/CodeGen/CGDecl.cpp
+++ b/clang/lib/CodeGen/CGDecl.cpp
@@ -1317,7 +1317,7 @@ void CodeGenFunction::EmitAutoVarDecl(const VarDecl &D) {
 /// otherwise
 llvm::Value *CodeGenFunction::EmitLifetimeStart(uint64_t Size,
                                                 llvm::Value *Addr) {
-  if (!ShouldEmitLifetimeMarkers)
+  if (!ShouldEmitLifetimeMarkers && !isCoroutine())
     return nullptr;
 
   assert(Addr->getType()->getPointerAddressSpace() ==
diff --git a/clang/lib/CodeGen/CGExpr.cpp b/clang/lib/CodeGen/CGExpr.cpp
index 18f1468dcb86..2e6e6808db7f 100644
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -535,7 +535,7 @@ EmitMaterializeTemporaryExpr(const MaterializeTemporaryExpr *M) {
       break;
 
     case SD_FullExpression: {
-      if (!ShouldEmitLifetimeMarkers)
+      if (!ShouldEmitLifetimeMarkers && !isCoroutine())
         break;
 
       // Avoid creating a conditional cleanup just to hold an llvm.lifetime.end

We have already allowed to emit lifetime intrinsics for always inlined function under O2, so IMOO emitting lifetime intrinsics for coroutine function is OK since stack coloring has less effect on coroutine function.

@bruno Thanks for the review!

clang/lib/CodeGen/CGCoroutine.cpp
221

The "end" part could probably be done through metadata. But I'm not sure how to do it for the "begin" part. The "begin" part needs to happen after the emission of S.getAwaitSuspendCallExpr().

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
2085

They are added to the list of DeadInstructions after collected. So they will all be removed at the end of the pass.

lxfind abandoned this revision.Mar 25 2021, 10:25 AM

Abandoning in favor of D99227