This is an archive of the discontinued LLVM Phabricator instance.

[Coroutines] Split coroutine during CoroEarly into an init and ramp function
Changes PlannedPublic

Authored by lxfind on Apr 13 2021, 3:26 PM.

Details

Summary

(I haven't updated tests yet)

A coroutine has the following structure in LLVM IR:

entry:
  alloca ..
  %promise = alloca ...
  %0 = call token @llvm.coro.id(..., %promise
  %1 = call i1 @llvm.coro.alloc(token %0)
  br i1 %1, label %coro.alloc, label %coro.init

coro.alloc:
  %2 = call i64 @llvm.coro.size.i64()
  %call = call noalias nonnull i8* @_Znwm(i64 %2)
  br label %coro.init

coro.init:                                       ; preds = %coro.alloc, %entry
  %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ]
  %4 = call i8* @llvm.coro.begin(token %0, i8* %3)
  ...
  move parameters to stack alloca
  create promise object
  ...
  actual coroutine body
  ...

It uses coro.id to uniquely identify the coroutine (which also refers to the promise object), use coro.alloc to decide whether to create frame on the heap, and use coro.begin the mark the frame object.
After that, it always moves all parameters to stack (stored in allocas), create the promise object by calling its constructor.
Finally it emits the actual coroutine body code.

Having all of these in the same function creates problems: optimization passes blend the initialization code with the coroutine body and move code around, which latter violates some of the requirements by coroutines.
There are two examples:

  1. Frame objects accessed before coro.begin: coro.begin returns the frame pointer, that is, the frame is only ready after coro.begin. If any value is used across coroutine suspension and needs to be put on the frame, they need to be accessed through the frame instead of alloca. This is easy if the value is first accessed after coro.begin: we can just replace all their references by a pointer to the frame. However if a value is accessed before coro.begin, but also need to live on the frame, we are in trouble. D66230 made an initial attempt to fix this, but it wasn't complete. I made the fix more robust in D86859, which introduced a lot of complexity to AllocaUseVisitor. The basic idea is that we track every alloca use (both explicit and implicit through aliases) before coro.begin, and if they are touched we copy them into the frame after coro.begin. This is however not bullet-proof. If there exists complicated phi nodes, we may end up having to copy every single alloca to the frame. This patch separate the code before coro.begin and after coro.begin, making it impossible for optimization passes to mess around. There can be no complicated access to the frame before frame creation.
  2. Captured by-val parameter through MemCpyOptPass: https://bugs.llvm.org/show_bug.cgi?id=48857. To summarize the problem, in the coroutine IR, a first mem.copy copies a passed-by-value parameter to a local allloca, and latter (after a coroutine suspension) copies the local alloca to another local alloca. MemCpyOptPass merges them and turns the second copy to be copying directly from the parameter to the second local alloca. This will lead to crash because the passed-by-value parameter pointer would have died after the coroutine suspension. This patch separate the parameter copy code and the coroutine body, making this kind of optimizations impossible.

Overall, we want to split the coroutine as much as possible as early as possible to avoid any kind of violations of coroutine propertiers from optimization passes.

To split the coroutine early, this patch splits the coroutine right after parameter move during CoroEarly pass. Anything before remain in the original function (called init function), and the rest is put into a new function (called ramp function). It's done through 3 steps:

  1. In CGCoroutine, we need to emit a few new intrinsic instructions that CoroEarly can use to correctly split the function. First of all, the parameter move should only happen once in the init function. To achieve this effect, a new intrinsic coro.init() is created that returns a boolean value. It will return true in the init function while false in the ramp function. This allows us to control the behavior difference between init and ramp. Secondly, we need a marker that tells CoroEarly pass that the init function part is done, and the rest belongs to the ramp function. This is achieved by a new intrinsic coro.init.end(). This essentially marks the splitting point in CoroEarly split. Finally, every alloca that's storing the parameter copies will be annotated with metadata, indicating that they are parameters and will be used in the ramp function. The same thing is done to the promise object. These should be the only allocas that need to be used across init and ramp function. Such metadata will allow us to properly tag these values during CoroEarly pass, so that they won't be DCEed even though they may not be used in the rest of init function.
  1. In CoroEarly pass, if the coroutine is a switch-lowering coroutine (i.e. has a coro.id), we will split the coroutine. The split process works like this: we first go through every alloca that has metadata and generate calls to a new intrinsic coro.frame.get(). coro.frame.get() returns the pointer from the frame for the specific alloca. It captures the alloca, indicates whether it's a promise and has a unique ID. All uses of this alloca is then replaced with the value returned by coro.frame.get(); next the function is cloned into a new function (ramp function), which takes only one parameter, the coroutine frame. It then goes through all intrinsics in the new function, repalce coro.begin with the parameter, replace coro_init() with false, and also replaces coro_alloc() with false (only init function needs to alloca the frame); finally it goes through all intrinsics in the init function, replaces coro_init() with true, and generates a call to the ramp function at the location of coro.init.end(), which removes rest of the function.
  1. In CoroSplit pass, the idea is to inline the ramp function back to the init function, so that we can reuse the existing CoroSplit logic. To do so, we need to introduce a few more CORO_PRESPLIT_ATTR to tag the different states of the init and ramp function. When the ramp function is ready to split, and when we are processing the init function, we inline ramp function into the init function. To do so, we replace every coro.frame.get() with the original alloca. It also sets the promise field of coro.id to the promise object. After inlining, we update CGSCC and delete the ramp function.

Examples:

Coroutine IR emitted from the Clang front-end will look like this:

define @foo(i32 %val...)
entry:
  %val1 = alloca i32, align 4, !coroutine_frame_alloca !2
  ...
  %promise = alloca... !coroutine_frame_alloca !5
  %0 = call token @llvm.coro.id(..., null...
  %1 = call i1 @llvm.coro.alloc(token %0)
  br i1 %1, label %coro.alloc, label %coro.begin

coro.alloc:
  %2 = call i64 @llvm.coro.size.i64()
  %call = call noalias nonnull i8* @_Znwm(i64 %2)
  br label %coro.begin

coro.begin:                                       ; preds = %coro.alloc, %entry
  %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ]
  %4 = call i8* @llvm.coro.begin(token %0, i8* %3)
  %5 = call i1 @llvm.coro.init()
  br i1 %5, label %coro.init, label %coro.init.ready

coro.init:                                        ; preds = %coro.begin
  %6 = bitcast i32* %val1 to i8*
  %7 = load i32, i32* %val.addr, align 4
  store i32 %7, i32* %val1, align 4
  ...
  call void @llvm.coro.init.end()
  br label %coro.init.ready

coro.init.ready:
  ...

...
!2 = !{i1 false, i32 0}
!5 = !{i1 true, i32 3}
...

CoroEarly will then split into two functions:

define @foo(i32 %val...)
entry:
  %val1 = alloca i32, align 4
  ...
  %promise = alloca...
  %0 = call token @llvm.coro.id(..., null...
  %1 = call i1 @llvm.coro.alloc(token %0)
  br i1 %1, label %coro.alloc, label %coro.begin

coro.alloc:
  %2 = call i64 @llvm.coro.size.i64()
  %call = call noalias nonnull i8* @_Znwm(i64 %2)
  br label %coro.begin

coro.begin:                                       ; preds = %coro.alloc, %entry
  %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ]
  %4 = call i8* @llvm.coro.begin(token %0, i8* %3)
  %5 = bitcast i32* %val1 to i8*
  %6 = call i8* @llvm.coro.frame.get(i8* %4, i8* %5, i1 false, i32 0)
  %7 = bitcast i8* %6 to i32*
  ...
  br label %coro.init

coro.init:                                        ; preds = %coro.begin
  %17 = bitcast i32* %7 to i8*
  %18 = load i32, i32* %val.addr, align 4
  store i32 %18, i32* %7, align 4
  ...
  call void @_Z1fi8MoveOnly11MoveAndCopy.ramp(i8* %4)
  ret void

define @foo.ramp(i8* %0)
entry:
  %val1 = alloca i32, align 4
  ...
  %promise = alloca...
  %1 = call token @llvm.coro.id(..., null...
  br label %coro.begin

coro.begin:
  %2 = bitcast i32* %val1 to i8*
  %3 = call i8* @llvm.coro.frame.get(i8* %0, i8* %2, i1 false, i32 0)
  %4 = bitcast i8* %3 to i32*
  br label %coro.init.ready

coro.init.ready:
  ...

Diff Detail

Event Timeline

lxfind created this revision.Apr 13 2021, 3:26 PM
lxfind requested review of this revision.Apr 13 2021, 3:26 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 13 2021, 3:26 PM
lxfind edited the summary of this revision. (Show Details)
lxfind updated this revision to Diff 337290.Apr 13 2021, 5:17 PM

some cleanups

The overall idea looks good to me. Since this is a fundamental and large change, I need time to run it actually and look into details.
Noticed that this patch introduces new intrinsics and new concept init function, it may be needed to add them in the coroutines document.

It looks like this code may trigger assertion in CoroInstr.h for CoroIdInst::setCoroutineSelf.

Also, since this patch would enlarge the coroutine frame, it may affect the performance naturally. I believe it wouldn't really matter. I just find that we need coroutine benchmarks which seems missing now. It is really needed when we talk about the performance for coroutine. Although our intuition tell us it should be ok, we still need some data to approve us instead of imaging. I am not asking for benchmark or score for this patch. I just find there is no benchmark we can evaluate the performance for coroutine. I believe it would be a future direction.

Then, since there is hidden chance to decrease the performance, I wonder is it better to give an option to control how coroutine handle the parameters. We can use strategy in this patch by default. An option could help us in tuning the performance in the future.

I would try to look into the details for the code.

clang/lib/CodeGen/CGCoroutine.cpp
619

Did this vector have been used?

638

I wonder if it is better to document the metadata coroutine_frame_alloca in somewhere like the metadata tbaa.

646

It calls coro.init.end without calling coro.init in the front which looks odd.

llvm/lib/Transforms/Coroutines/CoroEarly.cpp
153

We need comment for the intention of the function.

156

It looks odd for several {} in Function to avoid name collision.

172
- VoidPt
+ VoidPtr
176

We need to document the semantics for coro.frame.get

193

Noticed that this patch deletes F.addFnAttr(CORO_PRESPLIT_ATTR, UNPREPARED_FOR_SPLIT); below, is it conflicting with D100282 . I want to know if we still ned to add Noinline attribute once D100282 checked in.

218

Why do we need to replace coro.alloc with 0 now?
Replace coro.alloc with 0 implies we should allocate the frame in the stack. I think we can't know how should we allocate the frame now.

335

Should we give a another name for splitRampFunction? It may be surprising to see split in Coro-early pass instead of Coro-split pass.
BTW, how do you think about create the ramp function in the CodeGen process of frontend?

@ChuanqiXu Thank you for the detailed review! Really appreciate it.
I agree we should create a coroutine benchmark at some point, ideally some realistic production-code driven benchmark. We can work on that in the future. For this patch, it's probably not worth it to hide it behind an option, for two reasons: 1) it would be extremely complicated, 2) most parameters would end up on the frame anyway 3) this patch actually doesn't force parameters to be put on the frame. Before frame creation, all the parameters are put back to allocas, the current alloca analysis and optimization still applies to them. So some parameters may actually end up not put on the frame. So I wouldn't expect this to increase frame size in most cases.

I will add documentation latter once the we all agree on the high-level idea/direction of this patch.

clang/lib/CodeGen/CGCoroutine.cpp
646

This path is conditionally guarded by coro.init alrady.

llvm/lib/Transforms/Coroutines/CoroEarly.cpp
193

Good question. For now they are somewhat redundant. We probably don't need to add NoInline here.

218

This is replacing it in the NewF (the cloned new ramp function). We only need to allocate the frame once, which will be done in the init function. So in the ramp function we can always skip it.

335

I thought about doing it in CodeGen. But it's really complicated to split functions in CodeGen.

@ChuanqiXu Thank you for the detailed review! Really appreciate it.
I agree we should create a coroutine benchmark at some point, ideally some realistic production-code driven benchmark. We can work on that in the future. For this patch, it's probably not worth it to hide it behind an option, for two reasons: 1) it would be extremely complicated, 2) most parameters would end up on the frame anyway 3) this patch actually doesn't force parameters to be put on the frame. Before frame creation, all the parameters are put back to allocas, the current alloca analysis and optimization still applies to them. So some parameters may actually end up not put on the frame. So I wouldn't expect this to increase frame size in most cases.

I will add documentation latter once the we all agree on the high-level idea/direction of this patch.

Thanks for the disclaimer. Although I am not familiar with many details in this patch, the high-level idea looks good to me.

llvm/lib/Transforms/Coroutines/CoroSplit.cpp
2231

I am not familiar with the policy in LLVM that how should we treat LegacyPass in trunk. I mean, are we responsible to update the LegacyPassManager?

lxfind added inline comments.Apr 18 2021, 10:43 AM
llvm/lib/Transforms/Coroutines/CoroSplit.cpp
2231

Yes I think so. I will deal with the legacypass latter.

lxfind planned changes to this revision.Apr 20 2021, 4:29 PM

Plan to add documentation, fix Legacy pass and address comments.

ychen added a subscriber: ychen.Apr 27 2021, 11:10 AM