(I haven't updated tests yet)
A coroutine has the following structure in LLVM IR:
entry: alloca .. %promise = alloca ... %0 = call token @llvm.coro.id(..., %promise %1 = call i1 @llvm.coro.alloc(token %0) br i1 %1, label %coro.alloc, label %coro.init coro.alloc: %2 = call i64 @llvm.coro.size.i64() %call = call noalias nonnull i8* @_Znwm(i64 %2) br label %coro.init coro.init: ; preds = %coro.alloc, %entry %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ] %4 = call i8* @llvm.coro.begin(token %0, i8* %3) ... move parameters to stack alloca create promise object ... actual coroutine body ...
It uses coro.id to uniquely identify the coroutine (which also refers to the promise object), use coro.alloc to decide whether to create frame on the heap, and use coro.begin the mark the frame object.
After that, it always moves all parameters to stack (stored in allocas), create the promise object by calling its constructor.
Finally it emits the actual coroutine body code.
Having all of these in the same function creates problems: optimization passes blend the initialization code with the coroutine body and move code around, which latter violates some of the requirements by coroutines.
There are two examples:
- Frame objects accessed before coro.begin: coro.begin returns the frame pointer, that is, the frame is only ready after coro.begin. If any value is used across coroutine suspension and needs to be put on the frame, they need to be accessed through the frame instead of alloca. This is easy if the value is first accessed after coro.begin: we can just replace all their references by a pointer to the frame. However if a value is accessed before coro.begin, but also need to live on the frame, we are in trouble. D66230 made an initial attempt to fix this, but it wasn't complete. I made the fix more robust in D86859, which introduced a lot of complexity to AllocaUseVisitor. The basic idea is that we track every alloca use (both explicit and implicit through aliases) before coro.begin, and if they are touched we copy them into the frame after coro.begin. This is however not bullet-proof. If there exists complicated phi nodes, we may end up having to copy every single alloca to the frame. This patch separate the code before coro.begin and after coro.begin, making it impossible for optimization passes to mess around. There can be no complicated access to the frame before frame creation.
- Captured by-val parameter through MemCpyOptPass: https://bugs.llvm.org/show_bug.cgi?id=48857. To summarize the problem, in the coroutine IR, a first mem.copy copies a passed-by-value parameter to a local allloca, and latter (after a coroutine suspension) copies the local alloca to another local alloca. MemCpyOptPass merges them and turns the second copy to be copying directly from the parameter to the second local alloca. This will lead to crash because the passed-by-value parameter pointer would have died after the coroutine suspension. This patch separate the parameter copy code and the coroutine body, making this kind of optimizations impossible.
Overall, we want to split the coroutine as much as possible as early as possible to avoid any kind of violations of coroutine propertiers from optimization passes.
To split the coroutine early, this patch splits the coroutine right after parameter move during CoroEarly pass. Anything before remain in the original function (called init function), and the rest is put into a new function (called ramp function). It's done through 3 steps:
- In CGCoroutine, we need to emit a few new intrinsic instructions that CoroEarly can use to correctly split the function. First of all, the parameter move should only happen once in the init function. To achieve this effect, a new intrinsic coro.init() is created that returns a boolean value. It will return true in the init function while false in the ramp function. This allows us to control the behavior difference between init and ramp. Secondly, we need a marker that tells CoroEarly pass that the init function part is done, and the rest belongs to the ramp function. This is achieved by a new intrinsic coro.init.end(). This essentially marks the splitting point in CoroEarly split. Finally, every alloca that's storing the parameter copies will be annotated with metadata, indicating that they are parameters and will be used in the ramp function. The same thing is done to the promise object. These should be the only allocas that need to be used across init and ramp function. Such metadata will allow us to properly tag these values during CoroEarly pass, so that they won't be DCEed even though they may not be used in the rest of init function.
- In CoroEarly pass, if the coroutine is a switch-lowering coroutine (i.e. has a coro.id), we will split the coroutine. The split process works like this: we first go through every alloca that has metadata and generate calls to a new intrinsic coro.frame.get(). coro.frame.get() returns the pointer from the frame for the specific alloca. It captures the alloca, indicates whether it's a promise and has a unique ID. All uses of this alloca is then replaced with the value returned by coro.frame.get(); next the function is cloned into a new function (ramp function), which takes only one parameter, the coroutine frame. It then goes through all intrinsics in the new function, repalce coro.begin with the parameter, replace coro_init() with false, and also replaces coro_alloc() with false (only init function needs to alloca the frame); finally it goes through all intrinsics in the init function, replaces coro_init() with true, and generates a call to the ramp function at the location of coro.init.end(), which removes rest of the function.
- In CoroSplit pass, the idea is to inline the ramp function back to the init function, so that we can reuse the existing CoroSplit logic. To do so, we need to introduce a few more CORO_PRESPLIT_ATTR to tag the different states of the init and ramp function. When the ramp function is ready to split, and when we are processing the init function, we inline ramp function into the init function. To do so, we replace every coro.frame.get() with the original alloca. It also sets the promise field of coro.id to the promise object. After inlining, we update CGSCC and delete the ramp function.
Examples:
Coroutine IR emitted from the Clang front-end will look like this:
define @foo(i32 %val...) entry: %val1 = alloca i32, align 4, !coroutine_frame_alloca !2 ... %promise = alloca... !coroutine_frame_alloca !5 %0 = call token @llvm.coro.id(..., null... %1 = call i1 @llvm.coro.alloc(token %0) br i1 %1, label %coro.alloc, label %coro.begin coro.alloc: %2 = call i64 @llvm.coro.size.i64() %call = call noalias nonnull i8* @_Znwm(i64 %2) br label %coro.begin coro.begin: ; preds = %coro.alloc, %entry %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ] %4 = call i8* @llvm.coro.begin(token %0, i8* %3) %5 = call i1 @llvm.coro.init() br i1 %5, label %coro.init, label %coro.init.ready coro.init: ; preds = %coro.begin %6 = bitcast i32* %val1 to i8* %7 = load i32, i32* %val.addr, align 4 store i32 %7, i32* %val1, align 4 ... call void @llvm.coro.init.end() br label %coro.init.ready coro.init.ready: ... ... !2 = !{i1 false, i32 0} !5 = !{i1 true, i32 3} ...
CoroEarly will then split into two functions:
define @foo(i32 %val...) entry: %val1 = alloca i32, align 4 ... %promise = alloca... %0 = call token @llvm.coro.id(..., null... %1 = call i1 @llvm.coro.alloc(token %0) br i1 %1, label %coro.alloc, label %coro.begin coro.alloc: %2 = call i64 @llvm.coro.size.i64() %call = call noalias nonnull i8* @_Znwm(i64 %2) br label %coro.begin coro.begin: ; preds = %coro.alloc, %entry %3 = phi i8* [ null, %entry ], [ %call, %coro.alloc ] %4 = call i8* @llvm.coro.begin(token %0, i8* %3) %5 = bitcast i32* %val1 to i8* %6 = call i8* @llvm.coro.frame.get(i8* %4, i8* %5, i1 false, i32 0) %7 = bitcast i8* %6 to i32* ... br label %coro.init coro.init: ; preds = %coro.begin %17 = bitcast i32* %7 to i8* %18 = load i32, i32* %val.addr, align 4 store i32 %18, i32* %7, align 4 ... call void @_Z1fi8MoveOnly11MoveAndCopy.ramp(i8* %4) ret void define @foo.ramp(i8* %0) entry: %val1 = alloca i32, align 4 ... %promise = alloca... %1 = call token @llvm.coro.id(..., null... br label %coro.begin coro.begin: %2 = bitcast i32* %val1 to i8* %3 = call i8* @llvm.coro.frame.get(i8* %0, i8* %2, i1 false, i32 0) %4 = bitcast i8* %3 to i32* br label %coro.init.ready coro.init.ready: ...
Did this vector have been used?