This is an archive of the discontinued LLVM Phabricator instance.

[coroutine] Fixes "cannot move instruction since its users are not dominated by CoroBegin" problem.
ClosedPublic

Authored by GorNishanov on Aug 14 2019, 10:14 AM.

Details

Summary

Fixes https://bugs.llvm.org/show_bug.cgi?id=36578 and https://bugs.llvm.org/show_bug.cgi?id=36296.
Supersedes: https://reviews.llvm.org/D55966

One of the fundamental transformations that the CoroSplit pass performs before splitting the coroutine is to find which values need to survive between suspend and resume and to provide a slot for them in the coroutine frame, so the values can be spilled and restored as needed.
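For example (a minimal sketch; @compute and @consume are made-up helpers), a value that is live across a suspend point must be given a slot in the frame:

  %x = call i32 @compute()                        ; computed in the ramp
  %sp = call i8 @llvm.coro.suspend(token none, i1 false)
  switch i8 %sp, label %suspend [i8 0, label %resume
                                 i8 1, label %cleanup]
resume:
  call void @consume(i32 %x)                      ; %x is live across the suspend,
                                                  ; so CoroSplit spills it into the frame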

The coroutine frame becomes available once the storage for it has been allocated; that point is marked in the pre-split coroutine with the llvm.coro.begin intrinsic.

The FE normally places all of the user-authored code that accesses those values after llvm.coro.begin; however, instructions accessing them can sometimes end up prior to coro.begin. Examples are the FE-generated store of a parameter into its alloca, or instructions added by optimization passes such as SROA when it rewrites allocas.

Prior to this change, the CoroSplit pass would try to move instructions that may end up accessing values in the coroutine frame to a point after CoroBegin. However, it would run into problems (report_fatal_error) if some of the values were used both in the allocation code (for example, when an allocator is passed as a parameter to a coroutine) and in the user-authored body of the coroutine.
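For illustration, here is the shape of IR that used to trigger the report_fatal_error (a hypothetical sketch; %AllocatorTy, @my_allocate and @use_allocator are made up): the alloca has a user before coro.begin (the allocation call), so it cannot simply be moved after coro.begin.

  %alloc.args = alloca %AllocatorTy               ; holds the allocator parameter
  ; ... initialize %alloc.args from the coroutine's allocator parameter ...
  %mem = call i8* @my_allocate(%AllocatorTy* %alloc.args, i32 %size)
  %f = call i8* @llvm.coro.begin(token %id, i8* %mem)
  call void @use_allocator(%AllocatorTy* %alloc.args)   ; user-authored body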

To handle this case and to simplify the instruction-moving logic, this change removes instruction moving altogether. Instead, we rewrite only those uses of the spilled values that are dominated by coro.begin and leave the other instructions intact.

Before:

%var = alloca i32
%1 = getelementptr .. %var ; will move this one after coro.begin
%f = call i8* @llvm.coro.begin(

After:

%var = alloca i32
%1 = getelementptr .. %var ; stays put
%f = call i8* @llvm.coro.begin(

If we discover that there is a potential write into an alloca prior to coro.begin, we copy its value from the alloca into the spill slot in the coroutine frame right after coro.begin.

Before:

%var = alloca i32
store .. %var ; will move this one after coro.begin
%f = call i8* @llvm.coro.begin(

After:

%var = alloca i32
store .. %var ; stays put
%f = call i8* @llvm.coro.begin(
%tmp = load %var
store %tmp, %spill.slot.for.var

Note: This change does not handle array allocas, as the C++ FE does not produce them, but support can be added in the future if the need arises.
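To make the rewrite concrete, here is a rough sketch of the resulting IR; the frame type %foo.Frame and the field index are hypothetical, since the actual layout is computed by CoroFrame:

  ; assume %foo.Frame has an i32 field at index 2 that serves as the spill slot for %var
  %var = alloca i32
  store i32 0, i32* %var                          ; pre-coro.begin write, stays put
  %f = call i8* @llvm.coro.begin(token %id, i8* %alloc)
  %FramePtr = bitcast i8* %f to %foo.Frame*
  %var.spill.addr = getelementptr inbounds %foo.Frame, %foo.Frame* %FramePtr, i32 0, i32 2
  %tmp = load i32, i32* %var                      ; copy the possibly written value
  store i32 %tmp, i32* %var.spill.addr            ; into its spill slot in the frame
  ; uses of %var dominated by coro.begin are redirected to %var.spill.addr;
  ; uses before coro.begin keep referring to the original alloca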

Diff Detail

Repository
rL LLVM

Event Timeline

GorNishanov created this revision. Aug 14 2019, 10:14 AM
GorNishanov edited the summary of this revision. Aug 14 2019, 10:28 AM
modocache accepted this revision. Aug 14 2019, 12:24 PM

Thanks for looking into this, @GorNishanov! LGTM aside from a tiny nit-pick.

lib/Transforms/Coroutines/CoroFrame.cpp
476 (On Diff #215156)

A nit-pick, but: it seems like most other variable names in this function are capitalized (e.g.: PtrI), but this one is a lowercase v. Is there a reason they're not consistent?

This revision is now accepted and ready to land. Aug 14 2019, 12:24 PM

Thank you very much for the review, Brian!
Nit addressed. Preparing to land.

GorNishanov added a reviewer: rjmccall.
GorNishanov marked an inline comment as done.

Merged with the latest CoroFrame changes from @rjmccall.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Aug 14 2019, 5:51 PM

Seems reasonable to me. If we're no longer imposing any *restrictions* for @llvm.coro.begin, is it serving any real purpose? Is it just a way of getting a handle to the coroutine frame?

I think the original reason for coro.begin is intact. LLVM's Coroutines.rst says: "Depending on the alignment requirements of the objects in the coroutine frame and/or on the codegen compactness reasons the pointer returned from coro.begin may be at offset to the %mem argument."

The C++ FE obtains memory for the coroutine frame in some way and then gives it to coro.begin, saying: "Here. Use this for the coroutine frame." Lowering of coro.begin may simply return the same pointer, or it may do some adjustment to it. One possible adjustment: if there are any over-aligned variables whose alignment is larger than what the allocator (supplied to coro.id) guarantees, coro.begin would do (NOT YET IMPLEMENTED) a dynamic adjustment to make sure everything is properly aligned.
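Purely as an illustration of that (not yet implemented) idea, with made-up alignment constants, such a dynamic adjustment could conceptually look like this:

  ; Illustration only: suppose the frame needs 32-byte alignment but the
  ; allocator only guarantees 16; round %mem up to the next 32-byte boundary
  ; (the allocation itself would also have to be padded accordingly).
  %raw   = ptrtoint i8* %mem to i64
  %bump  = add i64 %raw, 31
  %fixed = and i64 %bump, -32
  %frame = inttoptr i64 %fixed to i8*             ; what coro.begin would return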

That reminded me: there was a brief discussion of a common C++ coroutine ABI at Cologne, and I don't think the results were shared with the interested parties. I'll send an e-mail shortly.

Is there a case you're imagining where the adjustment would have side-effects? Because I can see a reason to have an intrinsic that returns a frame pointer, but I don't see why that intrinsic would have any of the restrictions of @llvm.coro.begin.

Which restrictions are you thinking about?
Marking it as NoDuplicate in CoroEarly helps simplify the CoroSplit logic.
If you are thinking about the attributes on the intrinsic itself, those probably can be relaxed.

Which restrictions are you thinking about?

"A frontend should emit exactly one coro.begin intrinsic per coroutine."

Marking it as NoDuplicate in CoroEarly helps simplify the CoroSplit logic.

Does it? Why? There already has to be a single coro.id in the coroutine, and that's the intrinsic that takes useful information. coro.begin just has a bunch of requirements and no clear purpose except to return the frame, which could just as well be done with a duplicable intrinsic. Frame allocation has to happen in the ramp prologue anyway, and the underlying observation of this particular patch is that trying to move coro.begin to reflect its logical order in the function is just a lot of complication for no clear purpose. And honestly the frame pointer is not usually a useful thing to expose in a coroutine representation; aspects of frame layout being exposed in the ABI is a property of the switch lowering, not coroutines in general.

John.

aspects of frame layout being exposed in the ABI is a property of the switch lowering, not coroutines in general.

You may be right. Given that we now have two frontends targeting LLVM Coroutines, some refactoring may be in order. I need to study the Swift approach more before I can form an opinion.

Marking it as NoDuplicate in CoroEarly helps simplify the CoroSplit logic.

Does it? Why? There already has to be a single coro.id in the coroutine, and that's the intrinsic that takes useful information. coro.begin just has a bunch of requirements and no clear purpose except to return the frame, which could just as well be done with a duplicable intrinsic.

Here is how it looks to me: the C++ frontend emits the following structure (simplified):

%id = coro.id(stuff)
%mem = SomeAllocCodeCouldBeAnything(stuff)
%frame = coro.begin(%mem)

coro.begin "blesses" the memory as the one to be used for the coroutine frame; it therefore should dominate any possible uses of data that go into the coroutine frame, and it gives a convenient place to dump spills, copies, etc.

If we did not allow arbitrary allocation logic in C++, there would be no need for coro.begin at all.
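For reference, a slightly fuller version of that structure, along the lines of the example in Coroutines.rst (with @malloc standing in for whatever allocation logic the frontend emits), looks roughly like this:

entry:
  %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
  %need.alloc = call i1 @llvm.coro.alloc(token %id)
  br i1 %need.alloc, label %dyn.alloc, label %coro.start

dyn.alloc:
  %size = call i32 @llvm.coro.size.i32()
  %mem = call i8* @malloc(i32 %size)              ; arbitrary user allocation logic goes here
  br label %coro.start

coro.start:
  %phi = phi i8* [ null, %entry ], [ %mem, %dyn.alloc ]
  %frame = call i8* @llvm.coro.begin(token %id, i8* %phi)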

Gor

Ah, okay, I see. Yes, if there has to be inlined code to do the allocation, then something like coro.begin does seem necessary; although of course the danger is that that arbitrary code — if nothing else, after inlining — might do something that we naively think needs the coroutine frame. For example, if the allocation called a user-defined allocation function, and we inlined that call before splitting the coroutine, and the inlined code contained an alloca (scoped to the call, of course, but we haven't taught CoroFrame to optimize based on alloca lifetimes yet), that would presumably cause serious problems for lowering because the alloca would have uses prior to coro.begin.

How arbitrary can allocation really be? If it can be emitted in a separate function call that can be provided abstractly to coro.id, then coroutine lowering can just insert a call to that function (or whatever more complicated pattern is necessary to enable static elimination of the allocation) and then trigger further optimization/inlining as necessary to optimize that new call. If we need to handle exceptions out of it, then maybe we can make coro.id non-nounwind. (We can almost get away with just saying "allocation is the first thing the function does, so it never happens in an interesting EH context." Unfortunately, there are some features/ABIs that require parameters to be destroyed in the callee, which can mean that everything in the function is in an interesting EH context, even generated code in the prologue.)

lxfind added a subscriber: lxfind. Edited Aug 30 2020, 9:54 AM

@GorNishanov This fix seems problematic.
Consider this code:

%var = alloca i32
%1 = getelementptr .. %var; stays put
%f = call i8* @llvm.coro.begin
store ... %1

After this fix, %1 now stays put; however, if a store through it happens after coro.begin and hence modifies the contents, that modification will not be reflected in the coroutine frame (and will eventually be DCE'd).
To generalize the problem: if any aliasing pointer to an alloca is created before coro.begin, and that pointer is later written through after coro.begin, it will lead to incorrect behavior.
I wonder what would be a correct fix?

Also, there seem to be a few other minor problems with this fix. For instance, in AllocaUseVisitor we only check escape and store instructions, which is insufficient to cover all potential writes: a call instruction can be non-escaping yet still modify the contents of the pointer (e.g. llvm.memcpy). Also, in AllocaUseVisitor::visit we check DT.dominates(&I, &CoroBegin), which should really be !DT.dominates(&CoroBegin, &I).

Overall, I find it difficult to patch this change to make it correct; a fundamental rewrite of this part of the algorithm seems necessary. I would be happy to look into it, but would like to hear your opinion first, @GorNishanov.

A full repro IR of the issue for reference:

%"struct_foo" = type <{ i64, i64, [8 x i8] }>

define i8* @foo(%"struct_foo"* byval(%"struct_foo") align 8 %arg) "coroutine.presplit"="1" {
entry:
  %local = alloca [24 x i8], align 8
  %local.sub = getelementptr inbounds [24 x i8], [24 x i8]* %local, i64 0, i64 0
  %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
  %size = call i32 @llvm.coro.size.i32()
  %alloc = call i8* @myAlloc(i32 %size)
  %hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)
  %arg.addr = bitcast %"struct_foo"* %arg to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(24) %local.sub, i8* nonnull align 8 dereferenceable(24) %arg.addr, i64 24, i1 false)
  %0 = call i8 @llvm.coro.suspend(token none, i1 false)
  switch i8 %0, label %suspend [i8 0, label %resume
                                i8 1, label %cleanup]
resume:
  call void @print2([24 x i8]* %local)
  br label %cleanup

cleanup:
  %mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
  call void @free(i8* %mem)
  br label %suspend
suspend:
  call i1 @llvm.coro.end(i8* %hdl, i1 0)
  ret i8* %hdl
}

declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg)

declare i8* @llvm.coro.free(token, i8*)
declare i32 @llvm.coro.size.i32()
declare i8  @llvm.coro.suspend(token, i1)
declare void @llvm.coro.resume(i8*)
declare void @llvm.coro.destroy(i8*)

declare token @llvm.coro.id(i32, i8*, i8*, i8*)
declare i1 @llvm.coro.alloc(token)
declare i8* @llvm.coro.begin(token, i8*)
declare i1 @llvm.coro.end(i8*, i1)

declare noalias i8* @myAlloc(i32)
declare double @print(double)
declare void @print2([24 x i8]*)
declare void @free(i8*)

Yes, we at Apple had similar problems and basically had to revert this internally. The basic problem is that LLVM's alloca representation is not well-suited for coroutines — we really need a guarantee that there are no stack allocations live across the coro.begin, or at least none that stay alive until the first suspend. That would be easy with something like SIL's alloc_stack, but LLVM's desire to push local allocations into the entry block really makes it difficult to even talk about the appropriate restrictions here.

Maybe we could force all local allocations in coroutines to be done with the llvm.coro.alloca.* intrinsics that I added? We could then actually verify that no allocations are live across the coro.begin. That would require a fair bit of abstraction in Clang, though, and possibly in some other passes that introduce allocas, and it might severely block optimization unless we teach mem2reg how to handle those intrinsics.

lxfind added a comment. Sep 3 2020, 9:31 AM

Thank you @rjmccall for your feedback! Yes, I agree that there seems to be no practical way to make this work 100% of the time. I made an attempt in D86859, which should at least make this more robust and handle the majority of cases. In the meantime, I will look into how this is done using coro.alloca. It would be great if you could take a look at D86859 too.