This is an archive of the discontinued LLVM Phabricator instance.

[C++20] [Coroutines] Introduce `@llvm.coro.opt.blocker` to block potentially incorrect optimizations of the awaiter
Abandoned · Public

Authored by ChuanqiXu on Aug 3 2023, 11:03 PM.

Details

Reviewers
ilya-biryukov
rjmccall
cor3ntin
weiwang
bruno
MatzeB
Group Reviewers
Restricted Project
Summary

Close https://github.com/llvm/llvm-project/issues/56301
Close https://github.com/llvm/llvm-project/issues/64151

Since the context is pretty long, I'll give a short summary here.

The issue pattern here is:

C++
struct Awaiter {
  SomeAwaitable&& awaitable;
  bool suspended;

  bool await_ready() { return false; }

  std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h) {
    // Assume we will suspend unless proven otherwise below. We must do
    // this *before* calling Register, since we may be destroyed by another
    // thread asynchronously as soon as we have registered.
    suspended = true;

    // Attempt to hand off responsibility for resuming/destroying the coroutine.
    const auto to_resume = awaitable.Register(h);

    if (!to_resume) {
      // The awaitable is already ready. In this case we know that Register didn't
      // hand off responsibility for the coroutine. So record the fact that we didn't
      // actually suspend, and tell the compiler to resume us inline.
      suspended = false;
      return h;
    }

    // Resume whatever Register wants us to resume.
    return to_resume;
  }

  void await_resume() {
    // If we didn't suspend, make note of that fact.
    if (!suspended) {
      DidntSuspend();
    }
  }
};

In the example, the program only accesses the coroutine frame conditionally. However, the optimizer fails to realize that the variable suspended may alias the coroutine handle h in await_suspend. Since the value of suspended matches the nullness of to_resume, the variable suspended is optimized out and the nullness of to_resume is stored to the coroutine frame unconditionally. That is undefined behavior if the coroutine handle gets destroyed in awaitable.Register() on another thread while we access the coroutine frame.
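
To make this concrete, here is a rough source-level sketch of what the miscompiled code effectively behaves like (illustrative only; the actual transformation happens in the IR, as shown later in this thread):

C++
std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h) {
  const auto to_resume = awaitable.Register(h);
  // The conditional store to `suspended` has effectively been sunk below
  // Register() and made unconditional. If Register() handed the coroutine
  // off and another thread has already destroyed it, this writes into a
  // destroyed coroutine frame.
  suspended = (to_resume != nullptr);
  return to_resume ? to_resume : h;
}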

The root cause of the problem is that the optimizer can't recognize the may-alias relationship between the data members of the awaiter and the coroutine handle. So initially I thought I would have to fight with alias analysis. But I realized that it is sufficient to mark the awaiter as escaped during await_suspend. This is not a workaround but a proper fix to the underlying issue, since the C++ language doesn't allow programmers to access the coroutine handle except via await_suspend.

And it is pretty easy to mark a variable as escaped: we can simply pass its address to a foreign function. That is the framework of the patch. We introduce an LLVM intrinsic @llvm.coro.opt.blocker to block the problematic optimization of the awaiter.
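
At the source level, the effect is roughly analogous to the following sketch (opaque_escape and suspend_with_blocker are hypothetical names, not part of the patch):

C++
#include <coroutine>

// Hypothetical opaque function defined in another translation unit. Passing a
// pointer to it "escapes" the pointee, so the optimizer must assume unknown
// code may read or write through that pointer from now on.
extern void opaque_escape(void*);

template <typename Awaiter>
std::coroutine_handle<> suspend_with_blocker(Awaiter& awaiter,
                                             std::coroutine_handle<> h) {
  opaque_escape(&awaiter);          // roughly the effect of @llvm.coro.opt.blocker
  return awaiter.await_suspend(h);  // conditional stores to the awaiter's members
                                    // can no longer be proven unobservable
}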

Another important point is that we shouldn't do this for an awaiter with no non-static data members. An instance of an empty class may still occupy 1 byte due to the requirements of the C++ language, but the optimizer can optimize such a variable away if it can inline await_ready, await_suspend and await_resume, so normally it doesn't matter. However, that optimization would no longer happen once we introduce @llvm.coro.opt.blocker, which is not acceptable. So part of this patch makes sure the intrinsic is not emitted for an empty awaiter.
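
For reference, a minimal sketch of such an empty awaiter, in the spirit of std::suspend_always:

C++
#include <coroutine>

// An awaiter with no non-static data members. Once await_ready/await_suspend/
// await_resume are inlined, the optimizer can remove the object entirely, so
// we don't want to emit @llvm.coro.opt.blocker for it and pin its 1-byte
// storage into the coroutine frame.
struct EmptyAwaiter {
  bool await_ready() const noexcept { return false; }
  void await_suspend(std::coroutine_handle<>) const noexcept {}
  void await_resume() const noexcept {}
};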

Ideally, it would be better to emit @llvm.coro.opt.blocker only if we observe the coroutine handle leaking from await_suspend and there is a conditional access afterwards. But that is more complex, and it is questionable whether the benefits are worth the cost; coroutines already compile slowly. So let's leave it for the future.

Diff Detail

Event Timeline

ChuanqiXu created this revision. Aug 3 2023, 11:03 PM
Herald added a project: Restricted Project. Aug 3 2023, 11:03 PM
Herald added a subscriber: hiraditya.
ChuanqiXu requested review of this revision. Aug 3 2023, 11:03 PM
Herald added a project: Restricted Project. Aug 3 2023, 11:03 PM
ChuanqiXu updated this revision to Diff 547107. Aug 3 2023, 11:12 PM
ChuanqiXu added inline comments. Aug 3 2023, 11:14 PM
clang/lib/CodeGen/CGCoroutine.cpp
158–169

Note that this may cause a crash if there are types we haven't covered. This is intentional, since such a crash is easy to fix by adding the new type to the TypeVisitor. I guess this is understandable, since the type system of the clang frontend (or should I say of C++?) is really complex.

Also note that this only matters for people who enable assertions, typically developers or testers.

For end users on a clean release build, it is completely fine if we fail to match the type; they only pay 1 byte of memory overhead. I feel this is better for the user experience.

ChuanqiXu updated this revision to Diff 547134. Aug 4 2023, 1:15 AM

Let me try to rephrase to see if I understand the problem here. In the coroutine function, the frontend allocates the awaiter object as a temporary like any other, i.e. with just an alloca. The optimizer doesn't see any escapes of that alloca, so it thinks it can freely optimize accesses to it. But in fact the awaiter object does escape and can be accessed separately somehow, so the proposed fix is just to emit a call which looks like an escape of the pointer.

I don't really understand how the awaiter object escapes here — how is it possible to access it given just a std::coroutine_handle? But given that it apparently does escape, the proposed fix seems like the best way to record that.

ChuanqiXu added a comment (edited). Aug 9 2023, 7:28 PM

Let me try to rephrase to see if I understand the problem here. In the coroutine function, the frontend allocates the awaiter object as a temporary like any other, i.e. with just an alloca. The optimizer doesn't see any escapes of that alloca, so it thinks it can freely optimize accesses to it. But in fact the awaiter object does escape and can be accessed separately somehow, so the proposed fix is just to emit a call which looks like an escape of the pointer.

Yes. Exactly.

I don't really understand how the awaiter object escapes here — how is it possible to access it given just a std::coroutine_handle?

The std::coroutine_handle refers to the coroutine frame, which is a semi-opaque object. The language doesn't allow programmers to access the local variables in the coroutine frame directly, but it does allow them to resume/destroy the corresponding coroutine. So the programmer can affect the local variables indirectly through the coroutine handle.
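
For example (a hedged sketch; consumer is a hypothetical function, possibly running on another thread):

C++
#include <coroutine>

// Hypothetical consumer of the handle passed to await_suspend.
void consumer(std::coroutine_handle<> h) {
  // Either call can free the coroutine frame, and with it the awaiter and
  // every local variable stored in the frame:
  h.destroy();    // destroys the suspended coroutine immediately
  // h.resume();  // or resumes it; the coroutine may then run to completion
                  // and free its own frame
}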

For this example,

C++
std::coroutine_handle<> await_suspend(const std::coroutine_handle<> h) {
    // Assume we will suspend unless proven otherwise below. We must do
    // this *before* calling Register, since we may be destroyed by another
    // thread asynchronously as soon as we have registered.
    suspended = true;

    // Attempt to hand off responsibility for resuming/destroying the coroutine.
    const auto to_resume = awaitable.Register(h);

    if (!to_resume) {
      // The awaitable is already ready. In this case we know that Register didn't
      // hand off responsibility for the coroutine. So record the fact that we didn't
      // actually suspend, and tell the compiler to resume us inline.
      suspended = false;
      return h;
    }

    // Resume whatever Register wants us to resume.
    return to_resume;
  }

The problem is that the call awaitable.Register(h) may destroy the coroutine referred to by h, so it is problematic to access the local variable suspended after that. The source code is fine since it only accesses the local variable conditionally. But the optimizer ignores the fact that the local variable may alias the coroutine handle, so it turns the conditional access into an unconditional one, and the problem happens.


The above explanation is also why this only matters for the awaiter: the language only exposes the coroutine handle through await_suspend, which is a member function of the awaiter. So it seems sufficient to me to treat the awaiter specially.

ChuanqiXu edited the summary of this revision. Aug 10 2023, 1:50 AM

Hmm. This is quite interesting. So we've got two things going on that aren't quite kosher from an LLVM perspective:

  • The allocas in the coroutine are potentially deallocated whenever there's a suspend. Normally this isn't a problem because code in the coroutine can't run when that happens, but:
  • The call to await_suspend is not really part of the coroutine execution, and this is really sneakily important in a lot of ways.

One of those ways that I, at least, hadn't realized before: the local variables of the await_suspend call *must not be allocated into the coroutine frame*, because otherwise they can be deallocated before they're allowed to be. This works out today because inlining adds lifetime markers to the allocas of the inlined function, and coroutine frame lowering detects when an alloca has lifetime markers that don't cross a suspend and leaves it as an alloca in the split function. But with await_suspend we're semantically reliant on that; as long as we can inline await_suspend into an unlowered coroutine, then if we have any gaps in that "optimization" at all, it can cause a miscompile.

The awaiter temporary has to go into the coroutine frame: its formal lifetime lasts until the end of the full-expression in the coroutine that contains the co_await, and of course we also have to call await_resume() after resuming from suspension. So necessarily await_suspend has to handle the this object being destroyed during the execution of the method: e.g. anything it does after enqueuing the coroutine to be asynchronously resumed is unordered (racing) with the destruction of this. If we inline await_suspend — which in general we'd really like to do because a lot of its data flow is just local to the coroutine — we have to be careful about how that's optimized, which ends up being the direct bug here. But we *also* have to be careful about how anything it might store a reference to might be optimized: the awaiter could capture a pointer to some other local variable from the coroutine, and that variable will also apparently be deallocated during its alloca lifetime.
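
To make that last case concrete, a hedged sketch (all names hypothetical):

C++
#include <coroutine>

// Hypothetical scheduler hook: hands the handle (and the captured pointer)
// to another thread, which may resume or destroy the coroutine at any time.
void enqueue_for_resumption(std::coroutine_handle<> h, int* p);

// Hypothetical awaiter that stores a pointer to another local variable of
// the awaiting coroutine. If the awaiter escapes, the pointer it holds
// escapes transitively, so the pointee must also be treated as externally
// visible even though its apparent alloca lifetime may end at a suspend.
struct CountingAwaiter {
  int* counter_p;   // points at a local in the coroutine body

  bool await_ready() { return false; }
  void await_suspend(std::coroutine_handle<> h) {
    enqueue_for_resumption(h, counter_p);
  }
  void await_resume() {}
};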

If marking an alloca as escaping (as your current patch does) is really sufficient to fix this bug, we should be good. For example, it implicitly fixes the problem with captured pointers to other allocas just by the nature of transitive escapes: if a pointer to the awaiter can escape, and the awaiter may be storing a pointer to some other variable A, then the pointer to A can also escape. I'm concerned that marking an alloca as escaping might not be sufficient, though. The optimizer does have to be a lot more conservative about optimizing accesses to an escaped alloca, but there are still *some* optimizations it can do because they can't be detected by well-behaved code. For example, if the optimizer sees a conditional store to an alloca followed by a return, it could certainly still make it unconditional (or eliminate it entirely) — the memory is guaranteed to exist, and there's no valid way to observe the store because there are no memory accesses or synchronizations before it goes out of scope. In the example, what actually blocks the optimization is that the store is followed by an intrinsic call, which the optimizer has to pessimistically assume can observe the store through the escaped pointer. So I'm worried that just marking the variable as having escaped isn't sufficient to prevent all mis-optimization here, and the analysis of why it probably won't happen seems really brittle. Maybe it's fine, though.
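
As a small non-coroutine illustration of that point (observer is a hypothetical opaque function): the transformation described above is perfectly legal in an ordinary function even though the local has escaped.

C++
// `observer` is opaque to the optimizer and may stash the pointer somewhere.
void observer(int*);

int f(bool c) {
  int x = 0;
  observer(&x);   // x escapes here
  if (c)
    x = 1;        // conditional store, but there is no later load, call, or
                  // synchronization before x's lifetime ends at the return,
  return 0;       // so no well-behaved code can observe it; the optimizer may
}                 // make the store unconditional or delete it outright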

What does the coroutine IR actually look like here? We must be setting up the coroutine frame for the suspension before the call to await_suspend, or else it won't be properly ordered before whatever dispatch we do there; but the actual suspension point must come *after* the call to await_suspend, or the optimizer's understanding of the control flow will be completely messed up. I'm wondering if we might need to keep the await_suspend call outlined somehow until after coroutine splitting.

Hmm. This is quite interesting. So we've got two things going on that aren't quite kosher from an LLVM perspective:

  • The allocas in the coroutine are potentially deallocated whenever there's a suspend. Normally this isn't a problem because code in the coroutine can't run when that happens, but:
  • The call to await_suspend is not really part of the coroutine execution, and this is really sneakily important in a lot of ways.

Yes, exactly. And for the second point, it is stated formally that the coroutine is considered suspended after await_ready returns false: http://eel.is/c++draft/expr.await#5.1. And we already considered this when designing the coroutine intrinsics: this is the job of @llvm.coro.save: https://llvm.org/docs/Coroutines.html#llvm-coro-save-intrinsic. I feel we just had an oversight before.

One of those ways that I, at least, hadn't realized before: the local variables of the await_suspend call *must not be allocated into the coroutine frame*, because otherwise they can be deallocated before they're allowed to be. This works out today because inlining adds lifetime markers to the allocas of the inlined function, and coroutine frame lowering detects when an alloca has lifetime markers that don't cross a suspend and leaves it as an alloca in the split function. But with await_suspend we're semantically reliant on that; as long as we can inline await_suspend into an unlowered coroutine, then if we have any gaps in that "optimization" at all, it can cause a miscompile.

Yes, and I don't feel it is scary. The semantics here are clear, so if such a case is miscompiled, it implies there is a bug either in CoroFrame or in the inlining pass, and we'll know how to handle it.

The awaiter temporary has to go into the coroutine frame: its formal lifetime lasts until the end of the full-expression in the coroutine that contains the co_await, and of course we also have to call await_resume() after resuming from suspension. So necessarily await_suspend has to handle the this object being destroyed during the execution of the method: e.g. anything it does after enqueuing the coroutine to be asynchronously resumed is unordered (racing) with the destruction of this. If we inline await_suspend — which in general we'd really like to do because a lot of its data flow is just local to the coroutine — we have to be careful about how that's optimized, which ends up being the direct bug here. But we *also* have to be careful about how anything it might store a reference to might be optimized: the awaiter could capture a pointer to some other local variable from the coroutine, and that variable will also apparently be deallocated during its alloca lifetime.

Yes, and while I don't think we have any gaps here, I just want to note that there are cases where we can choose not to put the awaiter into the coroutine frame: when await_ready, await_suspend and await_resume are all inlined and no non-static data members are used in await_resume. Such cases are quite common, e.g. std::suspend_always. Of course, putting the awaiter into the coroutine frame is always the safe option.

If marking an alloca as escaping (as your current patch does) is really sufficient to fix this bug, we should be good. For example, it implicitly fixes the problem with captured pointers to other allocas just by the nature of transitive escapes: if a pointer to the awaiter can escape, and the awaiter may be storing a pointer to some other variable A, then the pointer to A can also escape. I'm concerned that marking an alloca as escaping might not be sufficient, though. The optimizer does have to be a lot more conservative about optimizing accesses to an escaped alloca, but there are still *some* optimizations it can do because they can't be detected by well-behaved code. For example, if the optimizer sees a conditional store to an alloca followed by a return, it could certainly still make it unconditional (or eliminate it entirely) — the memory is guaranteed to exist, and there's no valid way to observe the store because there are no memory accesses or synchronizations before it goes out of scope. In the example, what actually blocks the optimization is that the store is followed by an intrinsic call, which the optimizer has to pessimistically assume can observe the store through the escaped pointer. So I'm worried that just marking the variable as having escaped isn't sufficient to prevent all mis-optimization here, and the analysis of why it probably won't happen seems really brittle. Maybe it's fine, though.

I think we need to trust the other optimizations for the cases you mentioned. Otherwise, not only is the design of LLVM coroutines potentially broken, but LLVM itself would not be stable either, since the cases you describe are not limited to coroutines. So I feel we have to trust the other optimizations, and if we hit any other bugs, let's fix them.

I'm wondering if we might need to keep the await_suspend call outlined somehow until after coroutine splitting.

This works. But I feel this is more or less a policy issue. The reason we put some of the coroutine functionality in the middle end is that we want to benefit from the middle-end optimizations, and in fact we do: we get much better performance than GCC for coroutines. The other side of that decision is that we hit many more middle-end bugs than GCC. In fact, many of those bugs could have been fixed automatically by moving the CoroSplit pass in front of certain other optimization passes, but that is not what we did; we always tried to find the solution with minimal impact.

What does the coroutine IR actually look like here? We must be setting up the coroutine frame for the suspension before the call to await_suspend, or else it won't be properly ordered before whatever dispatch we do there; but the actual suspension point must come *after* the call to await_suspend, or the optimizer's understanding of the control flow will be completely messed up.

The actual IR generated here before inlining is:

await.suspend:                                    ; preds = %init.ready
  %10 = call token @llvm.coro.save(ptr null)
  call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp11) #3
  %call14 = call ptr @_ZNSt7__n486116coroutine_handleIN6MyTask12promise_typeEE12from_addressEPv(ptr noundef %4)
  %call18 = call ptr @_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 8 dereferenceable(9) %ref.tmp7, ptr %call14)
  store ptr %call18, ptr %ref.tmp11, align 8
  %call20 = call noundef ptr @_ZNKSt7__n486116coroutine_handleIvE7addressEv(ptr noundef nonnull align 8 dereferenceable(8) %ref.tmp11) #3
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp11) #3
  %11 = call ptr @llvm.coro.subfn.addr(ptr %call20, i8 0)
  call fastcc void %11(ptr %call20) #3
  %12 = call i8 @llvm.coro.suspend(token %10, i1 false)
  switch i8 %12, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %cleanup21
  ]

By the semantics of https://llvm.org/docs/Coroutines.html#llvm-coro-save-intrinsic, the call to @llvm.coro.save implies that we've entered the suspended state. And in fact, @llvm.coro.save will be lowered into an update of the index in the coroutine frame. So as far as I understand, this looks basically fine.

And for the issue itself, the IR after inlining looks like:

await.suspend:                                    ; preds = %init.ready
  %10 = call token @llvm.coro.save(ptr null)
  call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp11) #3
  %suspended.i = getelementptr inbounds %struct.Awaiter, ptr %ref.tmp7, i64 0, i32 1
  store i8 1, ptr %suspended.i, align 8, !tbaa !5
  %11 = load ptr, ptr %ref.tmp7, align 8, !tbaa !11
  %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %11, ptr %4) #3
  %tobool.i.not.i = icmp eq ptr %call.i, null
  br i1 %tobool.i.not.i, label %if.then.i, label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

if.then.i:                                        ; preds = %await.suspend
  store i8 0, ptr %suspended.i, align 8, !tbaa !5
  br label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit: ; preds = %await.suspend, %if.then.i
  %retval.sroa.0.0.i = phi ptr [ %4, %if.then.i ], [ %call.i, %await.suspend ]
  store ptr %retval.sroa.0.0.i, ptr %ref.tmp11, align 8
  %12 = load ptr, ptr %ref.tmp11, align 8, !tbaa !12
  call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp11) #3
  %13 = call ptr @llvm.coro.subfn.addr(ptr %12, i8 0)
  call fastcc void %13(ptr %12) #3
  %14 = call i8 @llvm.coro.suspend(token %10, i1 false)
  switch i8 %14, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %cleanup21
  ]

await.ready:                                      ; preds = %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit, %init.ready
  %suspended.i43 = getelementptr inbounds %struct.Awaiter, ptr %ref.tmp7, i64 0, i32 1
  %15 = load i8, ptr %suspended.i43, align 8, !tbaa !5, !range !14, !noundef !15
  %tobool.not.i = icmp eq i8 %15, 0
  br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

Here %ref.tmp7 is the awaiter and %11 is the awaitable reference loaded from it (the this pointer passed to Register). The IR is correct so far: the potentially-on-the-frame variable %suspended.i (awaiter->suspended) is only accessed if the coroutine is not destroyed, and that semantics comes from the programmer. Then the optimizer finds that the value of %suspended.i (awaiter->suspended) is consistent with the nullness of the return value of _ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE and that the awaiter has not escaped, so it chooses to optimize the variable %suspended.i (awaiter->suspended) out:

await.suspend:                                    ; preds = %init.ready
  %9 = call token @llvm.coro.save(ptr null)
  %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %7, ptr %4) #3
  %tobool.i.not.i = icmp eq ptr %call.i, null
  br i1 %tobool.i.not.i, label %if.then.i, label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

if.then.i:                                        ; preds = %await.suspend
  br label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit: ; preds = %await.suspend, %if.then.i
  %ref.tmp7.sroa.5.0 = phi i8 [ 0, %if.then.i ], [ 1, %await.suspend ]
  %retval.sroa.0.0.i = phi ptr [ %4, %if.then.i ], [ %call.i, %await.suspend ]
  %10 = call ptr @llvm.coro.subfn.addr(ptr %retval.sroa.0.0.i, i8 0)
  call fastcc void %10(ptr %retval.sroa.0.0.i) #3
  %11 = call i8 @llvm.coro.suspend(token %9, i1 false)
  switch i8 %11, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %cleanup21
  ]

await.ready:                                      ; preds = %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit, %init.ready
  %ref.tmp7.sroa.5.1 = phi i8 [ %8, %init.ready ], [ %ref.tmp7.sroa.5.0, %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit ]
  %tobool.not.i = icmp eq i8 %ref.tmp7.sroa.5.1, 0
  br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

Then after a trivial optimization (SimplifyCFG):

coro.init:
  %5 = call token @llvm.coro.save(ptr null)
  %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %ref.tmp8, ptr %4) #3
  %tobool.i.not.i = icmp eq ptr %call.i, null
  %ref.tmp7.sroa.5.0 = select i1 %tobool.i.not.i, i8 0, i8 1
  %retval.sroa.0.0.i = select i1 %tobool.i.not.i, ptr %4, ptr %call.i
  %6 = call ptr @llvm.coro.subfn.addr(ptr %retval.sroa.0.0.i, i8 0)
  call fastcc void %6(ptr %retval.sroa.0.0.i) #3
  %7 = call i8 @llvm.coro.suspend(token %5, i1 false)
  switch i8 %7, label %coro.ret [
    i8 0, label %await.ready
    i8 1, label %cleanup21
  ]

await.ready:                                      ; preds = %coro.init
  %tobool.not.i = icmp eq i8 %ref.tmp7.sroa.5.0, 0
  br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

if.then.i44:                                      ; preds = %await.ready
  call void @_Z12DidntSuspendv() #3
  br label %_ZN7Awaiter12await_resumeEv.exit

Now the value %ref.tmp7.sroa.5.0 (tested by %tobool.not.i) is live across the suspend point, so it belongs in the coroutine frame. That means we write to the coroutine frame unconditionally after the call to @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE, and boom: the coroutine handle may already have been destroyed inside that foreign function. That is the complete story.


So my personal summary is that we already handle the await_suspend case with @llvm.coro.save. But we don't leak the awaiter before await_suspend, and the other optimizations don't know the relationship between the awaiter and the coroutine handle; that is where the problem comes from. So I feel we can solve the issue simply by marking the awaiter as escaped.

Yes, and I don't feel it is scary. The semantics here are clear, so if such a case is miscompiled, it implies there is a bug either in CoroFrame or in the inlining pass, and we'll know how to handle it.

Hmm. The problem is that both CoroFrame and the inliner treat those as best-effort, because they're just trying to provide optimizations, not preserve a critical semantic property. With that said, I don't have a better alternative than your suggestion of just fixing bugs as they come up.

I think we need to trust the other optimizations for the cases you mentioned. Otherwise, not only is the design of LLVM coroutines potentially broken, but LLVM itself would not be stable either, since the cases you describe are not limited to coroutines. So I feel we have to trust the other optimizations, and if we hit any other bugs, let's fix them.

I think you're misunderstanding my point. That optimization would be correct in a non-coroutine. You're relying on stronger properties here than LLVM IR guarantees.

This works. But I feel this is more or less a policy issue.

Not really, no. Ideally the performance of generated code would never be in conflict with correctness, but when it is, it becomes a clearly subordinate goal. So the obligation here is to demonstrate that we can still compile code correctly while making the more aggressive representational choices that are enabling those performance benefits. If we can't demonstrate that, we need to make less aggressive choices.

The transformation in CoroSplit is incorrect according to normal IR semantics because the execution of the call to await_suspend can become asynchronous to the execution of the coroutine, breaking normal assumptions about instruction ordering and local allocation.

And in fact, @llvm.coro.save will be lowered into an update of the index in the coroutine frame.

That isn't all that coro.save has to do — it also has to make sure that spills are written into the frame. And in fact there's a subtle thing with that I'm not sure we're getting right, which again comes back to the special behavior of await_suspend and the ways our representation choices defy ordinary IR semantics.

Consider an awaiter that schedules a coroutine to be resumed asynchronously. It is important that stores that occur before this scheduling in program order remain before it after optimization. Now, optimizations like Mem2Reg lose information about when loads and stores are performed, but that normally doesn't matter because Mem2Reg has proven that the memory isn't visible non-locally and so memory ordering is irrelevant. But control flow in a coroutine doesn't really follow the CFG when you've got a non-trivial await_suspend — the control flow of the coroutine really stops at the coro.save and then picks up again after the coro.suspend, and the execution of await_suspend is in some sense asynchronous after the coro.save. And this is directly relevant because, when Mem2Reg needs to introduce a phi, that phi will generally appear *after* the stores that went into it, which is to say, not necessarily prior to the scheduling that might happen within await_suspend.

Let's make that more concrete. Suppose that our coroutine looks like this:

void *var = nullptr;
co_await someFunction(&var);
free(var);

And suppose that the awaiter returned by someFunction contains the pointer &var, and its await_suspend does something like *var_p = malloc(16) before asynchronously scheduling the coroutine handle. In the unoptimized code, var is just part of the coro frame, and this store happens in its usual program order, and as long as that happens-before the async scheduling, it'll happen-before the resumption of the coroutine and so will be visible when we execute free(var).
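
A hedged sketch of that awaiter (names invented for illustration):

C++
#include <coroutine>
#include <cstdlib>

// Hypothetical scheduler hook: enqueues the coroutine for asynchronous
// resumption; the coroutine may resume on another thread as soon as the
// handle has been enqueued.
void schedule_async(std::coroutine_handle<> h);

struct PointerWritingAwaiter {
  void** var_p;   // &var from the coroutine body

  bool await_ready() { return false; }
  void await_suspend(std::coroutine_handle<> h) {
    *var_p = std::malloc(16);   // must happen-before the resumption that
    schedule_async(h);          // this call may trigger, so that free(var)
  }                             // sees the allocated pointer
  void await_resume() {}
};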

But consider what happens under optimization. After inlining and a little bit of optimization, we'll do the coro.save, then call malloc, store the result to var, do the coro.suspend, then load var. Mem2Reg should then directly forward that store to the load, creating a direct value dependency across the coro.suspend. CoroFrame knows how to handle this: it has to spill the value into the coro frame. But where should it spill? It cannot spill at the point where the original store was done, because that information is lost. It cannot spill at the coro.save, because the malloc call hasn't happened yet. It cannot wait until the coro.suspend to spill, because that would not necessarily be ordered before the async scheduling. Fortunately, CoroFrame will generally spill values at the point where the value was defined, which in this case happens to be approximately where the original store was, which is great.

However, consider what happens if the store was conditional within await_suspend. Mem2Reg will still do its work, but the value that's live across coro.suspend will now be a phi, and that phi will be introduced at the join point in the CFG. That point is no longer necessarily prior to the async scheduling, and so when CoroFrame spills it, it will insert a store that isn't necessarily ordered before the resumption, and we will have a miscompile.

I am very concerned we are going to be fighting problems like this indefinitely because of the way we are abusing IR.

ChuanqiXu planned changes to this revision. Aug 10 2023, 10:41 PM

Thanks for your valuable input; it makes a lot of sense. I'll try marking await_suspend as noinline as a (temporary) solution.

That isn't all that coro.save has to do — it also has to make sure that spills are written into the frame. And in fact there's a subtle thing with that I'm not sure we're getting right, which again comes back to the special behavior of await_suspend and the ways our representation choices defy ordinary IR semantics.

This is the key point that convinced me to agree to mark await_suspend as noinline for now. Currently we don't actually implement the "it also has to make sure that spills are written into the frame" semantics for coro.save. The coro.save intrinsic takes the coroutine handle as its first argument so that the coroutine handle counts as escaped, but the problem is that the other optimizations don't know that other local variables may alias that coroutine handle. So the problem you described is possible, and I think this is the real root cause of the issue. Thanks for your insight again.

ChuanqiXu abandoned this revision. Aug 16 2023, 11:58 PM