This is an archive of the discontinued LLVM Phabricator instance.

[C++] [Coroutines] Introduce `coro_always_done` attribute to help optimizations (2/2)
Needs ReviewPublic

Authored by ChuanqiXu on Sep 15 2022, 8:08 PM.

Details

Summary

Closing https://github.com/llvm/llvm-project/issues/56980

Coroutines are split into pieces, which complicates the control flow graph, so the compiler must be conservative to avoid miscompilation. However, in some cases the programmer understands the behavior of their coroutines but has no way to pass that information to the compiler. This patch addresses that.

Currently, the only implemented optimization reduces the size of the destroy function. I have other ideas, but they still need to be proven. Even so, I think this is already worthwhile: for an internal, coroutine-heavy example, the code size was reduced by 4% after I applied the attribute.
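
As a rough illustration of the intended usage (a sketch only; `Task` and `Sleep` are hypothetical types, not part of this patch), the attribute is spelled on the coroutine return type and promises that such coroutines always run to completion, so the destroy function no longer needs the cleanup code for the intermediate suspend points:

C++
#include <coroutine>
#include <string>
#include <vector>

// Trivial awaitable; a real library type would actually suspend.
struct Sleep {
    bool await_ready() const noexcept { return true; }
    void await_suspend(std::coroutine_handle<>) noexcept {}
    void await_resume() const noexcept {}
};

// The attribute goes on the coroutine return type.
struct [[clang::coro_always_done]] Task {
    struct promise_type {
        Task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        // noexcept matters here: an escaping exception could otherwise prevent
        // the coroutine from reaching its final suspend point.
        void unhandled_exception() noexcept {}
    };
};

Task demo() {
    std::vector<int> big(1024);
    std::string name = "example";
    co_await Sleep{}; // `big` and `name` live across this suspend point; without
    co_await Sleep{}; // the attribute, destroy() must be able to run their
                      // destructors when the coroutine is destroyed while suspended here.
}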

Diff Detail

Event Timeline

ChuanqiXu created this revision.Sep 15 2022, 8:08 PM
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: jdoerfert. · View Herald Transcript
ChuanqiXu requested review of this revision.Sep 15 2022, 8:08 PM
ChuanqiXu edited the summary of this revision. (Show Details)Sep 15 2022, 8:16 PM
ChuanqiXu added reviewers: erichkeane, ychen, rjmccall.
ChuanqiXu updated this revision to Diff 460685.Sep 16 2022, 2:34 AM

Add description with `__has_attribute(coro_always_done)`.

I only vaguely understand what this is supposed to do; please see if you can take another run at the AttrDocs.td changes to make it clearer from the perspective of a user.

Source changes are otherwise minor, but I'm still on the fence as to whether there is value here, or whether this is something the optimizer should be able to figure out on its own.

clang/docs/ReleaseNotes.rst
251
clang/include/clang/Basic/AttrDocs.td
6720

I don't get what this sentence means. Could you try again? I only barely understand the rest of this first section; otherwise I'd try to suggest an edit.

This documentation should also explain what the programmer is promising the compiler here.

I only vaguely understand what this is supposed to do; please see if you can take another run at the AttrDocs.td changes to make it clearer from the perspective of a user.

Yeah, I'll try to refactor later.

Source changes are otherwise minor, but I'm still on the fence as to whether there is value here, or whether this is something the optimizer should be able to figure out on its own.

Let me answer the value question first. The attribute helps with two optimizations (that I know of now; there may be more): one is implemented in the previous patch, and the other is a bit harder to implement. I'm fine with implementing both if you want, although I feel the first one alone is good enough, since the frontend changes are fairly trivial.

The first optimization is that we can shrink the destroy function, just like the example shows. I'll try to explain/refactor it more if it is not clear enough.

The second optimization is about coroutine elision, or HALO, which eliminates the dynamic allocation for coroutines. In case you're not familiar with it: dynamic allocation is in principle necessary for coroutines, and the compiler has an optimization called coroutine elision that tries to eliminate it. For example:

C++
Lazy<int> foo(); // Lazy is a classic coroutine task type.
void bar() {
     auto f = foo();
} // The coroutine frame for f will be destroyed automatically by the destructor of `Lazy`

In this example, the compiler can optimize out the allocation since the lifetime of the Lazy f is pretty simple: it is created after bar() starts and destroyed before bar() ends, so the coroutine frame of f can safely become a local variable of bar().
However, this simple case is not common in real code (of course). A more realistic and common code pattern looks like:

C++
Lazy<int> foo(); // Lazy is a classic coroutine task type.
Lazy<int> bar() {
     auto f = foo();
     co_await f; // coroutine `bar()` will be suspended, and the executor
                 // will decide when the coroutine `bar()` will be resumed.
     // other stuff
} // The coroutine frame for f will be destroyed automatically by the destructor of `Lazy`

Now, from the perspective of the compiler, it can't know anything about the executor, so it can't infer the lifetime of f. The compiler can only follow the instructions of the language spec: allocate the coroutine frame of f dynamically.

However, the programmer knows the behavior of their program. They know the coroutine bar() must be resumed, but they have no way to tell the compiler. That is where the attribute I introduced comes in: if the coroutine is marked with the new attribute, the compiler gets this information, so the attribute can enable these optimizations.

However, the second optimization is somewhat removed from the end user. It is a bit like how the compiler won't tell you the actual optimizations it performs based on the [[likely]] attribute or assume intrinsics. So I'd like to mention these optimizations only vaguely in the document.

But again, even if there is only the first optimization (shrinking the destroy function), I think it is good enough to be adopted.

ChuanqiXu updated this revision to Diff 461127.Sep 18 2022, 8:49 PM
ChuanqiXu marked an inline comment as done.

Address comments.

ChuanqiXu added a subscriber: avogelsgesang.

@avogelsgesang I guess you're able to review the docs to see if it is clear enough from the perspective of a user.

took a look at the documentation; didn't read the implementation itself

clang/docs/ReleaseNotes.rst
252

can we add a direct link to the attribute documentation here?

clang/include/clang/Basic/AttrDocs.td
6721–6726
6755
6756
6757
6761

I am a bit surprised that we apply [[clang::coro_always_done]] to the coroutine type and not the promise_type. Usually I would expect all things that influence the way the compiler generates code for coroutines to live inside the promise_type.

6776–6778
6790–6791


Thanks for the explanation, I see the justification for it. A lot of that would fit very well in the commit message (and perhaps the docs!). Please do your best to take another run at the documentation to make the intent and functionality much clearer.

ChuanqiXu added inline comments.Sep 19 2022, 7:29 PM
clang/include/clang/Basic/AttrDocs.td
6761

There are two reasons for this:

  • The current style is easier to implement. We track the relationship between a coroutine and its promise_type in the Sema (semantic analysis) part, but we generate the code in the CodeGen part, where that relation is lost. So it is easier to make the decision based on the return type.
  • Also, it wouldn't be hard to make it depend on the promise_type instead. But the other reason is that when people talk about different coroutine behaviors, we generally say "generator coroutine", "task coroutine", or even "cppcoro task coroutine", "folly task coroutine". In other words, we're always talking about the return type, and the promise_type spelling sounds more like a library implementer's tongue. (I won't argue if you say the attribute is mostly for library writers.)

So I think the current choice might not be bad.

avogelsgesang added inline comments.Sep 20 2022, 4:35 AM
clang/include/clang/Basic/AttrDocs.td
6761

when people talk about different coroutine behaviors, we generally say "generator coroutine", "task coroutine" or even "cppcoro task coroutine", "folly task coroutine"

Good point. I wonder, though, if it might make sense to apply this attribute to other return types as well, like std::future. For std::future, a library implementor is able to specialize coroutine_traits accordingly, so they can provide their own promise_type for std::future. In that case, the library writer won't be able to add the [[clang::coro_always_done]] attribute to std::future itself, but only to the promise_type injected by the coroutine_traits specialization.
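
For reference, here is a sketch of the coroutine_traits mechanism being described (simplified; a production version would forward more overloads and handle references): the injected promise_type is the only place such a library author could attach an attribute, because std::future itself cannot be modified.

C++
#include <coroutine>
#include <exception>
#include <future>
#include <utility>

// Program-provided specialization that lets std::future<T> be used as a
// coroutine return type.
template <typename T, typename... Args>
struct std::coroutine_traits<std::future<T>, Args...> {
    struct promise_type { // an attribute could only go here, not on std::future
        std::promise<T> p;
        std::future<T> get_return_object() { return p.get_future(); }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_value(T value) { p.set_value(std::move(value)); }
        void unhandled_exception() { p.set_exception(std::current_exception()); }
    };
};

std::future<int> compute() { co_return 42; }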

6780–6783

can you explain to me why exceptions are a problem here? The documentation for the always_complete_coroutine LLVM-IR attribute in https://reviews.llvm.org/D134009 does not mention this restriction.

Would exceptions still be problematic if we used the unreachable instead of the new LLVM-IR attribute as proposed in https://reviews.llvm.org/D134009#inline-1292827?

avogelsgesang added inline comments.Sep 20 2022, 8:28 AM
clang/include/clang/Basic/AttrDocs.td
6761

Another related thought:

I think instead of the [[clang::coro_always_done]] attribute, we might want to introduce a [[clang::co_await_always_resumes]] attribute. This attribute would apply to the awaitable instead of the coroutine itself.

Proposed semantics

The [[clang::co_await_always_resumes]] attribute can be applied either to a co_await expression, as in

task<int> foo();
task<int> foo2();

task<int> bar() {
   int x = [[clang::co_await_always_resumes]] co_await foo();
   co_return x + [[clang::co_await_always_resumes]] co_await foo2();
}

or directly to an Awaitable class

class [[clang::co_await_always_resumes]] task {...};

If applied to a co_await, the attribute implies that for this co_await suspension point the coroutine will always be resumed. Destroying the coroutine while suspended at this suspension point is undefined behavior.

If applied to a struct/class, all co_awaits waiting on an instance of this struct/class are guaranteed to be resumed.

Given, e.g.,

class [[clang::co_await_always_resumes]] task {...};
class std::lazy {...}; // from the standard library; not annotated with `co_await_always_resumes`

task<int> foo1();
std::lazy<int> foo2();
std::lazy<int> foo3();

task<int> bar() {
   int x = co_await foo1();
   int y = co_await foo2();
   int z = [[clang::co_await_always_resumes]] co_await foo3();
   co_return x + y + z;
}

the coroutine bar

  • cannot be destroyed at suspension point foo1, because the task class is annotated with co_await_always_resumes
  • can be destroyed at suspension point foo2, because std::lazy is not annotated with co_await_always_resumes
  • cannot be destroyed at suspension point foo3, because the co_await is annotated with co_await_always_resumes.

Whether bar can be destroyed while suspended at the initial/final suspension point can be customized by annotating the return type of task::promise_type::{initial_suspend,final_suspend} with co_await_always_resumes.

Underlying mental model

Every time I co_await on something, I am stopping the coroutine and handing over control to the Awaitable. The Awaitable then decides what it wants to do with the std::coroutine_handle and when to resume it, e.g. calling resume after some asynchronous read operation has finished. Given that the Awaitable decides when to call resume, it would be natural to let the Awaitable make the promise "I will always call resume on the received coroutine handle" by adding the co_await_always_resumes attribute.
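
A minimal sketch of that mental model (hypothetical names; post_to_executor stands in for whatever scheduling mechanism the library uses): the awaiter receives the handle in await_suspend() and is the party that promises to resume it, which is exactly the promise the proposed attribute would express.

C++
#include <coroutine>
#include <functional>

// Stand-in for the library's scheduler; here it just runs the callback
// immediately, while a real executor would defer it.
inline void post_to_executor(std::function<void()> fn) { fn(); }

// The proposed (not yet existing) attribute marks an awaiter that always
// resumes the coroutine it suspended.
struct [[clang::co_await_always_resumes]] AlwaysResumes {
    bool await_ready() const noexcept { return false; }
    void await_suspend(std::coroutine_handle<> h) {
        // The awaiter takes the handle and guarantees to call resume() on it.
        post_to_executor([h] { h.resume(); });
    }
    void await_resume() const noexcept {}
};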

Usage examples

With the co_await_always_resumes attribute, we can decide for each co_await suspension point independently whether the coroutine can be destroyed while suspended at that point. I see this capability as relevant in practice, as shown by the following examples.

Usage example: Asynchronous generators

Let's assume I have an asynchronous generator similar to

namespace io {
  template <typename T>
  class [[clang::co_await_always_resumes]] task { ... };
  class FileHandle {...};
  task<FileHandle> openFile(std::path p);
  task<std::string> readLine(FileHandle f);
}

async_generator<std::string> readLinesFromFiles(std::span<std::path> filePaths) {
     for (auto& filePath : filePaths) {
         auto fileHandle = co_await io::openFile(filePath);
         while (!fileHandle.eof_reached()) {
              std::string line = co_await io::readLine(fileHandle);
              co_yield std::move(line);
         }
     }
}

io::task<int> findLineNumberOfFirstEmptyLine(std::span<std::path> filePaths) {
    int i = 0;
    // The special syntax `for co_await (auto line : lineGenerator)` for async generators
    // was part of https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/n4775.pdf
    // but was removed in https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0664r8.html#35.
    // Although the syntax is no longer supported directly, I think the concept of async
    // generators still makes sense.
    async_generator<std::string> lineGenerator = readLinesFromFiles(filePaths);
    while (!lineGenerator.done()) {
         auto line = co_await lineGenerator.next();
         if (line.empty()) {
             co_return i;
         }
         ++i;
     }
     co_return i;
}

In this case, while suspended at the co_await openFile or co_await readLine, our coroutine is guaranteed to be resumed eventually. The compiler knows this because the io::task class which we are co_awaiting on is annotated as co_await_always_resumes. Hence, LLVM does not need to generate the corresponding destructor calls in the destroy function of our coroutine.

When suspended at the co_yield, however, the coroutine might be destroyed. In fact, the usage in findLineNumberOfFirstEmptyLine returns from the loop early and does not execute the generator until completion. Correspondingly we would not annotate the Awaitable returned by async_generator::promise_type::yield_value with co_await_always_resumes.

Usage example: Asynchronous tasks which can be destroyed while at the initial suspension point

(Use case identical to https://reviews.llvm.org/D134009#3796543)

Let's assume an execution model similar to https://github.com/lewissbaker/cppcoro, i.e. coroutines are always constructed in a suspended state, and there is a when_all function to combine multiple coroutine tasks. Given the code:

template<typename T>
struct [[clang::co_await_always_resumes]] lazy {
    struct promise_type {
        auto initial_suspend() { return std::suspend_always{}; }
        // [...] rest of the boilerplate code stripped for readability
    };
    bool await_ready() { return false; }
    std::coroutine_handle<> await_suspend(std::coroutine_handle<> hdl) { /* [...] */ }
    T await_resume() { /* [...] */ }
};

lazy<int> foo(std::unique_ptr<int>);
lazy<int> getFileSize(std::path p);
lazy<int> getFileSizeSum(std::vector<std::path> paths) {
    int sum = 0;
    for (auto p : paths) {
       sum += co_await getFileSize(p);
    }
    co_return sum;
}

lazy<int> foobar(std::vector<std::path> paths) {
     lazy<int> coro1 = foo(make_unique<int>(12));
     lazy<int> coro2 = getFileSizeSum(paths);
     auto [x,y] = co_await when_all(std::move(coro1), std::move(coro2));
     co_return x + y;
}

In this case, coro1 and coro2 can be destroyed after both coroutines finished execution (i.e. after the when_all returned). In addition, coro1 could be destroyed before it is ever co_awaited on, e.g. in case the coroutine allocator for getFileSizeSum throws. However, if coro1/coro2 ever get resumed from their initial suspend point, i.e. if they were ever co_awaited on, we know for sure that they won't be destroyed while suspended at one of the intermediate suspension points.

The above code communicates that the coroutine will not be destroyed while suspended at the co_awaits inside the function, by annotating the lazy task type with co_await_always_resumes. At the same time, initial_suspend returns std::suspend_always, which is not annotated with co_await_always_resumes. Hence, the coroutine is allowed to be destroyed at the initial suspension point.

ChuanqiXu added inline comments.Sep 20 2022, 8:27 PM
clang/include/clang/Basic/AttrDocs.td
6761

I wonder, though, if it might make sense to apply this attribute to other return types as well, like std::future. For std::future, a library implementor is able to specialize coroutine_traits accordingly, so they can provide their own promise_type for std::future. In that case, the library writer won't be able to add the [[clang::coro_always_done]] attribute to std::future itself, but only to the promise_type injected by the coroutine_traits specialization.

Yeah, the example looks compelling. I'll try to make the attribute work for promise_type.

I think instead of the [[clang::coro_always_done]] attribute, we might want to introduce a [[clang::co_await_always_resumes]] attribute.

Thanks for the write-up! But I think there are two problems: one about the implementation and one about the language semantics. Let's talk about the semantics first. Say we write:

C++
struct [[clang::co_await_always_resumes]]  awaitable {...};

We're expecting that every time the awaitable gets co_awaited, the compiler knows it will be resumed. However, that can't be true in all cases, since the behavior of co_await awaitable_instance; is also controlled by promise_type::await_transform (if present) of the enclosing coroutine, and the awaitable can't know anything about its enclosing coroutines. So the semantics look a little weird: we would need to double-check promise_type::await_transform to know the actual behavior whenever we see co_await awaitable_type{};.
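
To illustrate the concern (a sketch with hypothetical types, not code from this patch): even if NetworkRead promised "always resumes", the enclosing coroutine's promise_type can intercept it via await_transform and substitute an awaiter with different behavior, so the annotation on the awaitable type alone does not determine what actually happens at the suspend point.

C++
#include <coroutine>

struct NetworkRead { // imagine this type were annotated co_await_always_resumes
    bool await_ready() const noexcept { return true; }
    void await_suspend(std::coroutine_handle<>) noexcept {}
    void await_resume() const noexcept {}
};

struct CancellableRead { // may destroy the coroutine instead of resuming it
    bool await_ready() const noexcept { return true; }
    void await_suspend(std::coroutine_handle<>) noexcept {}
    void await_resume() const noexcept {}
};

struct Task {
    struct promise_type {
        Task get_return_object() { return {}; }
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() noexcept {}
        // The enclosing coroutine decides what `co_await NetworkRead{}` really means.
        CancellableRead await_transform(NetworkRead) { return {}; }
    };
};

Task consumer() {
    co_await NetworkRead{}; // actually awaits a CancellableRead
}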

Another point is about the implementation. I won't say it is impossible to implement, but it'll be much harder. Let me go into some detail.

First of all, from the following LLVM IR we can't infer that the suspend point is guaranteed to be resumed:

%1 = call @llvm.coro.suspend(...)
switch i8 %1, label %suspend [i8 0, label %continue2
                              i8 1, label %unreachable]

The compiler can only get the information that %1 won't be 1 in this case. From the perspective of LLVM IR, we can't know that the coroutine must be resumed from the suspend point. The idea here is that LLVM IR and C++ are two different languages, and LLVM IR has its own semantics. It doesn't make sense to require LLVM IR to behave as if its semantics had changed without actually changing them. (BTW, to be honest, we did do this for implementing symmetric transfer. It was not a good idea, but we had no choice at the time, since symmetric transfer is required by the standard and we had chosen the two-phase model. But let's go in the right direction from now on.)

So in the end, we must add something to the call @llvm.coro.suspend(...) expression above to inform the middle end about the semantic change. However, the @llvm.coro.suspend intrinsics are removed after the coroutine is split. The resume function will look like the examples I wrote in D134009:

%coro_index = load i64, ptr %frame.index_addr
switch i64 %coro_index, label %unreachable [i64 0, label %bb0
                                            i64 1, label %bb1
                                            i64 2, label %bb2
                                            ...
                                            i64 n, label bb_n]

bb0:
  ...
  %coro_next_index = add i64 %coro_index, 1
  store i64 %coro_next_index, ptr %frame.index_addr
  ret

bb1:
  ...
  %coro_next_index = add i64 %coro_index, 1
  store i64 %coro_next_index, ptr %frame.index_addr
  ret

Again, the compiler can't get the information that there is a control flow from bb_i to bb_{i+1} unless we add something else, which would make it much more complex.

So it would be much harder to implement, and I can see less benefit. I think it is sufficient to add [[clang::coro_always_done]] rather than the attribute you proposed. It is much easier to implement and useful in practice. On the one side, as the enclosing coroutine, we can be as sure as possible about the behavior (via promise_type::await_transform). On the other side, in case we want to co_await an awaiter which may not always resume, we have many options.

For example, if we're going to await a network IO operation and we can't be sure we'll get the result, we can write:

Task foo(...) { // Task is ``coro_always_done``
   co_await when_any(IO_Awaiter{...}, TimeoutAwaiter5s);
   // ... handling the results.
}

In this case, IO_Awaiter and TimeoutAwaiter5s may not be coro_always_done, depending on the implementation details of when_any, but we can still be sure that foo() will eventually complete. And there are many other ways to design the interfaces in such cases. So I think the current style is sufficient.

6780–6783

Would exceptions still be problematic if we used the unreachable instead of the new LLVM-IR attribute as proposed in https://reviews.llvm.org/D134009#inline-1292827?

Short answer: yes, there would still be problems.

can you explain to me why exceptions are a problem here? The documentation for the always_complete_coroutine LLVM-IR attribute in https://reviews.llvm.org/D134009 does not mention this restriction.

Yes. The reason D134009 doesn't mention the restriction is that LLVM IR and C++ are two different languages, so they have different semantics.

The reason is a little complex. Let's see if I can explain it clearly.

First things first: in fact, exceptions are not a problem most of the time. For example, the following is always fine:

template<typename T>
struct [[clang::coro_always_done]] task { 
    struct promise_type {
        void unhandled_exception() noexcept {...}
        ...
    };
};

As long as unhandled_exception() doesn't throw, the paragraph we're discussing is not a problem. In other words, the current paragraph says that if unhandled_exception() throws and the coroutine function is coro_always_done, the behavior is UB, although the standard requires coroutine_handle<>::done() to return true here. The reasoning behind this is a little complex; let me explain it from the perspective of C++ instead of LLVM IR to keep it simpler.

See the following code:

C++
try {
   coroutine-function-body
} catch (...) {
   unhandled_exception();
}
final_suspend:
    ...

As long as the coroutine function actually reaches the final_suspend: point every time, it is OK to mark the coroutine function as coro_always_done. Exceptions are not a problem as long as they are handled in `unhandled_exception()`, in which case the coroutine still reaches `final_suspend` normally. However, if unhandled_exception() throws, control flow can't reach final_suspend any more. And the most confusing part is that the standard says the coroutine is considered done in this case. So I think it would be confusing if people found that their coroutine_handle<>::done() always returns true, yet their code can't work with coro_always_done.

@avogelsgesang would you like to take another look?

ChuanqiXu abandoned this revision.Oct 16 2022, 8:00 PM
This comment was removed by ChuanqiXu.
ChuanqiXu reclaimed this revision.Oct 16 2022, 8:01 PM

Mis-Operations

Sorry for the late reply - and thanks for your detailed answers!

To summarize my current position:

  • I am fine with introducing a [[clang::coro_always_done]] attribute. I still think a [[clang::co_await_always_resumes]] attribute would be more flexible, but I agree with you that [[clang::coro_always_done]] also provides value, and therefore I do not oppose its introduction.
  • I am opposed to introducing a new LLVM IR attribute for this
    • I think we can reach the same codegen benefits by using unreachable inside the destroy block.
    • I understood your point that this unreachable will not be sufficient to "get the information that there is a control flow from bb_i to bb_{i+1}". But I did not yet understand why this information would be useful for the compiler and which additional optimizations it would enable.
    • To convince me otherwise, I would like to see examples of the additional optimizations that "getting the information that there is a control flow from bb_i to bb_{i+1}" unlocks.
  • I would prefer that we don't declare "throwing unhandled_exception()" undefined behavior. Users can already declare unhandled_exception() as noexcept. I see two potential solutions here:
    • We allow unhandled_exception() to throw even if [[clang::coro_always_done]] is specified. To get the full optimization potential, users can just declare their unhandled_exception as noexcept. Currently this is my preferred solution.
    • We make it a compiler error if users use [[clang::coro_always_done]] without marking unhandled_exception as noexcept. That way, we avoid undefined behavior while getting all the same optimization benefits for the destroy function.

More detailed answers/arguments are in the two sub-threads below.

clang/include/clang/Basic/AttrDocs.td
6761

Thanks for the write-up! But I think there are two problems:

Language Semantics

the behavior of co_await awaitable_instance; is also controlled by promise_type::await_transform (if present) of the enclosing coroutine

Right, there are also await_transform and operator co_await. Allowing [[clang::co_await_always_resumes]] directly on the co_await call is not a good idea. Also, we should only look for it on the Awaiter, not on the Awaitable (using the terminology from https://lewissbaker.github.io/2017/11/17/understanding-operator-co-await). That way, await_transform and operator co_await would be taken into account as part of the usual protocol to get the Awaiter from an Awaitable.

Similarly, for co_yield we would also first resolve the Awaiter and then look for the co_await_always_resumes attribute on that Awaiter.

Implementation

The compiler can only get the information that %1 won't be 1 in this case. *From the perspective* of LLVM IR, we can't know that the coroutine must be resumed from the suspend point

I think I had a misunderstanding of LLVM semantics here: My understanding so far was that the switch after a @llvm.coro.suspend has 3 outgoing edges:

  • resume: this control flow edge is still allowed by the co_await_always_resumes attribute
  • destroy: this one is forbidden by the unreachable and the optimizer knows it won't be taken
  • suspend: my understanding of this edge so far was that this is only an artificial control flow edge needed for the LLVM passes, but that this edge would not actually be allowed to be taken.

It seems my understanding of the "suspend" edge was incorrect. So far, I thought of this as undefined behavior since it would leak memory. Are you saying "In LLVM IR, it is valid to suspend a coroutine forever, i.e. we never call resume/destroy on it"?

With that out of the way, I now also understood your point

Again, the compiler can't get the information that there is a control flow from bb_i to bb_{i+1}

Thanks for patiently explaining it! :)

A couple of follow-up questions, though:

  • what would we lose by the compiler not knowing that there is control flow from bb_i to bb_{i+1}? Do you have any specific optimization in mind which would be blocked by not having this information?
  • My understanding of the coroutine design is that all optimizations across resumption points should be done before coroutine splitting. Why would we want to do some of the cross-suspend-point optimizations after splitting?
  • how does D134009 solve the problem? Afaict, with that approach we currently also lose the information that bb_i resumes at bb_{i+1} during coroutine splitting?

General discussion

I think it is sufficient to add [[clang::coro_always_done]] rather than the attribute you proposed. It is [...] useful in practice.

Agree, [[clang::coro_always_done]] is useful in practice. I am not opposed to introducing it, although I think [[clang::co_await_always_resumes]] would be more widely applicable and wouldn't be harder to implement (at least not if we just lower both as unreachable).

Implementation-wise, I am still not convinced that we indeed need a new LLVM IR attribute for this. The main difference between unreachable and the new attribute seems to be "the compiler knows that the coroutine will not be suspended forever". But I don't yet understand what the LLVM optimizer would do with this additional knowledge (see follow-up questions above).

6780–6783

Thanks for this detailed explanation. Based on some godbolt experiments (https://godbolt.org/z/e9c8Yco5v), I understood that the destroy function needs to know whether the final suspension point was reached due to an exception or through normal control flow, so that it knows whether it needs to destruct the FinalAwaitable. Is that understanding correct? Or are there additional points/nuances which I missed?

I would prefer if we don't declare "unhandled_exception throws" as undefined behavior, even if [[clang::coro_always_done]] is applied.

This is based on my current understanding that:

  1. [[clang::coro_always_done]] simplifies the destroy function, even if we still have a throwing unhandled_exception.
  2. by allowing [[clang::coro_always_done]] also on throwing unhandled_exception, it can be applied in more cases.
  3. by declaring unhandled_exception as noexcept, the user could still achieve the full optimization potential

@avogelsgesang hi, thanks for the detailed reply. After reading it, my mind has changed a little. So before replying to your comments, I want to hear your opinion on another idea: how about a coro_always_elide attribute? For example, we could mark a coroutine function with the coro_always_elide attribute, or we could mark it at the call site. Just like:

[[clang::coro_always_elide]] coro_func(); // no dynamic allocation

The idea is that this attribute is much easier to understand, and still useful. I feel that even if we reach some consensus in this discussion, other users may still find it confusing. Also, coroutine elision would be more helpful than shrinking the destroy functions.

Interesting... I will need to read up on coro_always_elide a bit before being able to form an informed opinion.
I found https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2594r0.pdf
Any other relevant resources?

Also, how does "guaranteed HALO" contribute to reducing the amount of code inside the destroy function? How does it address/help https://github.com/llvm/llvm-project/issues/56980, i.e. the starting point of this review?

My (very rough, current) understanding is:

  • HALO relies on inlining the ramp function
  • Even after HALO, the destroy function is still called through a function call (which might be inlined separately by the normal function inlining functionality)
  • HALO by itself will not simplify the destroy function

As such, I currently see independent value in always_coro_elide (avoids a heap allocation) and coro_always_done/co_await_always_resumes (removes unreachable code from the destroy function).
Both of which seem like useful goals. I don't see always_coro_elide as a better alternative to coro_always_done/co_await_always_resumes, I think both address independent goals

Interesting... I will need to read up on coro_always_elide a bit before being able to form an informed opinion.
I found https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2594r0.pdf
Any other relevant resources?

Here is the original paper about HALO: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html.
I think my proposal may be helpful too: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2477r3.html. That paper tries to let the user control coroutine elision at the standard level. I feel it may be too slow, or not feasible, to get it into the standard, so I am wondering whether it is better to make it an attribute instead.

Also, how does "guaranteed HALO" contribute to reducing the amount of code inside the destroy function? How does it address/help https://github.com/llvm/llvm-project/issues/56980, i.e. the starting point of this review?

No, it is not related. They are two separate problems.

My (very rough, current) understanding is:

  • HALO relies on inlining the ramp function
  • Even after HALO, the destroy function is still called through a function call (which might be inlined separately by the normal function inlining functionality)
  • HALO by itself will not simplify the destroy function

As such, I currently see independent value in always_coro_elide (avoids a heap allocation) and coro_always_done/co_await_always_resumes (removes unreachable code from the destroy function).
Both of which seem like useful goals. I don't see always_coro_elide as a better alternative to coro_always_done/co_await_always_resumes, I think both address independent goals

In fact, coro_always_done has two goals:

  • Reduce the size of the destroy function. (Done in this series of patches.)
  • Give helpful information to the coroutine elision optimization. (Not implemented yet; still an idea.)

Both goals are meant to be achieved through the additional control flow graph information provided by the coro_always_done attribute.

Now let me give some more background on coroutine elision. There are two conditions for performing coroutine elision:

  • The ramp function is inlined.
  • For the inlined ramp function, every path from the ramp function (the creation point) to an exit point of the enclosing function will (be known to) reach a destroy point (coroutine_handle::destroy()).

So for example, the following generator case will be elided all the way:

class generator {
    //...

    ~generator() {
        std::coroutine_handle<promise_type>::from_promise(promise).destroy();
    }

private:
    promise_type promise;
};

void foo() {
    generator g;
    // any other information
}

After inlining and some optimization, the body of foo will look like:

void foo() {
     void *g_frame = new (...); // the creation for the frame of generator g
     ...
     destroy(g_frame);
}

Every path from the creation of the coroutine frame to the exit of the function reaches the destroy call, which satisfies the condition for coroutine elision. This is why we would expect a generator never to need a dynamic allocation.

Before moving on, let me emphasize another point: the "(be known to)" part. What does this mean? See the following generator example:

C++
class generator {
    //...

    ~generator() {
        std::coroutine_handle<promise_type>::from_promise(promise).destroy();
    }

private:
    promise_type promise;
    // any other field member
};

void any_other_function(void*);

void foo() {
    generator g;
    any_other_function(&g);
    // any other information
}

then the generated code after inlining and optimization would look like:

void foo() {
     void *g_frame = new (...); // the creation for the frame of generator g
     generator g(g_frame);
     any_other_function(&g);
     // ...
     destroy(g.promise - 16); // compute the coroutine frame address from the promise address and destroy it
}

Now we can't perform coroutine elision. What's the problem? The compiler can't be sure that g.promise - 16 refers to the same coroutine frame as the one that was created, since it is entirely possible that g.promise was changed during the execution of any_other_function(...)! So, to stay on the safe side, the compiler can't assume anything here.

OK, let's move on to another point: the case where the enclosing function is itself a coroutine:

// task is the general task type.
// And let's assume that the destructor of task (or the await_resume) would destroy the corresponding task.
task g();
task f() {
   co_await g();
}

then the generated code would look like:

task f() {
    void *g_frame = new(...);
    task g(g_frame);
    auto awaiter = operator co_await(g);
    if (! awaiter.await_ready()) {
       awaiter.await_suspend();
       return; // There is an additional exit point!
    }
    auto v = awaiter.await_resume();
    destroy(g_frame);
}

Look at the above example: the second condition for coroutine elision is broken, since there is a path from the creation of the coroutine to an exit of the function which doesn't reach the destroy call. We can't be sure that the suspended coroutine f will always be resumed again. (It may be destroyed directly, or it may never be accessed again and its memory leaks.)

Now it should be clear why coro_always_done and coro_always_resume are helpful for coroutine elision: they help exactly in such cases.

So let me answer the question: why do I want to introduce coro_always_elide to replace coro_always_done/coro_always_resume? Because for coro_always_done/coro_always_resume, my original purpose was mainly coroutine elision rather than shrinking the destroy function. Shrinking the destroy function definitely helps code size (from my investigation, it can reduce the size of a coroutine-intensive project by 4%), but we care about runtime performance much more than size (and the size change alone is not compelling). And from the above analysis, you can see that coro_always_done/coro_always_resume describe the process and the reason; they are harder to understand, and there are cases they can't handle (see the any_other_function case). coro_always_elide describes the result, and it is much easier to understand. That is why I want to introduce it now.

Now it should be clear why coro_always_done and coro_always_resume are helpful for coroutine elision: they help exactly in such cases.

Yes, I understood that point now :)
Now I see why the information "that there is a control flow from bb_i to bb_{i+1}" is very valuable, and why we also need to give the middle end the information that the case "coroutine gets suspended, but never destroyed/resumed" cannot occur.

So let me answer the question: why do I want to introduce coro_always_elide to replace coro_always_done/coro_always_resume? Because for coro_always_done/coro_always_resume, my original purpose was mainly coroutine elision rather than shrinking the destroy function. Shrinking the destroy function definitely helps code size (from my investigation, it can reduce the size of a coroutine-intensive project by 4%), but we care about runtime performance much more than size (and the size change alone is not compelling).

100% with you. HALO is much more important than code size in my use case as well. The size reduction is still quite nice, and I would still take it if we can find a solution that solves both with the same mechanism.

And from the above analysis, you can see that coro_always_done/coro_always_resume describe the process and the reason; they are harder to understand, and there are cases they can't handle (see the any_other_function case). coro_always_elide describes the result, and it is much easier to understand. That is why I want to introduce it now.

Personally, I think there is still value to have both coro_always_resume and coro_always_elide.
coro_always_resume targets library authors, while coro_always_elide targets library users.
coro_always_resume could be added to the awaitable classes inside a coroutine library by the library author: the library author knows that their coroutine task type will always resume, and can provide this guarantee to the compiler. However, the author of the coroutine library cannot know what their users are doing with the task instances and whether they swap them out "under the hood".
Through coro_always_elide, the user of the library can let the compiler know: yes, I am leaking a reference to my task, but don't worry, I will not swap the coroutine_handle, you can still do HALO.
I would see coro_always_resume as the default solution which would capture 90% of the use cases. Provided library authors add it to their libraries, users rarely have to worry about it.
coro_always_elide I would see as an escape hatch for the hard cases where coro_always_resume is insufficient.

But I agree: coro_always_elide is also necessary, and I would love it if we had it in clang :)

Which brings me to another question though: How would you implement the coro_always_elide attribute?
Both on the LLVM IR level and on the C++ level. E.g., it's not completely clear to me where exactly the coro_always_elide would be placed... on the co_await?
I think we should move the discussion on coro_always_elide to discourse, so that it gets more visibility...

Yeah, it is good to know we have some consensus now.

Which brings me to another question though: How would you implement the coro_always_elide attribute?

Both on the LLVM IR level and on the C++ level. E.g., it's not completely clear to me where exactly the coro_always_elide would be placed... on the co_await?

I had some implementation experience with this before, so I am sure it is implementable. As for the C++ syntax, it should work exactly the same way as always_inline. We know that always_inline can be applied to function declarations, and recently it became applicable to statements as well. Here is an example: https://clang.llvm.org/docs/AttributeReference.html#always-inline-force-inline. So we could make it work for the following case:

C++
Task<int> foo();
Task<int> bar() {
    int i = 0;
    [[clang::coro_always_elide]] i = co_await foo(); // OK.
    int j = co_await foo(); // doesn't work, since a statement attribute can't be applied to
                            // a declaration statement; this defect **may** be fixable in the future.
    co_return i + j;
}

I think we should move the discussion on coro_always_elide to discourse, so that it gets more visibility...

Agreed. I'll send one after I send a patch.

Personally, I think there is still value to have both coro_always_resume and coro_always_elide.

Agreed. I'm only adjusting the priority here.