Page MenuHomePhabricator

[Coroutines] Optimize the lifespan of temporary co_await object
ClosedPublic

Authored by lxfind on Jun 22 2020, 9:45 AM.

Details

Summary

If we ever assign co_await to a temporary variable, such as foo(co_await expr),
we generate AST that looks like this: MaterializedTemporaryExpr(CoawaitExpr(...)).
MaterializedTemporaryExpr would emit an intrinsics that marks the lifetime start of the
temporary storage. However such temporary storage will not be used until co_await is ready
to write the result. Marking the lifetime start way too early causes extra storage to be
put in the coroutine frame instead of the stack.
As you can see from https://godbolt.org/z/zVx_eB, the frame generated for get_big_object2 is 12K, which contains a big_object object unnecessarily.
After this patch, the frame size for get_big_object2 is now only 8K. There are still room for improvements, in particular, GCC has a 4K frame for this function. But that's a separate problem and not addressed in this patch.

The basic idea of this patch is during CoroSplit, look for every local variable in the coroutine created through AllocaInst, identify all the lifetime start/end markers and the use of the variables, and sink the lifetime.start maker to the places as close to the first-ever use as possible.

Diff Detail

Event Timeline

lxfind created this revision.Jun 22 2020, 9:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 22 2020, 9:45 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Rather than doing it here, can we build await_resume call expression with MaterializedTemporaryExpr when expand the coawait expression. That's how gcc does.

Rather than doing it here, can we build await_resume call expression with MaterializedTemporaryExpr when expand the coawait expression. That's how gcc does.

There doesn't appear to be a way to do that in Clang. It goes from the AST to IR directly, and there needs to be a MaterializedTemporaryExpr to wrap the result of co_await. Could you elaborate on how this might be done in Clang?

rsmith added a subscriber: rsmith.EditedJun 23 2020, 11:35 AM

Rather than doing it here, can we build await_resume call expression with MaterializedTemporaryExpr when expand the coawait expression. That's how gcc does.

There doesn't appear to be a way to do that in Clang. It goes from the AST to IR directly, and there needs to be a MaterializedTemporaryExpr to wrap the result of co_await. Could you elaborate on how this might be done in Clang?

For a call such as:

coro f() { awaitable a; (co_await a).g(); }

we produce an AST like:

|   |   `-CXXMemberCallExpr <col:25, col:40> 'void'
|   |     `-MemberExpr <col:25, col:38> '<bound member function type>' .g 0x55d38948ca98
|   |       `-MaterializeTemporaryExpr <col:25, col:36> 'huge' xvalue
|   |         `-ParenExpr <col:25, col:36> 'huge'
|   |           `-CoawaitExpr <col:26, col:35> 'huge'
|   |             |-DeclRefExpr <col:35> 'awaitable' lvalue Var 0x55d38948d4c8 'a' 'awaitable'
|   |             |-CXXMemberCallExpr <col:35> 'bool'
|   |             | `-MemberExpr <col:35> '<bound member function type>' .await_ready 0x55d38948cee8
|   |             |   `-OpaqueValueExpr <col:35> 'awaitable' lvalue
|   |             |     `-DeclRefExpr <col:35> 'awaitable' lvalue Var 0x55d38948d4c8 'a' 'awaitable'
|   |             |-CXXMemberCallExpr <col:35> 'void'
|   |             | |-MemberExpr <col:35> '<bound member function type>' .await_suspend 0x55d38948d298
|   |             | | `-OpaqueValueExpr <col:35> 'awaitable' lvalue
|   |             | |   `-DeclRefExpr <col:35> 'awaitable' lvalue Var 0x55d38948d4c8 'a' 'awaitable'
|   |             | `-CXXConstructExpr <col:35> 'std::coroutine_handle<>':'std::experimental::coroutines_v1::coroutine_handle<void>' 'void (std::experimental::coroutines_v1::coroutine_handle<void> &&) noexcept'
|   |             |   `-ImplicitCastExpr <col:35> 'std::experimental::coroutines_v1::coroutine_handle<void>' xvalue <DerivedToBase (coroutine_handle)>
|   |             |     `-MaterializeTemporaryExpr <col:35> 'std::experimental::coroutines_v1::coroutine_handle<coro::promise_type>' xvalue
|   |             |       `-CallExpr <col:35> 'std::experimental::coroutines_v1::coroutine_handle<coro::promise_type>'
|   |             |         |-ImplicitCastExpr <col:35> 'std::experimental::coroutines_v1::coroutine_handle<coro::promise_type> (*)(void *) noexcept' <FunctionToPointerDecay>
|   |             |         | `-DeclRefExpr <col:35> 'std::experimental::coroutines_v1::coroutine_handle<coro::promise_type> (void *) noexcept' lvalue CXXMethod 0x55d38948adb0 'from_address' 'std::experimental::coroutines_v1::coroutine_handle<coro::promise_type> (void *) noexcept'
|   |             |         `-CallExpr <col:35> 'void *'
|   |             |           `-ImplicitCastExpr <col:35> 'void *(*)() noexcept' <FunctionToPointerDecay>
|   |             |             `-DeclRefExpr <col:35> 'void *() noexcept' lvalue Function 0x55d38948ef38 '__builtin_coro_frame' 'void *() noexcept'
|   |             `-CXXBindTemporaryExpr <col:35> 'huge' (CXXTemporary 0x55d38948ffb8)
|   |               `-CXXMemberCallExpr <col:35> 'huge'
|   |                 `-MemberExpr <col:35> '<bound member function type>' .await_resume 0x55d38948cff8
|   |                   `-OpaqueValueExpr <col:35> 'awaitable' lvalue
|   |                     `-DeclRefExpr <col:35> 'awaitable' lvalue Var 0x55d38948d4c8 'a' 'awaitable'

See https://godbolt.org/z/hVn2u9.

I think the suggestion is to move the MaterializeTemporaryExpr from being wrapped around the CoawaitExpr to being wrapped around the .await_resume call within it.

Unfortunately, that's not a correct change. The language rules require us to delay materializing the temporary until we see how it is used. For example, in a case such as

g(co_await a);

... no temporary is materialized at all, and the await_resume call instead directly initializes the parameter slot for the function.


It seems to me that the same issue (of making large objects unnecessarily live across suspend points) can arise for other similar cases too. For example, consider:

huge(co_await a).g(); // suppose `co_await a` returns something small

Here again we will start the lifetime of the huge temporary before the suspend point, and only initialize it after the resume. Your change won't help here, because it's not the lifetime markers of the value produced by co_await that are the problem.

As a result of the above, I think this is the wrong level at which to perform this optimization. Instead, I think you should consider whether we can move lifetime start markers later (and end markers earlier, for unescaped locals) as part of the coroutine splitting pass.

lxfind added a comment.EditedJun 23 2020, 11:56 AM

@rsmith Thanks. That's a good point. Do you know if there already exists optimization passes in LLVM that attempts to shrink the range of lifetime intrinsics? If so, I am curious why that does not help in this case. Or is it generally unsafe to move the lifetime intrinsics, and we could only do it here with specific context knowledge about coroutines.

@rsmith Thanks. That's a good point. Do you know if there already exists optimization passes in LLVM that attempts to shrink the range of lifetime intrinsics? If so, I am curious why that does not help in this case. Or is it generally unsafe to move the lifetime intrinsics, and we could only do it here with specific context knowledge about coroutines.

I don't know for sure, but I would expect someone to have implemented such a pass already. Moving a lifetime start intrinsic later, past instructions that can't possibly reference the object in question, seems like it should always be safe and (presumably) should always be a good thing to do, and similarly for moving lifetime end markers earlier. It could be that such a pass exists but it is run too late in the pass pipeline, so the coroutine split pass doesn't get to take advantage of it.

@rsmith Thanks. That's a good point. Do you know if there already exists optimization passes in LLVM that attempts to shrink the range of lifetime intrinsics? If so, I am curious why that does not help in this case. Or is it generally unsafe to move the lifetime intrinsics, and we could only do it here with specific context knowledge about coroutines.

I don't know for sure, but I would expect someone to have implemented such a pass already. Moving a lifetime start intrinsic later, past instructions that can't possibly reference the object in question, seems like it should always be safe and (presumably) should always be a good thing to do, and similarly for moving lifetime end markers earlier. It could be that such a pass exists but it is run too late in the pass pipeline, so the coroutine split pass doesn't get to take advantage of it.

@lxfind, Also lifetime marker of variable are much complex because of the existing of exceptional path(multiple lifetime start & multiple lifetime end) , so it is hard to optimize such cases.

Rather than doing it here, can we build await_resume call expression with MaterializedTemporaryExpr when expand the coawait expression. That's how gcc does.

There doesn't appear to be a way to do that in Clang. It goes from the AST to IR directly, and there needs to be a MaterializedTemporaryExpr to wrap the result of co_await. Could you elaborate on how this might be done in Clang?

For now, we only wrap coawait expression with MaterializedTemporaryExpr when the kind of result is VK_RValue, We can wrap await_resume call instead in such case when build coawait expression. so in emitSuspendExpression, we can directly emit await_call expression with MaterializedTemporaryExpr.

I think this should work, although i'm not so sure.

lxfind updated this revision to Diff 272904.EditedJun 23 2020, 8:15 PM

Tackle this problem inside CoroSplit as an optimization. Instead of only handling one particular case, we now look at every local variable in the coroutine, and sink their lifetime start markers when possible. This will bring in more benefits than doing so during IR emit. Confirmed that it works.

Herald added a project: Restricted Project. · View Herald TranscriptJun 23 2020, 8:15 PM
lxfind updated this revision to Diff 272916.Jun 23 2020, 11:03 PM

Address test failures

@lxfind, Thank you! And could you please add some testcases?

llvm/lib/Transforms/Coroutines/CoroSplit.cpp
1286

It is possible to handle multiple cast instructions as long as they are only used by lifetime marker intrinsic.

lxfind marked an inline comment as done.Jun 24 2020, 9:02 AM
lxfind added inline comments.
llvm/lib/Transforms/Coroutines/CoroSplit.cpp
1286

It's certainly possible. I didn't do it here because a reasonable compiler frontend should never emit multiple cast instructions for the same variable in order to mark lifetime. If there are, they must be used for something else.

lxfind updated this revision to Diff 273169.Jun 24 2020, 2:42 PM

Actually it seems pretty easy to handle the case of multiple BitCastInst, so did it here. Also added a test.

lxfind marked an inline comment as done.Jun 24 2020, 2:43 PM
lxfind updated this revision to Diff 273171.Jun 24 2020, 2:46 PM

A few unintended changes

lxfind edited the summary of this revision. (Show Details)Jun 24 2020, 2:54 PM
lxfind updated this revision to Diff 273173.Jun 24 2020, 2:57 PM

Remove unused variable

Harbormaster completed remote builds in B61633: Diff 273173.

There are still room for improvements, in particular, GCC has a 4K frame for this function.

I think GCC having a smaller coroutine frame is probably because it does not yet implement the allocation-elision optimisation which inlines the nested coroutine frame (which is 4K) into the parent coroutine frame.
I have not looked yet, but I suspect that you'll see within the get_big_object2() code it will perform another heap allocation for the get_big_object() frame.

Until we have more general support for async RVO, I think 8K is probably as good as we're going to be able to get (4K for this coroutine's promise and 4K for child coroutine-frame and its promise).

There are still room for improvements, in particular, GCC has a 4K frame for this function.

I think GCC having a smaller coroutine frame is probably because it does not yet implement the allocation-elision optimisation which inlines the nested coroutine frame (which is 4K) into the parent coroutine frame.
I have not looked yet, but I suspect that you'll see within the get_big_object2() code it will perform another heap allocation for the get_big_object() frame.

Until we have more general support for async RVO, I think 8K is probably as good as we're going to be able to get (4K for this coroutine's promise and 4K for child coroutine-frame and its promise).

Yes your observation is correct. It's due to the nesting of the child frame.

lxfind retitled this revision from [RFC][Coroutines] Optimize the lifespan of temporary co_await object to [Coroutines] Optimize the lifespan of temporary co_await object.Jun 26 2020, 4:26 PM
junparser accepted this revision.Jun 28 2020, 5:33 AM

LGTM, Thank you!

This revision is now accepted and ready to land.Jun 28 2020, 5:33 AM
lxfind updated this revision to Diff 273946.Jun 28 2020, 8:55 AM

Rebase before landing

This revision was automatically updated to reflect the committed changes.

@lxfind This patch causes some mismatch when variable is used in both resume and destroy function. Besides, we should move this patch and the check in buildCoroutineFrame.

@lxfind This patch causes some mismatch when variable is used in both resume and destroy function. Besides, we should move this patch and the check in buildCoroutineFrame.

@lxfind, Would you try to fix this? If you do not have time, then I'll try do this. Thanks

lxfind added a comment.Jul 1 2020, 8:52 AM

@lxfind This patch causes some mismatch when variable is used in both resume and destroy function. Besides, we should move this patch and the check in buildCoroutineFrame.

@lxfind, Would you try to fix this? If you do not have time, then I'll try do this. Thanks

Could you please help take a look, if you have a local repro? Thanks!

@lxfind This patch causes some mismatch when variable is used in both resume and destroy function. Besides, we should move this patch and the check in buildCoroutineFrame.

@lxfind, Would you try to fix this? If you do not have time, then I'll try do this. Thanks

Could you please help take a look, if you have a local repro? Thanks!

of course.