This is an archive of the discontinued LLVM Phabricator instance.

[Coroutine] Prevent value reusing across coroutine suspensions in EarlyCSE and GVN
Changes PlannedPublic

Authored by lxfind on Oct 19 2020, 10:02 AM.

Details

Summary

There are two problems with reusing values across coroutine suspensions:

  1. It has been assumed that all the code within the same function would run in the same thread. For example, the glibc library pthread_self is defined with attribute((const)) (which would have readnone attribute in LLVM IR) even though it in fact reads global memory. This was OK because within the same function the thread ID will never change in a non-coroutine function. Optimizazers take advantage of that and would reuse the results if there are multiple calls to pthread_self within the same function. However with coroutines this is no longer true. We cannot reuse the results of pthread_self if they cross suspension points because they could be running on different threads.
  2. Any data that needs to lives across coroutine suspensions will have to be spilled onto the coroutine frame (heap). Value reusing across suspensions would generate lots of data that needs to live on the frame. This can be expensive and makes the frame large unnecessarily in common cases.

Diff Detail

Event Timeline

lxfind created this revision.Oct 19 2020, 10:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 19 2020, 10:02 AM
lxfind requested review of this revision.Oct 19 2020, 10:02 AM
rjmccall added inline comments.Oct 19 2020, 10:49 AM
llvm/lib/Transforms/Scalar/GVN.cpp
2141

Different coroutine lowerings use different suspend instructions. Can you write this (and similar conditions elsewhere in the patch) so that it'll apply in any lowering? Maybe add a canSuspendCoroutine method to IntrinsicInst.

For the first point, if the IR definition isn't consistent, I'd prefer to actually fix that, instead of work around it. There are a lot of places that assume alias analysis is accurate.

For example, the glibc library pthread_self is defined with attribute((const)) (which would have readnone attribute in LLVM IR) even though it in fact reads global memory.

Can we fix up the attributes on pthread_self in clang? Or are we more generally concerned with people marking their functions const?

On a related note, we probably need to change the representation of references to thread-local variables.


For the second point, it isn't obvious to me that disabling CSE is universally profitable. We can actually end up reducing the number of values live across the suspend point in some cases. And it seems simpler to teach coroutine lowering to rematerialize instructions when it's profitable.

For the first point, if the IR definition isn't consistent, I'd prefer to actually fix that, instead of work around it. There are a lot of places that assume alias analysis is accurate.

For example, the glibc library pthread_self is defined with attribute((const)) (which would have readnone attribute in LLVM IR) even though it in fact reads global memory.

Can we fix up the attributes on pthread_self in clang? Or are we more generally concerned with people marking their functions const?

On a related note, we probably need to change the representation of references to thread-local variables.


For the second point, it isn't obvious to me that disabling CSE is universally profitable. We can actually end up reducing the number of values live across the suspend point in some cases. And it seems simpler to teach coroutine lowering to rematerialize instructions when it's profitable.

Thanks for the suggestions.
On the first point, I could certainly try fixing up pthread_self in Clang, but that would also mean we would never be able to optimize out redundant pthread_self() calls. I am not sure if that's acceptable in general. So far I only find pthread_self() to be problematic, but I am overall worried there might be more things like this that I am not aware of. Anyway, I could give it a try.

And yes thread local variables are another set of problems. I don't have a solution yet on how to handle them.

The second point makes sense to me.

but that would also mean we would never be able to optimize out redundant pthread_self() calls

We can probably mess with alias analysis so it understands that pthread_self doesn't alias operations other than calls to a coroutine suspend; that should be enough to recover the relevant optimizations. Not sure if we want to add some sort of IR attribute, or just special-case that specific library call using TargetLibraryInfo.

And yes thread local variables are another set of problems. I don't have a solution yet on how to handle them.

We probably need an intrinsic that computes the runtime address of a thread-local variable, so we compute the address at some specific point in the function.

lxfind added a comment.EditedOct 19 2020, 4:49 PM

but that would also mean we would never be able to optimize out redundant pthread_self() calls

We can probably mess with alias analysis so it understands that pthread_self doesn't alias operations other than calls to a coroutine suspend; that should be enough to recover the relevant optimizations. Not sure if we want to add some sort of IR attribute, or just special-case that specific library call using TargetLibraryInfo.

And yes thread local variables are another set of problems. I don't have a solution yet on how to handle them.

We probably need an intrinsic that computes the runtime address of a thread-local variable, so we compute the address at some specific point in the function.

After thinking about it more:
First of all, we cannot drop the readnone tag in the definition of pthread_self in Clang, the regression in the non-coroutine cases are likely unacceptable and they shouldn't pay for it if not using coroutines.
Secondly, because one can call pthread_self through indirect function calls, hence just checking for pthread_self in coroutines is not sufficient. Instead, in a coroutine function, we never want to reuse the results of function calls.
There doesn't seem to be a way to tag a callsite that it might access memory (except through operand bundles, which doesn't seem to fit here), so it seems to me there are only two possible solutions:

  1. Rewrite Clang frontend for coroutine so that it directly emit multiple functions for each suspension region. It eliminates the problem but then optimizing across multiple functions that in fact belong to one will be quite challenging and the change will be very significant.
  2. In all the relevant passes that would reuse call results (EarlyCSE and GVN as far as I know, but do let me know if there are others), do not reuse call results within a coroutine.

Since the first solution is way to heavy and has a lot of downsides, the second solution seems the way to go. It will be basically along the shape of this patch, but limit the damage to only call result sharing, not other expressions. What do you think?

First of all, we cannot drop the readnone tag in the definition of pthread_self in Clang, the regression in the non-coroutine cases are likely unacceptable and they shouldn't pay for it if not using coroutines.

Like I noted before, we can recover the optimization power by special-casing it in alias analysis. For almost all optimizations, it doesn't matter that it reads memory if that read doesn't actually alias anything.

Secondly, because one can call pthread_self through indirect function calls, hence just checking for pthread_self in coroutines is not sufficient. Instead, in a coroutine function, we never want to reuse the results of function calls.

This depends on what we decide the "const" attribute actually means, in the context of coroutines. We could decide that const actually means the value can't depend on the thread ID, then treat fixing up the libc header as a compatibility hack. Or we could decide that "const" is actually allowed to change across a coroutine suspend, in which case we need to do something more invasive.

Either way, I'd still prefer that the IR readnone exclude functions that behave like pthread_self, I think.

In all the relevant passes that would reuse call results (EarlyCSE and GVN as far as I know, but do let me know if there are others), do not reuse call results within a coroutine.

We also have a few other GVN-ish passes: NewGVN, GVNHoist and GVNSink. SimplifyCFG does a form of CSE, although I'm not sure it actually affects this case. LICM also reuses call results in a sense that's relevant here.

Even if we do catch all the cases that are relevant right now, having an "extra" form of memory access that isn't visible to alias analysis makes understanding existing code, and writing new code, harder.


I think this discussion is getting to the point where it would benefit from a wider audience on llvm-dev.

but that would also mean we would never be able to optimize out redundant pthread_self() calls

We can probably mess with alias analysis so it understands that pthread_self doesn't alias operations other than calls to a coroutine suspend; that should be enough to recover the relevant optimizations. Not sure if we want to add some sort of IR attribute, or just special-case that specific library call using TargetLibraryInfo.

And yes thread local variables are another set of problems. I don't have a solution yet on how to handle them.

We probably need an intrinsic that computes the runtime address of a thread-local variable, so we compute the address at some specific point in the function.

After thinking about it more:
First of all, we cannot drop the readnone tag in the definition of pthread_self in Clang, the regression in the non-coroutine cases are likely unacceptable and they should pay for it if not using coroutines.
Secondly, because one can call pthread_self through indirect function calls, hence just checking for pthread_self in coroutines is not sufficient. Instead, in a coroutine function, we never want to reuse the results of function calls.
There doesn't seem to be a way to tag a callsite that it might access memory (except through operand bundles, which doesn't seem to fit here), so it seems to me there are only two possible solutions:

  1. Rewrite Clang frontend for coroutine so that it directly emit multiple functions for each suspension region. It eliminates the problem but then optimizing across multiple functions that in fact belong to one will be quite challenging and the change will be very significant.
  2. In all the relevant passes that would reuse call results (EarlyCSE and GVN as far as I know, but do let me know if there are others), do not reuse call results within a coroutine.

Since the first solution is way to heavy and has a lot of downsides, the second solution seems the way to go. It will be basically along the shape of this patch, but limit the damage to only call result sharing, not other expressions. What do you think?

First of all, we cannot drop the readnone tag in the definition of pthread_self in Clang, the regression in the non-coroutine cases are likely unacceptable and they shouldn't pay for it if not using coroutines.

Like I noted before, we can recover the optimization power by special-casing it in alias analysis. For almost all optimizations, it doesn't matter that it reads memory if that read doesn't actually alias anything.

Secondly, because one can call pthread_self through indirect function calls, hence just checking for pthread_self in coroutines is not sufficient. Instead, in a coroutine function, we never want to reuse the results of function calls.

This depends on what we decide the "const" attribute actually means, in the context of coroutines. We could decide that const actually means the value can't depend on the thread ID, then treat fixing up the libc header as a compatibility hack. Or we could decide that "const" is actually allowed to change across a coroutine suspend, in which case we need to do something more invasive.

Either way, I'd still prefer that the IR readnone exclude functions that behave like pthread_self, I think.

In all the relevant passes that would reuse call results (EarlyCSE and GVN as far as I know, but do let me know if there are others), do not reuse call results within a coroutine.

We also have a few other GVN-ish passes: NewGVN, GVNHoist and GVNSink. SimplifyCFG does a form of CSE, although I'm not sure it actually affects this case. LICM also reuses call results in a sense that's relevant here.

Even if we do catch all the cases that are relevant right now, having an "extra" form of memory access that isn't visible to alias analysis makes understanding existing code, and writing new code, harder.


I think this discussion is getting to the point where it would benefit from a wider audience on llvm-dev.

Thanks. I think I get it now. I will write up something in more details and post in llvm-dev.
To summarize:

  1. In Clang we special handle the pthread_self() function declaration and remove the readnone attribute.
  2. In Alias Analysis we make it such that pthread_self() only interferes with Coroutine suspension intrinsics.
lxfind planned changes to this revision.Nov 10 2020, 10:00 AM