This is an archive of the discontinued LLVM Phabricator instance.

[LICM][Coroutine] Don't sink stores from loops with coro.suspend instructions
ClosedPublic

Authored by lxfind on Feb 17 2021, 8:02 PM.

Details

Summary

See pr46990(https://bugs.llvm.org/show_bug.cgi?id=46990). LICM should not sink store instructions to loop exit blocks which cross coro.suspend intrinsics. This breaks semantic of coro.suspend intrinsic which return to caller directly. Also this leads to use-after-free if the coroutine is freed before control returns to the caller in multithread environment.

This patch disable promotion by check whether loop contains coro.suspend intrinsics.
This is a resubmit of D86190.
Disabling LICM for loops with coroutine suspension is a better option not only for correctness purpose but also for performance purpose.
In most cases LICM sinks memory operations. In the case of coroutine, sinking memory operation out of the loop does not improve performance since coroutien needs to get data from the frame anyway. In fact LICM would hurt coroutine performance since it adds more entries to the frame.

Diff Detail

Event Timeline

lxfind created this revision.Feb 17 2021, 8:02 PM
lxfind requested review of this revision.Feb 17 2021, 8:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 17 2021, 8:02 PM
ChuanqiXu added inline comments.Feb 17 2021, 10:16 PM
llvm/lib/Transforms/Scalar/LICM.cpp
370

coroutine

please see comments of D87817.

please see comments of D87817.

I don't think we should go that route, because LICM will mostly hurt coroutine, as I explained in the summary.
Hence we should simply disable LICM for coroutine, and I don't think this is a temporary change. What do you think?

@efriedma, do you think the inlined document is sufficient? Or would you prefer to see it somewhere else too?

I want to see changes to LangRef and/or the coroutine documentation to describe the semantic restriction. If there's a correctness issue, it clearly isn't specific to LICM, so I want to see the rule described in general terms. And please ping llvm-dev when you have it written up.

It's not completely clear to me that disabling sinking is always a performance win, but it's not that important.

please see comments of D87817.

I don't think we should go that route, because LICM will mostly hurt coroutine, as I explained in the summary.
Hence we should simply disable LICM for coroutine, and I don't think this is a temporary change. What do you think?

I do not think we should disable LICM for coroutine, also this is not semantic restriction of coroutine (GCC does not do this). It just caused by current pipeline of llvm coroutine as well as debug info issues. I was thinking maybe we can invoke corosplit as early as possible (not considering performance). Anyway , we can discuss this in D95807.

please see comments of D87817.

I don't think we should go that route, because LICM will mostly hurt coroutine, as I explained in the summary.
Hence we should simply disable LICM for coroutine, and I don't think this is a temporary change. What do you think?

I do not think we should disable LICM for coroutine, also this is not semantic restriction of coroutine (GCC does not do this). It just caused by current pipeline of llvm coroutine as well as debug info issues. I was thinking maybe we can invoke corosplit as early as possible (not considering performance). Anyway , we can discuss this in D95807.

Could you explain why we want to enable LICM for coroutine (from performance perspective)?
My theory is this:

  • majority of LICM should be memory operations. Constant function calls that are not inlined should be relatively rare compared to memory operations.
  • For memory operations, LICM does not reduce the number of memory operations in the loop in the case of coroutine; instead it adds one extra entry to the coroutine frame, increasing memory usage.

please see comments of D87817.

I don't think we should go that route, because LICM will mostly hurt coroutine, as I explained in the summary.
Hence we should simply disable LICM for coroutine, and I don't think this is a temporary change. What do you think?

I do not think we should disable LICM for coroutine, also this is not semantic restriction of coroutine (GCC does not do this). It just caused by current pipeline of llvm coroutine as well as debug info issues. I was thinking maybe we can invoke corosplit as early as possible (not considering performance). Anyway , we can discuss this in D95807.

Indeed GCC splits coroutine very early one so it doesn't get exposed to lots of issues like Clang does.
However if we do that, there will be no chance to optimize and we will always end up with a huge coroutine frame (unless we can redesign it in a way that can still optimize the coroutine frame post-split.

please see comments of D87817.

I don't think we should go that route, because LICM will mostly hurt coroutine, as I explained in the summary.
Hence we should simply disable LICM for coroutine, and I don't think this is a temporary change. What do you think?

I do not think we should disable LICM for coroutine, also this is not semantic restriction of coroutine (GCC does not do this). It just caused by current pipeline of llvm coroutine as well as debug info issues. I was thinking maybe we can invoke corosplit as early as possible (not considering performance). Anyway , we can discuss this in D95807.

Could you explain why we want to enable LICM for coroutine (from performance perspective)?
My theory is this:

  • majority of LICM should be memory operations. Constant function calls that are not inlined should be relatively rare compared to memory operations.
  • For memory operations, LICM does not reduce the number of memory operations in the loop in the case of coroutine; instead it adds one extra entry to the coroutine frame, increasing memory usage.

LICM move the memory operations out of the loop. It does reduce the number of memory operations. More importantly, I agree with @efriedma that either we have general solution or we describe these restrictions other than fix them anywhere.

lxfind added a comment.EditedFeb 22 2021, 10:27 AM

LICM move the memory operations out of the loop. It does reduce the number of memory operations. More importantly, I agree with @efriedma that either we have general solution or we describe these restrictions other than fix them anywhere.

Let me elaborate in more detail:
This patch does not attempt to disable the entire LICM in the presence of coroutines. Instead, it disables a specific part of LICM: promoting memory references to scalars.
Such promotion works by sinking stores out of the loop and moving loads to before the loop. Let's look at each of the two cases and see why they won't reduce memory operations for coroutines:

  1. Sinking stores out of the loop. LICM sinks stores out of the loop by turning the memory stores into scalar stores, and then outside of the loop it stores the scalars into the memory. So it is important to note that LICM introduces a scalar in the loop that needs to stay alive until the loop ends, so that it can store that scalar into the memory. In the presence of coroutine, that is, the loop can suspend and resume, anything that needs to live through the loop will need to be put on the coroutine frame (i.e. heap). So even though LICM can turn the memory store into a scalar store, with coroutine, that scalar needs to live on the coroutine frame and hence scalar store will eventually become a memory store again. So effectively as you can see, we still have the same number of memory stores in the loop, and further more we introduced one more entry in the frame to store the sunk scalar.
  2. Moving loads to before the loop. The reasoning is similar here. In order to move loads to before the loop, we need a scalar to store the result of the load, so that we can access that scalar within the loop. However in the presence of coroutine, if the scalar value needs to live through the loop, it also needs to be put on the coroutine frame, which is the heap. Hence every read of the scalar value in the loop is still a memory load. We still end up with the same number of memory loads, and we also added one more entry to the frame.

Does this make sense?

LICM move the memory operations out of the loop. It does reduce the number of memory operations. More importantly, I agree with @efriedma that either we have general solution or we describe these restrictions other than fix them anywhere.

Let me elaborate in more detail:
This patch does not attempt to disable the entire LICM in the presence of coroutines. Instead, it disables a specific part of LICM: promoting memory references to scalars.
Such promotion works by sinking stores out of the loop and moving loads to before the loop. Let's look at each of the two cases and see why they won't reduce memory operations for coroutines:

  1. Sinking stores out of the loop. LICM sinks stores out of the loop by turning the memory stores into scalar stores, and then outside of the loop it stores the scalars into the memory. So it is important to note that LICM introduces a scalar in the loop that needs to stay alive until the loop ends, so that it can store that scalar into the memory. In the presence of coroutine, that is, the loop can suspend and resume, anything that needs to live through the loop will need to be put on the coroutine frame (i.e. heap). So even though LICM can turn the memory store into a scalar store, with coroutine, that scalar needs to live on the coroutine frame and hence scalar store will eventually become a memory store again. So effectively as you can see, we still have the same number of memory stores in the loop, and further more we introduced one more entry in the frame to store the sunk scalar.
  2. Moving loads to before the loop. The reasoning is similar here. In order to move loads to before the loop, we need a scalar to store the result of the load, so that we can access that scalar within the loop. However in the presence of coroutine, if the scalar value needs to live through the loop, it also needs to be put on the coroutine frame, which is the heap. Hence every read of the scalar value in the loop is still a memory load. We still end up with the same number of memory loads, and we also added one more entry to the frame.

Does this make sense?

Yes, lots of the cases follow this rules. However, as long as the values stored are loop invariant, then LICM should move store out of the loop to reduce memory operation.

BTW, although this patch is OK enough for pr46990 and use-after-free issue, Fixing these issues case by case for coroutine is not good enough. That's the reason why i agree with efriedma.

BTW, although this patch is OK enough for pr46990 and use-after-free issue, Fixing these issues case by case for coroutine is not good enough. That's the reason why i agree with efriedma.

The problem I see right now is there is no general fix available for this problem. From LLVM IR perspective, the default edge of the coro.suspend switch is not something reliable to discover, nor something developers can avoid moving instructions over. So even if we want to document, I don't know what to document in LangRef other than explaining in the Coroutines.ts on how this works and what the issue is.
Perhaps we need to introduce a new IR instruction for suspend instead of relying on intrinsics, but that's not going to be something we can redesign in a short amount of time.
I am happy to add more detailed documentation in Coroutines.ts. But beyond that I don't see much I can do here.

BTW, although this patch is OK enough for pr46990 and use-after-free issue, Fixing these issues case by case for coroutine is not good enough. That's the reason why i agree with efriedma.

The problem I see right now is there is no general fix available for this problem. From LLVM IR perspective, the default edge of the coro.suspend switch is not something reliable to discover, nor something developers can avoid moving instructions over. So even if we want to document, I don't know what to document in LangRef other than explaining in the Coroutines.ts on how this works and what the issue is.
Perhaps we need to introduce a new IR instruction for suspend instead of relying on intrinsics, but that's not going to be something we can redesign in a short amount of time.
I am happy to add more detailed documentation in Coroutines.ts. But beyond that I don't see much I can do here.

@efriedma any idea?

Perhaps we need to introduce a new IR instruction for suspend instead of relying on intrinsics, but that's not going to be something we can redesign in a short amount of time.

How about add fence instruction between coro.suspend and ret block?

Perhaps we need to introduce a new IR instruction for suspend instead of relying on intrinsics, but that's not going to be something we can redesign in a short amount of time.

How about add fence instruction between coro.suspend and ret block?

Can you elaborate using the test case from this patch?
The problem is at the edge from the switch to the block. There doesn't seem a way to prevent from sending instructions over an edge.

I'd be satisfied with a description of the issue in Coroutines.ts. Having a description of the problem is the first step to fixing it.

Yes, lots of the cases follow this rules. However, as long as the values stored are loop invariant, then LICM should move store out of the loop to reduce memory operation.

You are right that in some cases LICM does reduce memory operations for coroutines.
Now that we have all agreed on the documentation part, let me think a bit more about this patch vs D87817

lxfind updated this revision to Diff 326602.Feb 25 2021, 10:52 PM

Add more detailed documentation

lxfind added a comment.Mar 2 2021, 1:54 PM

Yes, lots of the cases follow this rules. However, as long as the values stored are loop invariant, then LICM should move store out of the loop to reduce memory operation.

You are right that in some cases LICM does reduce memory operations for coroutines.
Now that we have all agreed on the documentation part, let me think a bit more about this patch vs D87817

I concluded that there is no better way than disabling LICM for coroutines right now.

LGTM, you may need @efriedma's approval

efriedma accepted this revision.Mar 3 2021, 3:00 PM

LGTM

This revision is now accepted and ready to land.Mar 3 2021, 3:00 PM