This is an archive of the discontinued LLVM Phabricator instance.

[coroutines] Promote cleanup.dest.slot-like allocas to registers to avoid storing them in the coroutine frame
AbandonedPublic

Authored by GorNishanov on Aug 23 2017, 10:52 PM.

Download Raw Diff

Details

Reviewers

Summary

Allocas that follow the pattern of cleanup.dest.slot, namely, they are
integer allocas that don't escape and only store constants. We don't want
them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

For example in this coroutine %slot will be promoted to register even in -O0

define i8* @happy.case() "coroutine.presplit"="1" {
entry:
  %slot = alloca i32
  %id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
  %size = call i32 @llvm.coro.size.i32()
  %alloc = call i8* @malloc(i32 %size)
  %hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)

  store i32 1, i32* %slot

  %tok = call i8 @llvm.coro.suspend(token none, i1 false)
  switch i8 %tok, label %suspend [i8 0, label %resume
                                i8 1, label %cleanup]
resume:
  store i32 2, i32* %slot
  br label %cleanup

cleanup:
  %mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
  call void @free(i8* %mem)
  %x = load i32, i32* %slot
  call void @print.i32(i32 %x)
  br label %suspend
suspend:
  call i1 @llvm.coro.end(i8* %hdl, i1 0)
  ret i8* %hdl
}

This fixes PR34289

Diff Detail

Event Timeline

GorNishanov created this revision.Aug 23 2017, 10:52 PM

Herald added subscribers: EricWF, qcolombet. · View Herald TranscriptAug 23 2017, 10:52 PM

eric_niebler added inline comments.Aug 25 2017, 1:40 PM

lib/Transforms/Coroutines/CoroFrame.cpp
789 ↗	(On Diff #112504)	Knowing nothing about how this code works, it seems strange to me that you are ignoring the last instruction in the basic block, but the comment says you're looking at all instructions.

GorNishanov marked an inline comment as done.Aug 25 2017, 1:47 PM

GorNishanov added inline comments.

lib/Transforms/Coroutines/CoroFrame.cpp

789 ↗

(On Diff #112504)

Block should always end in a terminator instruction. AllocaInst is not a terminator, hence, no need to check for it. Here is another place with a similar loop (from where it was shamelessly stolen from :-))

C:\GitHub\llvm\lib\Transforms\Utils\Mem2Reg.cpp:
   39      // Find allocas that are safe to promote, by looking at all instructions in
   40      // the entry node
   41:     for (BasicBlock::iterator I = BB.begin(), E = --BB.end(); I != E; ++I)
   42        if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) // Is it an alloca?
   43          if (isAllocaPromotable(AI))

I'd love to see this approved. It's a bad bug.

How are integer allocas special?

In D37093#857223, @majnemer wrote:

How are integer allocas special?

clang synthesizes integer allocas for flags and cleanup states to implement dispatching of cleanup code to decide whether something is fully constructed and need destruction and to decide where to go after common cleanup sequence is executed. In O2, those synthetic variables are gone. In O0, they stay as variables and coroutine frame building code will put them into the coroutine frame, that may lead to accessing them for cleanup purposes in the initiation function after coroutine frame is already destroyed due to coroutine never suspending.

Potentially this code can be generalized to deal with arbitrarily typed allocas with this pattern, but, I wanted to keep it only for integers since that is the scenario (clang synthesized flags and destination slot variables) I wanted to address.

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

In D37093#858860, @majnemer wrote:

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

In D37093#858860, @majnemer wrote:

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I can run the promotion of alloca in CoroEarly before any optimization passes could interfere. Essentially, cleaning up after the frontend "heavy handed" :-) cleanup code codegen.

In D37093#859380, @hfinkel wrote:

In D37093#858860, @majnemer wrote:

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

Most of the data in the coroutine folds neatly into two categories:

Data that can be used after resume (all of the variables, temporaries and parameter copies in the user authored body.
Data used only during coroutine creation when it is first invoked, such as parameters and coroutine return value, such as generator<int>, task<void>, etc.

Cleanup destination slot breaks this pattern. It is used both before coroutine state is created, during coroutine body execution and after coroutine state is destroyed.

I was thinking for awhile and was not able to come up with a simple algorithm to deal with it apart from the one offered in this patch.

In D37093#860080, @GorNishanov wrote:

In D37093#859380, @hfinkel wrote:

In D37093#858860, @majnemer wrote:

Seems fragile. What if we run optimizations and transform the IR to another, different, version of the IR but the alloca is not promotable anymore? On occasion, the compiler can pessimize a code sequence and coroutine lowering should be resilient to such things.

I agree. If we take a step back, what's necessary to make the current algorithm sound in general?

Most of the data in the coroutine folds neatly into two categories:

Data that can be used after resume (all of the variables, temporaries and parameter copies in the user authored body.

Data used only during coroutine creation when it is first invoked, such as parameters and coroutine return value, such as generator<int>, task<void>, etc.

Cleanup destination slot breaks this pattern. It is used both before coroutine state is created, during coroutine body execution and after coroutine state is destroyed.

I was thinking for awhile and was not able to come up with a simple algorithm to deal with it apart from the one offered in this patch.

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

The problem this patch is fixing only occurs in -O0, at any other optimization levels, mem2reg pass during SROA eliminates all of the cleanup related allocas (and more).
I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot), so I am looking for a pattern used in those, namely, those are integer allocas that have only constants stored in them.

In D37093#860506, @GorNishanov wrote:

I still don't see what checking that the type is an integer or the value is a constant buys you. Why can't you look for allocas that are used after the call to llvm.coro.free? Isn't that the relevant factor?

The problem this patch is fixing only occurs in -O0, at any other optimization levels, mem2reg pass during SROA eliminates all of the cleanup related allocas (and more).
I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

so I am looking for a pattern used in those, namely, those are integer allocas that have only constants stored in them.

I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

Because as Gor writes at the top:

We don't want them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

This bug manifests as a codegen use-after-free in -O0.

In D37093#860565, @eric_niebler wrote:

I do not want to promote all promotable allocas, I only would like to target cleanup allocas (flags and dest.slot),

Why? It sounds like you're trying to optimize some aspect of the -O0 lowering.

Because as Gor writes at the top:

We don't want them to be saved into the coroutine frame (as some of the cleanup code may
access them after coroutine frame destroyed).

This bug manifests as a codegen use-after-free in -O0.

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

For cleanup, clang creates a pattern that current coroutine frame building algorithm cannot handle. To handle it, we will probably need to convert allocas to SSA and see def/use chains, but, once converted to SSA, the problematic pattern disappears.
Another approach is to prove that for a particular alloca store-load pairs never cross a suspend point, if that is the case, we can duplicate the alloca (so that every coroutine part has its own copy) since the value is never carried over from one suspend/resume to another. However, this algorithm seems complicated and have to be hand built, whereas, we can reuse well-tested alloca promotion algorithm to eliminate the problematic pattern.

Probably moving the promotion from CoroSplit to CoroEarly which rans immediately after front-end will help to make sure that later optimization passes will not interfere with cleanup.dest alloca promotion.

In D37093#860646, @GorNishanov wrote:

That's fine, but does not in itself explain why the variable cannot be extracted from the frame, or otherwise made available to the cleanup code. I'm not worried about the size of the stack frame at -O0 being larger than necessary (it will, compared to optimized code, be larger regardless). Moreover, the lowering algorithm must be sound at the IR level (not just in a state where it happens to work for some patterns that Clang happens to create).

For cleanup, clang creates a pattern that current coroutine frame building algorithm cannot handle. To handle it, we will probably need to convert allocas to SSA and see def/use chains, but, once converted to SSA, the problematic pattern disappears.
Another approach is to prove that for a particular alloca store-load pairs never cross a suspend point, if that is the case, we can duplicate the alloca (so that every coroutine part has its own copy) since the value is never carried over from one suspend/resume to another. However, this algorithm seems complicated and have to be hand built, whereas, we can reuse well-tested alloca promotion algorithm to eliminate the problematic pattern.

Probably moving the promotion from CoroSplit to CoroEarly which rans immediately after front-end will help to make sure that later optimization passes will not interfere with cleanup.dest alloca promotion.

I'm missing something here. So the problem is this code, right:

%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
call void @free(i8* %mem)
%x = load i32, i32* %slot

If we have an alloca, %slot in this case, that is promoted to live in the coroutine frame, why don't you just extract it again if it's needed after the call to coro.free?

Currently, this gets transformed into:

define internal fastcc void @happy.case.destroy(%happy.case.Frame* %FramePtr) {
entry.destroy:
  %vFrame = bitcast %happy.case.Frame* %FramePtr to i8*
  %slot.reload.addr = getelementptr inbounds %happy.case.Frame, %happy.case.Frame* %FramePtr, i32 0, i32 4
  call void @free(i8* %vFrame)
  %x = load i32, i32* %slot.reload.addr
  call void @print.i32(i32 %x)
  ret void
}

How about, for all uses of alloca after the original call to @llvm.coro.free, you should create a new alloca in the foo.destroy function, copy the value from the frame into that new alloca, and then use that alloca in the later code? Like this:

define internal fastcc void @happy.case.destroy(%happy.case.Frame* %FramePtr) {
entry.destroy:
  %slot = alloca i32
  %vFrame = bitcast %happy.case.Frame* %FramePtr to i8*
  %slot.reload.addr = getelementptr inbounds %happy.case.Frame, %happy.case.Frame* %FramePtr, i32 0, i32 4
  call void @llvm.memcpy.p0i32.p0i32.i32(i32* %slot, i32* %slot.reload.addr, i32 4, i32 4, i1 false)
  call void @free(i8* %vFrame)
  %x = load i32, i32* %slot
  call void @print.i32(i32 %x)
  ret void
}

In D37093#861535, @hfinkel wrote:

How about, for all uses of alloca after the original call to @llvm.coro.free, you should create a new alloca in the foo.destroy function, copy the value from the frame into that new alloca, and then use that alloca in the later code? Like this:

I see where you were going. I'll prototype and see how it looks. Thank you for your feedback.

Moved the promotion to CoroEarly to make sure no optimization passes can intervene.

@hfinkel, I looked at your suggestion, but, I think it is difficult to make it work in the general case and the transformation should be done before coroutine is split.
Consider a case of a fire and forget coroutine (no final suspend point, so it cleans itself when it runs to completion.
Now, if we keep destination slot in the coroutine frame and try to patch up the load, similar to your suggestion, in the initial function, there is no safe place to place the load, since the coroutine frame can be destroyed by a different thread that resumed the coroutine and it managed to run to the completion before the initiating function finished. The approach in this patch is simple, uses existing algorithm and takes care of this case by making sure that cleanup slot is never stored in memory.

Added the test case

Bump!

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

In D37093#910909, @GorNishanov wrote:

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

In D37093#910951, @rjmccall wrote:

In D37093#910909, @GorNishanov wrote:

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

In D37093#910964, @rjmccall wrote:

In D37093#910951, @rjmccall wrote:

In D37093#910909, @GorNishanov wrote:

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

coro-early pass is run as early pass during "cleanup after a frontend" phase of the optimizer. Thus, arbitrarly inlined code cannot happen. I also put out for review an alternative fix that eliminates cleanup.dest.slot for coroutines in the CodeGenFunction::FinishFunction: https://reviews.llvm.org/D39768

In D37093#918717, @GorNishanov wrote:

In D37093#910964, @rjmccall wrote:

In D37093#910951, @rjmccall wrote:

In D37093#910909, @GorNishanov wrote:

@rjmccall suggested to do the cleanup dest slot elimination wholesale for all functions in clang. (Not specific for coroutines).
That is a sweeping change :-), so, I don't mind still having this here to unblock Facebook and when the clang change lands, remove this part from llvm.

Well, I said that I wouldn't be opposed to cleaning up the cleanup dest slot unconditionally. I expressed no opinion about whether that was a reasonable way to fix this problem in your representation. (This doesn't affect my lowering as coro.end is always a terminator.)

I think you probably do need to solve this problem more generally, as it sounds like any use of an alloca in the post-coro.end section of the function will lead to miscompiles. It sounds like you have some control of the post-coro.end generated code, so maybe allocas don't normally affect you there — but from the fact that you have uses of the cleanup dest, I assume you're branching through cleanups, which means you have calls, which means you can have arbitrary code there if the callee is marked always_inline.

coro-early pass is run as early pass during "cleanup after a frontend" phase of the optimizer. Thus, arbitrarly inlined code cannot happen.

Hmm. After digging through source code for a while, it looks like this does precede the always-inliner, chiefly by virtue of being added to the function pass manager which seems to be run before the module passes. Still, this really feels like a hack that's going to come back to bite us. What if there's some weird control-flow feature elsewhere in the function that prevents mem2reg from working completely? I assume the direct bug you're trying to fix is that the frame is destroyed before your last use of the cleanup-dest alloca; is there a strong argument why this problem is specific to just the cleanup-dest alloca? Are we going to end up tightly integrating coroutine emission (which, with all due respect, is going to be very lightly tested in terms of the language features it interacts with) with of the cleanup and cleanup-emission logic across IRGen? I'm sorry if some of these points have already been hashed out elsewhere; if you want to just refer me to that discussion, please do.

If this really is special to cleanup dests, and we have a compelling argument that mem2reg will always succeed, then I think we can accept this patch. The other option is to find an IR-generaton pattern that avoids needing to access the cleanup-dest alloca after the destruction of the frame at all.

Abandoned in favor of https://reviews.llvm.org/rL317981

Revision Contents

Path

Size

lib/

Transforms/

Coroutines/

CoroEarly.cpp

59 lines

test/

Transforms/

Coroutines/

coro-cleanup-dest-promote.ll

119 lines

Diff 117393

lib/Transforms/Coroutines/CoroEarly.cpp

//===- CoroEarly.cpp - Coroutine Early Function Pass ----------------------===//		//===- CoroEarly.cpp - Coroutine Early Function Pass ----------------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// This pass lowers coroutine intrinsics that hide the details of the exact		// This pass lowers coroutine intrinsics that hide the details of the exact
// calling convention for coroutine resume and destroy functions and details of		// calling convention for coroutine resume and destroy functions and details of
// the structure of the coroutine frame.		// the structure of the coroutine frame.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "CoroInternal.h"		#include "CoroInternal.h"
#include "llvm/IR/CallSite.h"		#include "llvm/IR/CallSite.h"
		#include "llvm/IR/Dominators.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/InstIterator.h"		#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Module.h"		#include "llvm/IR/Module.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
		#include "llvm/Transforms/Utils/PromoteMemToReg.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "coro-early"		#define DEBUG_TYPE "coro-early"

namespace {		namespace {
// Created on demand if CoroEarly pass has work to do.		// Created on demand if CoroEarly pass has work to do.
class Lowerer : public coro::LowererBase {		class Lowerer : public coro::LowererBase {
IRBuilder<> Builder;		IRBuilder<> Builder;
PointerType *const AnyResumeFnPtrTy;		PointerType *const AnyResumeFnPtrTy;

void lowerResumeOrDestroy(CallSite CS, CoroSubFnInst::ResumeKind);		void lowerResumeOrDestroy(CallSite CS, CoroSubFnInst::ResumeKind);
void lowerCoroPromise(CoroPromiseInst *Intrin);		void lowerCoroPromise(CoroPromiseInst *Intrin);
void lowerCoroDone(IntrinsicInst *II);		void lowerCoroDone(IntrinsicInst *II);

public:		public:
Lowerer(Module &M)		Lowerer(Module &M)
: LowererBase(M), Builder(Context),		: LowererBase(M), Builder(Context),
AnyResumeFnPtrTy(FunctionType::get(Type::getVoidTy(Context), Int8Ptr,		AnyResumeFnPtrTy(FunctionType::get(Type::getVoidTy(Context), Int8Ptr,
/isVarArg=/false)		/isVarArg=/false)
->getPointerTo()) {}		->getPointerTo()) {}
bool lowerEarlyIntrinsics(Function &F);		bool lowerEarlyIntrinsics(Function &F, DominatorTree &DT);
};		};
}		}

// Replace a direct call to coro.resume or coro.destroy with an indirect call to		// Replace a direct call to coro.resume or coro.destroy with an indirect call to
// an address returned by coro.subfn.addr intrinsic. This is done so that		// an address returned by coro.subfn.addr intrinsic. This is done so that
// CGPassManager recognizes devirtualization when CoroElide pass replaces a call		// CGPassManager recognizes devirtualization when CoroElide pass replaces a call
// to coro.subfn.addr with an appropriate function address.		// to coro.subfn.addr with an appropriate function address.
void Lowerer::lowerResumeOrDestroy(CallSite CS,		void Lowerer::lowerResumeOrDestroy(CallSite CS,
▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	void Lowerer::lowerCoroDone(IntrinsicInst *II) {
auto *Gep = Builder.CreateConstInBoundsGEP1_32(FrameTy, BCI, 0);		auto *Gep = Builder.CreateConstInBoundsGEP1_32(FrameTy, BCI, 0);
auto *Load = Builder.CreateLoad(Gep);		auto *Load = Builder.CreateLoad(Gep);
auto *Cond = Builder.CreateICmpEQ(Load, NullPtr);		auto *Cond = Builder.CreateICmpEQ(Load, NullPtr);

II->replaceAllUsesWith(Cond);		II->replaceAllUsesWith(Cond);
II->eraseFromParent();		II->eraseFromParent();
}		}

		static bool allocaOnlyStoresConstantInts(AllocaInst *AI) {
		for (User *U : AI->users())
		if (auto *SI = dyn_cast<StoreInst>(U))
		if (!isa<ConstantInt>(SI->getValueOperand()))
		return false;

		return true;
		}

		// Allocas that follow the pattern of cleanup.dest.slot, namely, they are
		// integer allocas that don't escape and only store constants. We don't want
		// them to be saved into the coroutine frame (as some of the cleanup code will
		// access them after coroutine frame destroyed).
		//
		// PromiseAlloca should be always stored in the coroutine frame even if its
		// usage pattern only consists of constant stores.
		static void
		promoteCleanupSlotLikeAllocasToRegisters(Function &F, DominatorTree& DT,
		AllocaInst *PromiseAlloca) {
		SmallVector<AllocaInst *, 2> Allocas;
		BasicBlock &BB = F.getEntryBlock(); // Get the entry node for the function

		// Find allocas that are safe to promote, by looking at all non-terminator
		// instructions in the entry node.
		for (BasicBlock::iterator I = BB.begin(), E = --BB.end(); I != E; ++I)
		if (auto *AI = dyn_cast<AllocaInst>(I))
		if (AI != PromiseAlloca && AI->getAllocatedType()->isIntegerTy() &&
		isAllocaPromotable(AI) && allocaOnlyStoresConstantInts(AI))
		Allocas.push_back(AI);

		// None found, nothing to do.
		if (Allocas.empty())
		return;

		PromoteMemToReg(Allocas, DT);
		}

// Prior to CoroSplit, calls to coro.begin needs to be marked as NoDuplicate,		// Prior to CoroSplit, calls to coro.begin needs to be marked as NoDuplicate,
// as CoroSplit assumes there is exactly one coro.begin. After CoroSplit,		// as CoroSplit assumes there is exactly one coro.begin. After CoroSplit,
// NoDuplicate attribute will be removed from coro.begin otherwise, it will		// NoDuplicate attribute will be removed from coro.begin otherwise, it will
// interfere with inlining.		// interfere with inlining.
static void setCannotDuplicate(CoroIdInst *CoroId) {		static void setCannotDuplicate(CoroIdInst *CoroId) {
for (User *U : CoroId->users())		for (User *U : CoroId->users())
if (auto *CB = dyn_cast<CoroBeginInst>(U))		if (auto *CB = dyn_cast<CoroBeginInst>(U))
CB->setCannotDuplicate();		CB->setCannotDuplicate();
}		}

bool Lowerer::lowerEarlyIntrinsics(Function &F) {		bool Lowerer::lowerEarlyIntrinsics(Function &F, DominatorTree &DT) {
bool Changed = false;		bool Changed = false;
CoroIdInst *CoroId = nullptr;		CoroIdInst *CoroId = nullptr;
SmallVector<CoroFreeInst *, 4> CoroFrees;		SmallVector<CoroFreeInst *, 4> CoroFrees;
for (auto IB = inst_begin(F), IE = inst_end(F); IB != IE;) {		for (auto IB = inst_begin(F), IE = inst_end(F); IB != IE;) {
Instruction &I = *IB++;		Instruction &I = *IB++;
if (auto CS = CallSite(&I)) {		if (auto CS = CallSite(&I)) {
switch (CS.getIntrinsicID()) {		switch (CS.getIntrinsicID()) {
default:		default:
Show All 17 Lines	if (auto CS = CallSite(&I)) {
// Mark a function that comes out of the frontend that has a coro.id		// Mark a function that comes out of the frontend that has a coro.id
// with a coroutine attribute.		// with a coroutine attribute.
if (auto *CII = cast<CoroIdInst>(&I)) {		if (auto *CII = cast<CoroIdInst>(&I)) {
if (CII->getInfo().isPreSplit()) {		if (CII->getInfo().isPreSplit()) {
F.addFnAttr(CORO_PRESPLIT_ATTR, UNPREPARED_FOR_SPLIT);		F.addFnAttr(CORO_PRESPLIT_ATTR, UNPREPARED_FOR_SPLIT);
setCannotDuplicate(CII);		setCannotDuplicate(CII);
CII->setCoroutineSelf();		CII->setCoroutineSelf();
CoroId = cast<CoroIdInst>(&I);		CoroId = cast<CoroIdInst>(&I);
		// Eliminate cleanup.slot-like allocas. Required for O0, since
		// cleanup slot allocas are accessed (potentially) after coroutine
		// frame is destroyed.
		promoteCleanupSlotLikeAllocasToRegisters(F, DT,
		CoroId->getPromise());
}		}
}		}
break;		break;
case Intrinsic::coro_resume:		case Intrinsic::coro_resume:
lowerResumeOrDestroy(CS, CoroSubFnInst::ResumeIndex);		lowerResumeOrDestroy(CS, CoroSubFnInst::ResumeIndex);
break;		break;
case Intrinsic::coro_destroy:		case Intrinsic::coro_destroy:
lowerResumeOrDestroy(CS, CoroSubFnInst::DestroyIndex);		lowerResumeOrDestroy(CS, CoroSubFnInst::DestroyIndex);
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	if (coro::declaresIntrinsics(M, {"llvm.coro.id", "llvm.coro.destroy",
L = llvm::make_unique<Lowerer>(M);		L = llvm::make_unique<Lowerer>(M);
return false;		return false;
}		}

bool runOnFunction(Function &F) override {		bool runOnFunction(Function &F) override {
if (!L)		if (!L)
return false;		return false;

return L->lowerEarlyIntrinsics(F);		auto &DTWP = getAnalysis<DominatorTreeWrapperPass>();
		return L->lowerEarlyIntrinsics(F, DTWP.getDomTree());
}		}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
		AU.addRequired<DominatorTreeWrapperPass>();
}		}
StringRef getPassName() const override {		StringRef getPassName() const override {
return "Lower early coroutine intrinsics";		return "Lower early coroutine intrinsics";
}		}
};		};
}		}

char CoroEarly::ID = 0;		char CoroEarly::ID = 0;
INITIALIZE_PASS(CoroEarly, "coro-early", "Lower early coroutine intrinsics",		INITIALIZE_PASS_BEGIN(CoroEarly, "coro-early",
		"Lower early coroutine intrinsics", false, false)
		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
		INITIALIZE_PASS_END(CoroEarly, "coro-early", "Lower early coroutine intrinsics",
false, false)		false, false)

Pass *llvm::createCoroEarlyPass() { return new CoroEarly(); }		Pass *llvm::createCoroEarlyPass() { return new CoroEarly(); }

test/Transforms/Coroutines/coro-cleanup-dest-promote.ll

This file was added.

				; Check that we do not spill cleanup-slot like allocas.
				; RUN: opt < %s -coro-early -S \| FileCheck %s

				define i8* @happy.case() {
				entry:
				%slot = alloca i32
				%id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
				%size = call i32 @llvm.coro.size.i32()
				%alloc = call i8* @malloc(i32 %size)
				%hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)

				store i32 1, i32* %slot

				%tok = call i8 @llvm.coro.suspend(token none, i1 false)
				switch i8 %tok, label %suspend [i8 0, label %resume
				i8 1, label %cleanup]
				resume:
				store i32 2, i32* %slot
				br label %cleanup

				cleanup:
				%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
				call void @free(i8* %mem)
				%x = load i32, i32* %slot
				call void @print.i32(i32 %x)
				br label %suspend
				suspend:
				call i1 @llvm.coro.end(i8* %hdl, i1 0)
				ret i8* %hdl
				}

				; See that we got a PHI with constants in the happy case
				; CHECK-LABEL: define i8* @happy.case(
				; CHECK: %[[SLOT:.+]] = phi i32 [ 1, %entry ], [ 2, %resume ]
				; CHECK: call void @free(i8* {{.+}})
				; CHECK: call void @print.i32(i32 %[[SLOT]])
				; CHECK: ret i8*

				define i8* @escaped.case() {
				entry:
				%slot = alloca i32
				%id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
				%size = call i32 @llvm.coro.size.i32()
				%alloc = call i8* @malloc(i32 %size)
				%hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)

				store i32 1, i32* %slot
				call void @escape(i32* %slot)

				%tok = call i8 @llvm.coro.suspend(token none, i1 false)
				switch i8 %tok, label %suspend [i8 0, label %resume
				i8 1, label %cleanup]
				resume:
				store i32 2, i32* %slot
				br label %cleanup

				cleanup:
				%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
				%x = load i32, i32* %slot
				call void @print.i32(i32 %x)
				call void @free(i8* %mem)
				br label %suspend
				suspend:
				call i1 @llvm.coro.end(i8* %hdl, i1 0)
				ret i8* %hdl
				}

				; See that we kept the load when the alloca escaped
				; CHECK-LABEL: define i8* @escaped.case(
				; CHECK: %[[SLOT:.+]] = load i32, i32* %{{.+}}
				; CHECK: call void @print.i32(i32 %[[SLOT]])
				; CHECK: ret i8*

				define i8* @promise.case() {
				entry:
				%slot = alloca i32
				%pv = bitcast i32* %slot to i8*
				%id = call token @llvm.coro.id(i32 0, i8* %pv, i8* null, i8* null)
				%size = call i32 @llvm.coro.size.i32()
				%alloc = call i8* @malloc(i32 %size)
				%hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)

				%tok = call i8 @llvm.coro.suspend(token none, i1 false)
				switch i8 %tok, label %suspend [i8 0, label %resume
				i8 1, label %cleanup]
				resume:
				store i32 2, i32* %slot
				br label %cleanup

				cleanup:
				%x = load i32, i32* %slot
				call void @print.i32(i32 %x)
				%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
				call void @free(i8* %mem)
				br label %suspend
				suspend:
				call i1 @llvm.coro.end(i8* %hdl, i1 0)
				ret i8* %hdl
				}

				; See that we kept the load when the alloca is from promise
				; CHECK-LABEL: define i8* @promise.case(
				; CHECK: %[[SLOT:.+]] = load i32, i32* %{{.+}}
				; CHECK: call void @print.i32(i32 %[[SLOT]])
				; CHECK: ret i8*

				declare i8* @llvm.coro.free(token, i8*)
				declare i32 @llvm.coro.size.i32()
				declare i8 @llvm.coro.suspend(token, i1)

				declare token @llvm.coro.id(i32, i8, i8, i8*)
				declare i1 @llvm.coro.alloc(token)
				declare i8* @llvm.coro.begin(token, i8*)
				declare i1 @llvm.coro.end(i8*, i1)

				declare noalias i8* @malloc(i32)
				declare void @print.i32(i32)
				declare void @escape(i32*)
				declare void @free(i8*)