This is an archive of the discontinued LLVM Phabricator instance.

CoroFrame: Rework SuspendCrossingInfo analysis
AbandonedPublic

Authored by MatzeB on Aug 1 2023, 4:07 PM.

Details

Summary

This changes the dataflow solver into a two-phase approach:

  • In a first pass, visit every block in reverse post-order, which is a topological ordering for loop-free code. Put "backedges" into a worklist.
  • Re-visit the blocks on the worklist with a worklist dataflow algorithm until a fixpoint is reached.

This should be optimal for loop-free code and lead to faster convergence when loops are present (compared to the current code). A rough sketch of the scheme is shown below.
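
For illustration, a rough sketch of the two-phase scheme (a sketch only; transfer() and the other names below are placeholders, not the identifiers used in this patch):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h" // ReversePostOrderTraversal
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/CFG.h"                // predecessors(), successors()
#include "llvm/IR/Function.h"
using namespace llvm;

// Assumed helper: merges the predecessors' state into BB and applies BB's
// local effect; returns true if BB's state changed.
bool transfer(BasicBlock *BB);

void solveTwoPhase(Function &F) {
  ReversePostOrderTraversal<Function *> RPOT(&F);
  DenseMap<BasicBlock *, unsigned> RPONumber;
  unsigned Num = 0;
  for (BasicBlock *BB : RPOT)
    RPONumber[BB] = Num++;

  // Phase 1: one sweep in reverse post-order. For loop-free code this is a
  // topological order, so every predecessor is final before its successors
  // are visited and a single pass suffices.
  SmallVector<BasicBlock *, 16> Worklist;
  for (BasicBlock *BB : RPOT) {
    transfer(BB);
    for (BasicBlock *Pred : predecessors(BB))
      // A predecessor with an equal or higher RPO number has not been
      // processed yet, i.e. Pred->BB is a backedge: queue BB for re-visiting.
      if (RPONumber.lookup(Pred) >= RPONumber.lookup(BB))
        Worklist.push_back(BB);
  }

  // Phase 2: classic worklist iteration until a fixpoint is reached. Only
  // blocks affected by backedges are ever re-queued.
  while (!Worklist.empty()) {
    BasicBlock *BB = Worklist.pop_back_val();
    if (transfer(BB))
      for (BasicBlock *Succ : successors(BB))
        Worklist.push_back(Succ); // a real solver would deduplicate entries
  }
}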

This is inspired by https://reviews.llvm.org/D154695

Performance of the artificial test case in https://github.com/llvm/llvm-project/issues/62348 without this patch:

n: 20000
6.27user 0.22system 0:06.50elapsed 99%CPU (0avgtext+0avgdata 498664maxresident)k
0inputs+8824outputs (0major+115980minor)pagefaults 0swaps

n: 40000
18.21user 0.87system 0:19.10elapsed 99%CPU (0avgtext+0avgdata 1722016maxresident)k
0inputs+17576outputs (0major+425395minor)pagefaults 0swaps

n: 60000
35.72user 1.74system 0:37.47elapsed 99%CPU (0avgtext+0avgdata 3742560maxresident)k
0inputs+26456outputs (0major+936001minor)pagefaults 0swaps

n: 80000
58.58user 3.06system 1:01.65elapsed 99%CPU (0avgtext+0avgdata 6513700maxresident)k
0inputs+35328outputs (0major+1631166minor)pagefaults 0swaps

n: 100000
82.78user 4.64system 1:27.44elapsed 99%CPU (0avgtext+0avgdata 10117168maxresident)k
0inputs+44016outputs (0major+2552478minor)pagefaults 0swaps

with this change:

n: 20000
3.59user 0.22system 0:03.82elapsed 99%CPU (0avgtext+0avgdata 498804maxresident)k
0inputs+9016outputs (0major+119143minor)pagefaults 0swaps

n: 40000
8.57user 0.81system 0:09.39elapsed 99%CPU (0avgtext+0avgdata 1722108maxresident)k
0inputs+17832outputs (0major+431756minor)pagefaults 0swaps

n: 60000
12.70user 1.79system 0:14.50elapsed 99%CPU (0avgtext+0avgdata 3742640maxresident)k
0inputs+26328outputs (0major+945180minor)pagefaults 0swaps

n: 80000
19.23user 3.07system 0:22.32elapsed 99%CPU (0avgtext+0avgdata 6513620maxresident)k
0inputs+35200outputs (0major+1643901minor)pagefaults 0swaps

n: 100000
22.48user 4.71system 0:27.20elapsed 99%CPU (0avgtext+0avgdata 10117160maxresident)k
0inputs+43952outputs (0major+2571138minor)pagefaults 0swaps

and with https://reviews.llvm.org/D154695, which performs the same within measurement noise:

n: 20000
3.84user 0.22system 0:04.07elapsed 99%CPU (0avgtext+0avgdata 498536maxresident)k
0inputs+8824outputs (0major+116001minor)pagefaults 0swaps

n: 40000
8.33user 0.83system 0:09.18elapsed 99%CPU (0avgtext+0avgdata 1721764maxresident)k
0inputs+17704outputs (0major+425381minor)pagefaults 0swaps

n: 60000
12.68user 1.78system 0:14.47elapsed 99%CPU (0avgtext+0avgdata 3742440maxresident)k
0inputs+26328outputs (0major+936004minor)pagefaults 0swaps

n: 80000
18.10user 3.02system 0:21.13elapsed 99%CPU (0avgtext+0avgdata 6513456maxresident)k
0inputs+35328outputs (0major+1631150minor)pagefaults 0swaps

n: 100000
22.15user 4.56system 0:26.73elapsed 99%CPU (0avgtext+0avgdata 10116828maxresident)k
0inputs+43824outputs (0major+2552466minor)pagefaults 0swaps

Diff Detail

Event Timeline

MatzeB created this revision.Aug 1 2023, 4:07 PM
Herald added a project: Restricted Project.Aug 1 2023, 4:07 PM
MatzeB requested review of this revision.Aug 1 2023, 4:07 PM
Herald added a project: Restricted Project.Aug 1 2023, 4:07 PM
MatzeB added a comment.Aug 1 2023, 4:08 PM

Publishing an alternative to D154695 here. I got somewhat nerd-sniped yesterday trying to prove that RPOs work perfectly fine and lead to clean code...

MatzeB edited the summary of this revision.Aug 1 2023, 4:13 PM
MatzeB edited the summary of this revision.Aug 1 2023, 4:21 PM
MatzeB edited the summary of this revision.
MatzeB edited the summary of this revision.

Looks pretty good. Thanks.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
217

Is it better to take only one BasicBlock argument? Then we can get the BlockData and the block index from Mapping and Block.
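
A hypothetical fragment illustrating that suggestion, assuming the usual Mapping/Block members of SuspendCrossingInfo (not the actual patch code):

// Hypothetical: pass only the BasicBlock and derive the rest from it.
BlockData &getBlockData(BasicBlock *BB) {
  size_t Index = Mapping.blockToIndex(BB); // block index via the mapping
  return Block[Index];                     // BlockData via the index
}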

witstorm95 added a comment.EditedAug 1 2023, 7:55 PM

Publishing an alternative to D154695 here. I got somewhat nerd-sniped yesterday trying to prove that RPOs work perfectly fine and lead to clean code...

It works well on the test case in https://github.com/llvm/llvm-project/issues/62348. But if we add some loops to generate a more complex CFG, it takes a long time in the SuspendCrossingInfo analysis. The modification to gen2.cpp is as follows:

#include <cassert>
#include <cstdio>
#include <cstdlib>

// Emits a coroutine whose body is a loop containing n conditional branches.
int main(int argc, char **argv) {
  assert(argc == 2);
  int n = atoi(argv[1]);
  printf("#include <cstdio>\n");
  printf("#include <vector>\n");
  printf("#include \"task.h\"\n");
  printf("extern bool count(int);\n");
  printf("task t() {\n");
  printf("std::vector<int> v(%d);\n", n);
  printf("int val = 0;\n");
  printf("for (int i = 0; i < %d; ++i) {;\n", n);
  for (int i = 1; i <= n; i++) printf("if (count(v[i])) val++;\n");
  printf("}\n");
  printf("printf(\"%%d\\n\", val);\n");
  printf("co_return;\n");
  printf("}\n");
  return 0;
}

It takes 9 mins when n = 20000.

https://reviews.llvm.org/D156850
I created another NFC patch to fix this issue. The results are:

n: 20000
6.29user 0.43system 0:06.72elapsed 99%CPU (0avgtext+0avgdata 1065792maxresident)k
0inputs+16816outputs (0major+256284minor)pagefaults 0swaps

n: 40000
13.17user 1.14system 0:14.31elapsed 99%CPU (0avgtext+0avgdata 3787144maxresident)k
0inputs+33536outputs (0major+932693minor)pagefaults 0swaps

n: 60000
20.32user 2.16system 0:22.50elapsed 99%CPU (0avgtext+0avgdata 8262256maxresident)k
0inputs+50256outputs (0major+2045494minor)pagefaults 0swaps

n: 80000
29.12user 3.85system 0:32.98elapsed 99%CPU (0avgtext+0avgdata 14499404maxresident)k
8inputs+66976outputs (0major+3596571minor)pagefaults 0swaps

n: 100000
37.64user 6.01system 0:43.65elapsed 99%CPU (0avgtext+0avgdata 22475628maxresident)k
0inputs+83696outputs (0major+5582433minor)pagefaults 0swaps
MatzeB added a comment.Aug 2 2023, 1:38 PM

Curious... With a single loop (and no loop nesting) I would have expected the worklist algorithm to converge quickly enough (maybe around double the time of the loop-free program in practice). I would still be interested to learn why this goes wrong, as I think of the worklist algorithm as the standard way to solve dataflow problems...

Though I'm also happy to go with @witstorm95's change, as I probably shouldn't spend much more time on this now that our builds are restored.

MatzeB added a comment.Aug 2 2023, 3:21 PM

I checked, and the number of blocks visited only increases linearly, so that part is as expected. There are some effects leading to quadratic compile time because the Consume/Kill BitVectors also get bigger with the number of basic blocks, but we can't avoid that and it is independent of the dataflow solving order.
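
A short worked estimate of that unavoidable quadratic part (B is the number of basic blocks; the factor of two is just the two bit vectors per block):

\[
B \;\text{blocks} \times 2B \;\text{bits (Consume + Kill)} = 2B^2 \;\text{bits of state},
\]

so a single sweep that touches every bit is already \(\Omega(B^2)\), regardless of the solving order.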

MatzeB added a comment.Aug 2 2023, 4:15 PM

I finally understand what happened. It turns out the runtime of the worklist algorithm on the loop variant is dominated by one exception-handling block that has N predecessors for a genX input of size N. We updated each of those N predecessors, and each time we re-visited that block and merged the info from all N predecessors. Not sure what to make of this... it is surely partly an artifact of these artificially big inputs, but we happen to avoid the problem by just doing multiple rounds of RPO over the whole function, as in https://reviews.llvm.org/D156850, so let's go with that.
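
As a rough worked count of that effect (N predecessors of the one exception-handling block, as described above):

\[
\underbrace{N \;\text{re-visits}}_{\text{one per predecessor update}} \times \underbrace{N \;\text{predecessor sets merged per visit}}_{\text{IN recomputed from scratch}} = N^2 \;\text{merge operations},
\]

with each merge itself working on bit vectors whose size also grows with N.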

MatzeB added a comment.Aug 2 2023, 4:23 PM

On another side note: the quadratic behavior for blocks with multiple predecessors would not happen in a "by-the-book" dataflow implementation where the IN and OUT sets are stored separately. The fact that we compute the IN set on the fly here, but don't store it in the BlockData, led to the runtime explosion with the worklist algorithm...
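
A minimal sketch of that "by-the-book" variant (the struct and field names are illustrative, not the actual BlockData layout): caching the merged IN set per block means an update from one changed predecessor only ORs that predecessor's OUT set into the cache, instead of re-merging all N predecessors.

#include "llvm/ADT/BitVector.h"
using namespace llvm;

// Illustrative per-block state with the IN set stored explicitly.
struct BlockStateWithIn {
  BitVector In;  // cached union of all predecessors' Out sets
  BitVector Out; // In plus this block's local effect
};

// Called when one predecessor's Out set changed: OR it into the cached In
// set. The work is proportional to one bit vector, independent of how many
// predecessors the block has (the sets only grow, so this stays correct).
bool mergeOnePred(BlockStateWithIn &Succ, const BitVector &PredOut) {
  BitVector OldIn = Succ.In;
  Succ.In |= PredOut;
  return Succ.In != OldIn; // re-queue the successor only if anything changed
}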

witstorm95 added a comment.EditedAug 4 2023, 10:26 PM

The re-visit order is still important. I mean, it needs to keep the RPO order too. So we can use a priority queue instead of a deque for the worklist; then the previous problem is solved.

I think this advice is enough. Maybe I should create another patch to show my idea?
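
A small sketch of that idea (illustrative only; the transfer callback and helper names are assumptions): keep the worklist as a priority queue keyed by RPO number, so blocks are always re-visited in RPO order.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include <queue>
#include <vector>
using namespace llvm;

// Process worklist entries in RPO order so a block is (re)visited after the
// predecessors queued in the same round, restoring fast convergence.
void solveWithRPOQueue(Function &F, bool (*transfer)(BasicBlock *)) {
  DenseMap<BasicBlock *, unsigned> RPONumber;
  unsigned Num = 0;
  for (BasicBlock *BB : ReversePostOrderTraversal<Function *>(&F))
    RPONumber[BB] = Num++;

  auto Greater = [&](BasicBlock *A, BasicBlock *B) {
    return RPONumber.lookup(A) > RPONumber.lookup(B);
  };
  // Min-heap on RPO number: the earliest block in RPO order comes out first.
  std::priority_queue<BasicBlock *, std::vector<BasicBlock *>,
                      decltype(Greater)>
      Worklist(Greater);

  for (BasicBlock &BB : F)
    Worklist.push(&BB);
  while (!Worklist.empty()) {
    BasicBlock *BB = Worklist.top();
    Worklist.pop();
    if (transfer(BB)) // returns true if BB's state changed
      for (BasicBlock *Succ : successors(BB))
        Worklist.push(Succ); // a real solver would also deduplicate entries
  }
}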