This is an archive of the discontinued LLVM Phabricator instance.

Differential D156850

[NFC][Coroutines] Use a reverse post-order to guide the computation about cross suspend infomation to reach a fixed point faster.
ClosedPublic

Authored by witstorm95 on Aug 1 2023, 10:33 PM.

Details

Reviewers

ChuanqiXu
MatzeB
GorNishanov

Commits

rG132bb5cc5fd5: [NFC][Coroutines] Use a reverse post-order to guide the computation about cross…

Summary

Fixed https://github.com/llvm/llvm-project/issues/62348

Propagate cross suspend point information along reverse post-order.
It does not modify the original function, just selects a better traversal order.
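As a rough illustration of why traversal order matters for a forward propagation like this one, here is a minimal sketch on a toy control-flow graph. The names (`Graph`, `reversePostOrder`, `sweepsToFixedPoint`, `makeChain`) are hypothetical and do not come from CoroFrame.cpp; the "reaches" set is only a stand-in for the real Consumes/Kills bit vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical toy CFG: Succs[b] lists the successors of block b and block 0
// is the entry. None of these names come from CoroFrame.cpp.
using Graph = std::vector<std::vector<int>>;

// Blocks reachable from the entry, in reverse post-order of a DFS.
std::vector<int> reversePostOrder(const Graph &Succs) {
  assert(!Succs.empty() && "need at least an entry block");
  std::vector<bool> Seen(Succs.size(), false);
  std::vector<int> Post;
  std::function<void(int)> Visit = [&](int B) {
    Seen[B] = true;
    for (int S : Succs[B])
      if (!Seen[S])
        Visit(S);
    Post.push_back(B);
  };
  Visit(0);
  std::reverse(Post.begin(), Post.end());
  return Post;
}

// Round-robin forward propagation: Reach[b] collects every block that can
// reach b (b included). Returns the number of sweeps over Order, counting the
// final sweep that only confirms the fixed point.
int sweepsToFixedPoint(const Graph &Succs, const std::vector<int> &Order) {
  const std::size_t N = Succs.size();
  std::vector<std::vector<int>> Preds(N);
  for (std::size_t B = 0; B < N; ++B)
    for (int S : Succs[B])
      Preds[S].push_back(static_cast<int>(B));

  std::vector<std::vector<bool>> Reach(N, std::vector<bool>(N, false));
  for (std::size_t B = 0; B < N; ++B)
    Reach[B][B] = true;

  int Sweeps = 0;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    ++Sweeps;
    for (int B : Order)
      for (int P : Preds[B])
        for (std::size_t I = 0; I < N; ++I)
          if (Reach[P][I] && !Reach[B][I]) {
            Reach[B][I] = true;
            Changed = true;
          }
  }
  return Sweeps;
}

// A straight-line chain 0 -> 1 -> ... -> N-1.
Graph makeChain(int N) {
  Graph Succs(N);
  for (int B = 0; B + 1 < N; ++B)
    Succs[B].push_back(B + 1);
  return Succs;
}
```

On a six-block chain, sweeping in reverse post-order needs two sweeps (one to propagate, one to confirm), while sweeping the same blocks in the opposite order needs six, because information advances only one block per sweep. Graphs with loops need extra sweeps in either order.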

Before the patch:

n: 20000
4.31user 0.11system 0:04.44elapsed 99%CPU (0avgtext+0avgdata 552352maxresident)k
0inputs+8848outputs (0major+126254minor)pagefaults 0swaps

n: 40000
11.24user 0.40system 0:11.66elapsed 99%CPU (0avgtext+0avgdata 1788404maxresident)k
0inputs+17600outputs (0major+431105minor)pagefaults 0swaps

n: 60000
21.65user 0.96system 0:22.62elapsed 99%CPU (0avgtext+0avgdata 3809836maxresident)k
0inputs+26352outputs (0major+934749minor)pagefaults 0swaps

n: 80000
37.05user 1.53system 0:38.58elapsed 99%CPU (0avgtext+0avgdata 6602396maxresident)k
0inputs+35096outputs (0major+1622584minor)pagefaults 0swaps

n: 100000
51.87user 2.67system 0:54.54elapsed 99%CPU (0avgtext+0avgdata 10210736maxresident)k
0inputs+43848outputs (0major+2518945minor)pagefaults 0swaps

After the patch:

n: 20000
3.08user 0.12system 0:03.21elapsed 99%CPU (0avgtext+0avgdata 551012maxresident)k
0inputs+8848outputs (0major+129349minor)pagefaults 0swaps

n: 40000
5.88user 0.33system 0:06.22elapsed 99%CPU (0avgtext+0avgdata 1789248maxresident)k
0inputs+17600outputs (0major+435096minor)pagefaults 0swaps

n: 60000
8.84user 0.77system 0:09.63elapsed 99%CPU (0avgtext+0avgdata 3807800maxresident)k
0inputs+26352outputs (0major+939119minor)pagefaults 0swaps

n: 80000
11.64user 1.58system 0:13.23elapsed 99%CPU (0avgtext+0avgdata 6604708maxresident)k
0inputs+35096outputs (0major+1629566minor)pagefaults 0swaps

n: 100000
15.21user 2.56system 0:17.79elapsed 99%CPU (0avgtext+0avgdata 10208828maxresident)k
8inputs+43848outputs (0major+2526611minor)pagefaults 0swaps
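
To make the trend in these figures easier to read, the user-time ratios can be computed directly. The sketch below only restates the numbers above; `Sample`, `Samples` and `speedup` are illustrative names. The ratio grows from roughly 1.4x at n = 20000 to roughly 3.4x at n = 100000, while the peak resident size is nearly identical before and after.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// User-time figures (seconds) copied from the before/after runs above.
struct Sample {
  int N;
  double BeforeUser;
  double AfterUser;
};

const std::array<Sample, 5> Samples = {{{20000, 4.31, 3.08},
                                        {40000, 11.24, 5.88},
                                        {60000, 21.65, 8.84},
                                        {80000, 37.05, 11.64},
                                        {100000, 51.87, 15.21}}};

// Ratio of user time before the patch to user time after it.
double speedup(const Sample &S) {
  assert(S.AfterUser > 0.0);
  return S.BeforeUser / S.AfterUser;
}
```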

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

witstorm95 created this revision. Aug 1 2023, 10:33 PM

Herald added a project: Restricted Project. Aug 1 2023, 10:33 PM

Herald added subscribers: ChuanqiXu, hiraditya.

witstorm95 requested review of this revision. Aug 1 2023, 10:33 PM

Herald added a subscriber: llvm-commits.

witstorm95 added reviewers: ChuanqiXu, MatzeB, GorNishanov. Aug 1 2023, 10:35 PM

This reads much better! Thanks.

How long does compilation take with the patch for the following case?

#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
  assert(argc == 2);
  int n = atoi(argv[1]);
  printf("#include <cstdio>\n");
  printf("#include <vector>\n");
  printf("#include \"task.h\"\n");
  printf("extern bool count(int);\n");
  printf("task t() {\n");
  printf("std::vector<int> v(%d);\n", n);
  printf("int val = 0;\n");
  printf("for (int i = 0; i < %d; ++i) {;\n", n);
  for (int i = 1; i <= n; i++) printf("if (count(v[i])) val++;\n");
  printf("}\n");
  printf("printf(\"%%d\\n\", val);\n");
  printf("co_return;\n");
  printf("}\n");
  return 0;
}

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
117–119	The comment reads slightly odd. Also, I feel it may not be necessary to require RPO in `computeBlockData` since it simply iterates over the given range. So the following signature may be better: `template <bool Initialize = false, class BBRangeTy> bool computeBlockData(const BBRangeTy &BBRange);`
328–333	Maybe it is better to say: it is considered to be faster to use RPO traversal for forward-edges dataflow analysis. It may be best to cite this, but that may not be required.

@ChuanqiXu The compilation times are as follows:

n: 20000
6.25user 0.24system 0:06.50elapsed 99%CPU (0avgtext+0avgdata 1065512maxresident)k
0inputs+16816outputs (0major+256090minor)pagefaults 0swaps

n: 40000
12.75user 0.93system 0:13.68elapsed 99%CPU (0avgtext+0avgdata 3784604maxresident)k
8inputs+33536outputs (0major+931086minor)pagefaults 0swaps

n: 60000
19.73user 2.11system 0:21.86elapsed 99%CPU (0avgtext+0avgdata 8261644maxresident)k
0inputs+50256outputs (0major+2045101minor)pagefaults 0swaps

n: 80000
28.88user 3.73system 0:32.63elapsed 99%CPU (0avgtext+0avgdata 14499536maxresident)k
0inputs+66976outputs (0major+3597496minor)pagefaults 0swaps

n: 100000
37.95user 5.84system 0:43.79elapsed 99%CPU (0avgtext+0avgdata 22475716maxresident)k
0inputs+83696outputs (0major+5583555minor)pagefaults 0swaps

Accept these suggestions.

LGTM then. Thanks. Let's wait for @MatzeB.

This revision is now accepted and ready to land. Aug 2 2023, 12:00 AM

Harbormaster completed remote builds in B249679: Diff 546328. Aug 2 2023, 12:25 AM

MatzeB added inline comments. Aug 2 2023, 2:05 PM

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–230	Just put `ReversePostOrderTraversal<Function *>&` here. Adding a template parameter that only ever gets the same type just adds needless complexity.
284–285	Do we actually need `B.Changed`? It seems to me we only write it and immediately read from it here, so a local variable would do instead?

MatzeB added inline comments. Aug 2 2023, 2:11 PM

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
240–245	Oh, I guess `B.Changed` is read here. But how does this work? Isn't `Changed` always `false` after the `Initialize == true` round? Wouldn't we just skip processing any block in the 2nd round then because of this?

MatzeB mentioned this in D156835: CoroFrame: Rework SuspendCrossingInfo analysis. Aug 2 2023, 4:15 PM

MatzeB added inline comments. Aug 2 2023, 4:18 PM

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
236–237	Well, we should definitely check the predecessors even during initialization now; in RPO order we should have nearly all of them computed already (everything except for backedges).

@MatzeB Sorry it took so long to get back to you; I have not had time to work on this.

As you can see, there are still many redundant operations. For example, the last iteration only checks whether a fixed point has been reached, and so on.

If we want to remove these redundant operations, there are still ways to do it. I mean we can do it based on your previous patch (https://reviews.llvm.org/D156835). I will give you some advice on improving your patch. Let's jump to D156835.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–230	I guess BBRangeTy was used just to make computeBlockData more general. It means you can specify another BBRange to guide the computation without needing to modify this again.
236–237	Every block is marked `Changed` as true before the initial iteration. So we don't need to check whether the predecessors have changed (they always have) during initialization.
240–245	In the initial iteration (`Initialize == true`), we do not update `Changed`.
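
The interaction discussed in these replies (every block starts with `Changed` set, the `Initialize` pass never writes it, and later passes skip a block only when no predecessor is marked changed) can be sketched on a simplified model. This is an illustrative sketch only: `ToyBlock`, `toyPass` and `makeToyChain` are hypothetical names, and only a single `Consumes`-like set is tracked, whereas the real BlockData also carries Kills, Suspend, End and KillLoop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the per-block state discussed above.
struct ToyBlock {
  std::vector<int> Preds;     // indices of predecessor blocks
  std::vector<bool> Consumes; // Consumes[i]: block i reaches this block
  bool Changed = true;        // set before the first pass
};

// One pass over Order. Returns true if any block's set grew. With
// Initialize == true the skip test is omitted and Changed is never written.
// Visited counts the blocks that were actually recomputed.
template <bool Initialize = false>
bool toyPass(std::vector<ToyBlock> &Blocks, const std::vector<int> &Order,
             int &Visited) {
  bool AnyChanged = false;
  for (int BBNo : Order) {
    ToyBlock &B = Blocks[BBNo];
    if constexpr (!Initialize) {
      bool PredChanged = false;
      for (int P : B.Preds)
        PredChanged |= Blocks[P].Changed;
      if (!PredChanged) {
        B.Changed = false;
        continue;
      }
    }
    ++Visited;
    const std::vector<bool> Saved = B.Consumes;
    for (int P : B.Preds)
      for (std::size_t I = 0; I < Saved.size(); ++I)
        if (Blocks[P].Consumes[I])
          B.Consumes[I] = true;
    if constexpr (!Initialize) {
      B.Changed = (B.Consumes != Saved);
      AnyChanged |= B.Changed;
    }
  }
  return AnyChanged;
}

// Chain 0 -> 1 -> ... -> N-1 where each block initially consumes only itself.
std::vector<ToyBlock> makeToyChain(int N) {
  assert(N > 0);
  std::vector<ToyBlock> Blocks(N);
  for (int B = 0; B < N; ++B) {
    Blocks[B].Consumes.assign(N, false);
    Blocks[B].Consumes[B] = true;
    if (B > 0)
      Blocks[B].Preds.push_back(B - 1);
  }
  return Blocks;
}
```

In this model, a block with a back-edge predecessor is still recomputed in the second pass, because that predecessor's `Changed` flag is still the initial `true` when the block is reached. On a loop-free chain the second pass skips every block.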

As you can see, there are still many redundant operations. For example, the last iteration only checks whether a fixed point has been reached, and so on.

Yes, but it is simple and fast enough. Hence I abandoned my other patch and hoped we can just go with this. It nicely fixes the original issue and similar programs.

While a solution based on a priority queue should be better when programs become arbitrarily large, I'm not convinced the added complexity will help common "normal" sized functions and my main worry is that it adds more complexity IMO.

LGTM from me on this patch.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–230	Yes, it is more general. But more general is not automatically better especially if the chance of there ever being another type is really low. You have to consider that what mostly happens to code is that it is read by other people! What happens less often is that the code is written and even less often that the type is changed. In this case I think a fixed type helps readability (at the tiny cost of having to type more when we actually want to change the type in the future).

Yeah, I think the current patch is good enough (I am not against further optimizations). Also, Clang/LLVM prefers small changes to large ones. And it should be rare for a program to contain 100000+ co_await in practice.

I'd like to land this patch later.

Remove template parameter BBRangeTy.

This revision was landed with ongoing or failed builds. Aug 8 2023, 8:23 PM

Closed by commit rG132bb5cc5fd5: [NFC][Coroutines] Use a reverse post-order to guide the computation about cross… (authored by witstorm95, committed by Chuanqi Xu <chuaniq.xcq@alibaba-inc.com>).

This revision was automatically updated to reflect the committed changes.

Chuanqi Xu <chuaniq.xcq@alibaba-inc.com> added a commit: rG132bb5cc5fd5: [NFC][Coroutines] Use a reverse post-order to guide the computation about cross….

Harbormaster completed remote builds in B251261: Diff 548441. Aug 9 2023, 12:13 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Coroutines/CoroFrame.cpp	34 lines

Diff 548449

llvm/lib/Transforms/Coroutines/CoroFrame.cpp

(57 lines hidden; context: public:)
    size_t size() const { return V.size(); }

    BlockToIndexMapping(Function &F) {
      for (BasicBlock &BB : F)
        V.push_back(&BB);
      llvm::sort(V);
    }

-   size_t blockToIndex(BasicBlock *BB) const {
+   size_t blockToIndex(BasicBlock const *BB) const {
      auto *I = llvm::lower_bound(V, BB);
      assert(I != V.end() && *I == BB && "BasicBlockNumberng: Unknown block");
      return I - V.begin();
    }

    BasicBlock *indexToBlock(unsigned Index) const { return V[Index]; }
  };
  } // end anonymous namespace

(32 lines hidden; context: iterator_range<pred_iterator> predecessors(BlockData const &BD) const {)
      return llvm::predecessors(BB);
    }

    BlockData &getBlockData(BasicBlock *BB) {
      return Block[Mapping.blockToIndex(BB)];
    }

    /// Compute the BlockData for the current function in one iteration.
-   /// Returns whether the BlockData changes in this iteration.
    /// Initialize - Whether this is the first iteration, we can optimize
    /// the initial case a little bit by manual loop switch.
-   template <bool Initialize = false> bool computeBlockData();
+   /// Returns whether the BlockData changes in this iteration.
+   template <bool Initialize = false>
+   bool computeBlockData(const ReversePostOrderTraversal<Function *> &RPOT);

ChuanqiXu (Done), commenting on:

    /// the initial case a little bit by manual loop switch.
-   /// The parameter "RPOT" is a reverse post order.
    /// Returns whether the BlockData changes in this iteration.

The comment reads slightly odd. Also, I feel it may not be necessary to require RPO in `computeBlockData` since it simply iterates over the given range. So the following signature may be better:

template <bool Initialize = false, class BBRangeTy>
bool computeBlockData(const BBRangeTy &BBRange);

  public:
  #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
    void dump() const;
    void dump(StringRef Label, BitVector const &BV) const;
  #endif

    SuspendCrossingInfo(Function &F, coro::Shape &Shape);

(91 lines hidden; context: for (size_t I = 0, N = Block.size(); I < N; ++I) {)
      dbgs() << B->getName() << ":\n";
      dump(" Consumes", Block[I].Consumes);
      dump(" Kills", Block[I].Kills);
    }
    dbgs() << "\n";
  }
  #endif

- template <bool Initialize> bool SuspendCrossingInfo::computeBlockData() {
-   const size_t N = Mapping.size();
+ template <bool Initialize>
+ bool SuspendCrossingInfo::computeBlockData(
+     const ReversePostOrderTraversal<Function *> &RPOT) {
    bool Changed = false;

MatzeB (Not Done): Just put `ReversePostOrderTraversal<Function *>&` here. Adding a template parameter that only ever gets the same type just adds needless complexity.

witstorm95 (Author, Done): I guess BBRangeTy was used just to make computeBlockData more general. It means you can specify another BBRange to guide the computation without needing to modify this again.

MatzeB (Not Done): Yes, it is more general. But more general is not automatically better especially if the chance of there ever being another type is really low. You have to consider that what mostly happens to code is that it is read by other people! What happens less often is that the code is written and even less often that the type is changed. In this case I think a fixed type helps readability (at the tiny cost of having to type more when we actually want to change the type in the future).

-   for (size_t I = 0; I < N; ++I) {
-     auto &B = Block[I];
+   for (const BasicBlock *BB : RPOT) {
+     auto BBNo = Mapping.blockToIndex(BB);
+     auto &B = Block[BBNo];

      // We don't need to count the predecessors when initialization.
      if constexpr (!Initialize)

MatzeB (Not Done): Well, we should definitely check the predecessors even during initialization now; in RPO order we should have nearly all of them computed already (everything except for backedges).

witstorm95 (Author, Done): Every block is marked `Changed` as true before the initial iteration. So we don't need to check whether the predecessors have changed (they always have) during initialization.

        // If all the predecessors of the current Block don't change,
        // the BlockData for the current block must not change too.
        if (all_of(predecessors(B), [this](BasicBlock *BB) {
              return !Block[Mapping.blockToIndex(BB)].Changed;
            })) {
          B.Changed = false;
          continue;
        }

MatzeB (Not Done): Oh, I guess `B.Changed` is read here. But how does this work? Isn't `Changed` always `false` after the `Initialize == true` round? Wouldn't we just skip processing any block in the 2nd round then because of this?

witstorm95 (Author, Done): In the initial iteration (`Initialize == true`), we do not update `Changed`.

      // Saved Consumes and Kills bitsets so that it is easy to see
      // if anything changed after propagation.
      auto SavedConsumes = B.Consumes;
      auto SavedKills = B.Kills;

      for (BasicBlock *PI : predecessors(B)) {
        auto PrevNo = Mapping.blockToIndex(PI);
        auto &P = Block[PrevNo];

        // Propagate Kills and Consumes from predecessors into B.
        B.Consumes |= P.Consumes;
        B.Kills |= P.Kills;

        // If block P is a suspend block, it should propagate kills into block
        // B for every block P consumes.
        if (P.Suspend)
          B.Kills |= P.Consumes;
      }

      if (B.Suspend) {
-       // If block S is a suspend block, it should kill all of the blocks it
+       // If block B is a suspend block, it should kill all of the blocks it
        // consumes.
        B.Kills |= B.Consumes;
      } else if (B.End) {
        // If block B is an end block, it should not propagate kills as the
        // blocks following coro.end() are reached during initial invocation
        // of the coroutine while all the data are still available on the
        // stack or in the registers.
        B.Kills.reset();
      } else {
        // This is reached when B block it not Suspend nor coro.end and it
        // need to make sure that it is not in the kill set.
-       B.KillLoop |= B.Kills[I];
-       B.Kills.reset(I);
+       B.KillLoop |= B.Kills[BBNo];
+       B.Kills.reset(BBNo);
      }

      if constexpr (!Initialize) {
        B.Changed = (B.Kills != SavedKills) || (B.Consumes != SavedConsumes);
        Changed |= B.Changed;

MatzeB (Not Done): Do we actually need `B.Changed`? It seems to me we only write it and immediately read from it here, so a local variable would do instead?

      }
    }

-   if constexpr (Initialize)
-     return true;
    return Changed;
  }

  SuspendCrossingInfo::SuspendCrossingInfo(Function &F, coro::Shape &Shape)
      : Mapping(F) {
    const size_t N = Mapping.size();
    Block.resize(N);

(23 lines hidden; context: auto markSuspendBlock = [&](IntrinsicInst *BarrierInst) {)
      B.Kills |= B.Consumes;
    };
    for (auto *CSI : Shape.CoroSuspends) {
      markSuspendBlock(CSI);
      if (auto *Save = CSI->getCoroSave())
        markSuspendBlock(Save);
    }

-   computeBlockData</*Initialize=*/true>();
-   while (computeBlockData())
+   // It is considered to be faster to use RPO traversal for forward-edges
+   // dataflow analysis.
+   ReversePostOrderTraversal<Function *> RPOT(&F);
+   computeBlockData</*Initialize=*/true>(RPOT);
+   while (computeBlockData</*Initialize*/ false>(RPOT))
      ;

ChuanqiXu (Done): Maybe it is better to say: it is considered to be faster to use RPO traversal for forward-edges dataflow analysis. It may be best to cite this, but that may not be required.

    LLVM_DEBUG(dump());
  }

  namespace {

  // RematGraph is used to construct a DAG for rematerializable instructions
  // When the constructor is invoked with a candidate instruction (which is

(2,782 lines hidden)
