This is an archive of the discontinued LLVM Phabricator instance.

[NFC][Coroutines] Use a reverse post-order to guide the computation about cross suspend infomation to reach a fixed point faster.
ClosedPublic

Authored by witstorm95 on Aug 1 2023, 10:33 PM.

Details

Summary

Fixed https://github.com/llvm/llvm-project/issues/62348

Propagate cross suspend point information along reverse post-order.
It does not modify the original function, just selects a better traversal order.

Before the patch:

n: 20000
4.31user 0.11system 0:04.44elapsed 99%CPU (0avgtext+0avgdata 552352maxresident)k
0inputs+8848outputs (0major+126254minor)pagefaults 0swaps

n: 40000
11.24user 0.40system 0:11.66elapsed 99%CPU (0avgtext+0avgdata 1788404maxresident)k
0inputs+17600outputs (0major+431105minor)pagefaults 0swaps

n: 60000
21.65user 0.96system 0:22.62elapsed 99%CPU (0avgtext+0avgdata 3809836maxresident)k
0inputs+26352outputs (0major+934749minor)pagefaults 0swaps

n: 80000
37.05user 1.53system 0:38.58elapsed 99%CPU (0avgtext+0avgdata 6602396maxresident)k
0inputs+35096outputs (0major+1622584minor)pagefaults 0swaps

n: 100000
51.87user 2.67system 0:54.54elapsed 99%CPU (0avgtext+0avgdata 10210736maxresident)k
0inputs+43848outputs (0major+2518945minor)pagefaults 0swaps

After the patch:

n: 20000
3.08user 0.12system 0:03.21elapsed 99%CPU (0avgtext+0avgdata 551012maxresident)k
0inputs+8848outputs (0major+129349minor)pagefaults 0swaps

n: 40000
5.88user 0.33system 0:06.22elapsed 99%CPU (0avgtext+0avgdata 1789248maxresident)k
0inputs+17600outputs (0major+435096minor)pagefaults 0swaps

n: 60000
8.84user 0.77system 0:09.63elapsed 99%CPU (0avgtext+0avgdata 3807800maxresident)k
0inputs+26352outputs (0major+939119minor)pagefaults 0swaps

n: 80000
11.64user 1.58system 0:13.23elapsed 99%CPU (0avgtext+0avgdata 6604708maxresident)k
0inputs+35096outputs (0major+1629566minor)pagefaults 0swaps

n: 100000
15.21user 2.56system 0:17.79elapsed 99%CPU (0avgtext+0avgdata 10208828maxresident)k
8inputs+43848outputs (0major+2526611minor)pagefaults 0swaps

Diff Detail

Event Timeline

witstorm95 created this revision.Aug 1 2023, 10:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2023, 10:33 PM
witstorm95 requested review of this revision.Aug 1 2023, 10:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2023, 10:33 PM

This reads much better! Thanks.

How long would the patch compile for the case?

#include <cassert>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
  assert(argc == 2);
  int n = atoi(argv[1]);
  printf("#include <cstdio>\n");
  printf("#include <vector>\n");
  printf("#include \"task.h\"\n");
  printf("extern bool count(int);\n");
  printf("task t() {\n");
  printf("std::vector<int> v(%d);\n", n);
  printf("int val = 0;\n");
  printf("for (int i = 0; i < %d; ++i) {;\n", n);
  for (int i = 1; i <= n; i++) printf("if (count(v[i])) val++;\n");
  printf("}\n");
  printf("printf(\"%%d\\n\", val);\n");
  printf("co_return;\n");
  printf("}\n");
  return 0;
}
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
117–119

The comment reads slightly odd. Also I feel it may not be necessary to require RPO in computeBlockData since it simply iterates the given range. So may be the following signature may be better:

template <bool Initialize = false, class BBRangeTy>
bool computeBlockData(const BBRangeTy &BBRange);
327–332

Maybe it is better to say: it is considered to be faster to use RPO traversal for forward-edges dataflow analysis. It may be best to cite this but this may not be required.

@ChuanqiXu The compilation time as follow,

n: 20000
6.25user 0.24system 0:06.50elapsed 99%CPU (0avgtext+0avgdata 1065512maxresident)k
0inputs+16816outputs (0major+256090minor)pagefaults 0swaps

n: 40000
12.75user 0.93system 0:13.68elapsed 99%CPU (0avgtext+0avgdata 3784604maxresident)k
8inputs+33536outputs (0major+931086minor)pagefaults 0swaps

n: 60000
19.73user 2.11system 0:21.86elapsed 99%CPU (0avgtext+0avgdata 8261644maxresident)k
0inputs+50256outputs (0major+2045101minor)pagefaults 0swaps

n: 80000
28.88user 3.73system 0:32.63elapsed 99%CPU (0avgtext+0avgdata 14499536maxresident)k
0inputs+66976outputs (0major+3597496minor)pagefaults 0swaps

n: 100000
37.95user 5.84system 0:43.79elapsed 99%CPU (0avgtext+0avgdata 22475716maxresident)k
0inputs+83696outputs (0major+5583555minor)pagefaults 0swaps
witstorm95 marked 2 inline comments as done.Aug 1 2023, 11:21 PM

Accept these suggestions.

ChuanqiXu accepted this revision.Aug 2 2023, 12:00 AM

LGTM then. Thanks. Let's wait for @MatzeB.

This revision is now accepted and ready to land.Aug 2 2023, 12:00 AM
MatzeB added inline comments.Aug 2 2023, 2:05 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–228

Just put ReversePostOrderTraversal<Function *>& here. Adding a template parameter that only ever gets the same type just adds needless complexity.

283–284

Do we actually need B.Changed? Seems to me we only write and immediately read form it here, so a local variable would do instead?

MatzeB added inline comments.Aug 2 2023, 2:11 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
239–244

Oh guess B.Changed is read here. But how does this work? Isn't Changed always false after the Initialize == true round? Wouldn't we just skip processing any block in the 2nd round then because of this?

MatzeB added inline comments.Aug 2 2023, 4:18 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
235–236

Well we should definitely check the predecessors even in Initialization now, in an RPO order we should have nearly all of them computed already (everything except for backedges).

@MatzeB I'm sorry it took so long to get back to you as I have no time to do this.

As you can see, there still has many redundant operations. Just like the last iteration is to check whether it has reached a fixed point and etc.

If we want to remove these redundant operation, there still are ways to do it. I means we can do it based on your previous patch(https://reviews.llvm.org/D156835). I will give your some advises to improve on your patch. Let we jump to D156835.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–228

I guess use BBRangeTy just want to make computeBlockData more general. It means you can specify another BBRange to guide the computation and no need to modificy here again.

235–236

Every block has marked Changed as true before initial iteration. So we don't need to check if the predecessors has changed(Always changed) in initialization.

239–244

In the initial iteration(Initialize == true), we will not update Changed.

MatzeB accepted this revision.Aug 8 2023, 8:59 AM

As you can see, there still has many redundant operations. Just like the last iteration is to check whether it has reached a fixed point and etc.

Yes, but it is simple and fast enough. Hence I abandoned my other patch and hoped we can go just go with this. It nicely fixes the original issue and similar programs.

While a solution based on a priority queue should be better when programs become arbitrarily large, I'm not convinced the added complexity will help common "normal" sized functions and my main worry is that it adds more complexity IMO.

LGTM from me on this patch.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
227–228

Yes, it is more general. But more general is not automatically better especially if the chance of there ever being another type is really low. You have to consider that what mostly happens to code is that it is read by other people! What happens less often is that the code is written and even less often that the type is changed. In this case I think a fixed type helps readability (at the tiny cost of having to type more when we actually want to change the type in the future).

Yeah, I think the current patch is good enough. (I am not against further optimizations). Also Clang/LLVM prefers small changes than large changes. And it is should be rare for a program contains 100000+ co_await in practice.

I'd like to land this patch later.

witstorm95 updated this revision to Diff 548441.Aug 8 2023, 7:06 PM

Remove template parameter BBRangeTy.