This is an archive of the discontinued LLVM Phabricator instance.

CoroFrame: Rework SuspendCrossingInfo analysis
AbandonedPublic

Authored by MatzeB on Aug 1 2023, 4:07 PM.

Details

Summary

This changes the dataflow solver into a two-phase approach:

  • In a first pass, visit every block in reverse post-order, which is a topological ordering for loop-free code. Put "backedges" into a worklist.
  • Re-visit the blocks on the worklist with a worklist dataflow algorithm until a fixpoint is reached.

This should be optimal for loop-free code and lead to faster convergence when loops are present (compared to the current code). A rough sketch of the scheme is shown below.
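
For illustration, a rough sketch of the two-phase scheme (a sketch only; transfer() and the other names below are placeholders, not the identifiers used in this patch):

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h" // ReversePostOrderTraversal
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/CFG.h"                // predecessors(), successors()
#include "llvm/IR/Function.h"
using namespace llvm;

// Assumed helper: merges the predecessors' state into BB and applies BB's
// local effect; returns true if BB's state changed.
bool transfer(BasicBlock *BB);

void solveTwoPhase(Function &F) {
  ReversePostOrderTraversal<Function *> RPOT(&F);
  DenseMap<BasicBlock *, unsigned> RPONumber;
  unsigned Num = 0;
  for (BasicBlock *BB : RPOT)
    RPONumber[BB] = Num++;

  // Phase 1: one sweep in reverse post-order. For loop-free code this is a
  // topological order, so every predecessor is final before its successors
  // are visited and a single pass suffices.
  SmallVector<BasicBlock *, 16> Worklist;
  for (BasicBlock *BB : RPOT) {
    transfer(BB);
    for (BasicBlock *Pred : predecessors(BB))
      // A predecessor with an equal or higher RPO number has not been
      // processed yet, i.e. Pred->BB is a backedge: queue BB for re-visiting.
      if (RPONumber.lookup(Pred) >= RPONumber.lookup(BB))
        Worklist.push_back(BB);
  }

  // Phase 2: classic worklist iteration until a fixpoint is reached. Only
  // blocks affected by backedges are ever re-queued.
  while (!Worklist.empty()) {
    BasicBlock *BB = Worklist.pop_back_val();
    if (transfer(BB))
      for (BasicBlock *Succ : successors(BB))
        Worklist.push_back(Succ); // a real solver would deduplicate entries
  }
}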

This is inspired by https://reviews.llvm.org/D154695

Performance of the artificial test case in https://github.com/llvm/llvm-project/issues/62348 without this patch:

n: 20000
6.27user 0.22system 0:06.50elapsed 99%CPU (0avgtext+0avgdata 498664maxresident)k
0inputs+8824outputs (0major+115980minor)pagefaults 0swaps

n: 40000
18.21user 0.87system 0:19.10elapsed 99%CPU (0avgtext+0avgdata 1722016maxresident)k
0inputs+17576outputs (0major+425395minor)pagefaults 0swaps

n: 60000
35.72user 1.74system 0:37.47elapsed 99%CPU (0avgtext+0avgdata 3742560maxresident)k
0inputs+26456outputs (0major+936001minor)pagefaults 0swaps

n: 80000
58.58user 3.06system 1:01.65elapsed 99%CPU (0avgtext+0avgdata 6513700maxresident)k
0inputs+35328outputs (0major+1631166minor)pagefaults 0swaps

n: 100000
82.78user 4.64system 1:27.44elapsed 99%CPU (0avgtext+0avgdata 10117168maxresident)k
0inputs+44016outputs (0major+2552478minor)pagefaults 0swaps

with this change:

n: 20000
3.59user 0.22system 0:03.82elapsed 99%CPU (0avgtext+0avgdata 498804maxresident)k
0inputs+9016outputs (0major+119143minor)pagefaults 0swaps

n: 40000
8.57user 0.81system 0:09.39elapsed 99%CPU (0avgtext+0avgdata 1722108maxresident)k
0inputs+17832outputs (0major+431756minor)pagefaults 0swaps

n: 60000
12.70user 1.79system 0:14.50elapsed 99%CPU (0avgtext+0avgdata 3742640maxresident)k
0inputs+26328outputs (0major+945180minor)pagefaults 0swaps

n: 80000
19.23user 3.07system 0:22.32elapsed 99%CPU (0avgtext+0avgdata 6513620maxresident)k
0inputs+35200outputs (0major+1643901minor)pagefaults 0swaps

n: 100000
22.48user 4.71system 0:27.20elapsed 99%CPU (0avgtext+0avgdata 10117160maxresident)k
0inputs+43952outputs (0major+2571138minor)pagefaults 0swaps

and with https://reviews.llvm.org/D154695, which performs the same within measurement noise:

n: 20000
3.84user 0.22system 0:04.07elapsed 99%CPU (0avgtext+0avgdata 498536maxresident)k
0inputs+8824outputs (0major+116001minor)pagefaults 0swaps

n: 40000
8.33user 0.83system 0:09.18elapsed 99%CPU (0avgtext+0avgdata 1721764maxresident)k
0inputs+17704outputs (0major+425381minor)pagefaults 0swaps

n: 60000
12.68user 1.78system 0:14.47elapsed 99%CPU (0avgtext+0avgdata 3742440maxresident)k
0inputs+26328outputs (0major+936004minor)pagefaults 0swaps

n: 80000
18.10user 3.02system 0:21.13elapsed 99%CPU (0avgtext+0avgdata 6513456maxresident)k
0inputs+35328outputs (0major+1631150minor)pagefaults 0swaps

n: 100000
22.15user 4.56system 0:26.73elapsed 99%CPU (0avgtext+0avgdata 10116828maxresident)k
0inputs+43824outputs (0major+2552466minor)pagefaults 0swaps

Diff Detail

Event Timeline

MatzeB created this revision.Aug 1 2023, 4:07 PM
Herald added a project: Restricted Project.Aug 1 2023, 4:07 PM
MatzeB requested review of this revision.Aug 1 2023, 4:07 PM
Herald added a project: Restricted Project.Aug 1 2023, 4:07 PM
MatzeB added a comment.Aug 1 2023, 4:08 PM

Publishing an alternative to D154695 here. I got somewhat nerd-sniped yesterday trying to prove that RPOs work perfectly fine and lead to clean code...

MatzeB edited the summary of this revision.Aug 1 2023, 4:13 PM
MatzeB edited the summary of this revision.Aug 1 2023, 4:21 PM
MatzeB edited the summary of this revision.
MatzeB edited the summary of this revision.

Looks pretty good. Thanks.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
217

Is it better to take only one BasicBlock argument? Then we can get the BlockData and the block index from Mapping and Block.
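
A hypothetical fragment illustrating that suggestion, assuming the usual Mapping/Block members of SuspendCrossingInfo (not the actual patch code):

// Hypothetical: pass only the BasicBlock and derive the rest from it.
BlockData &getBlockData(BasicBlock *BB) {
  size_t Index = Mapping.blockToIndex(BB); // block index via the mapping
  return Block[Index];                     // BlockData via the index
}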

witstorm95 added a comment.EditedAug 1 2023, 7:55 PM

Publishing an alternative to D154695 here. I got somewhat nerd-sniped yesterday trying to prove that RPOs work perfectly fine and lead to clean code...

It works well on the test case in https://github.com/llvm/llvm-project/issues/62348. But if we add some loops to generate a more complex CFG, it takes a long time in the SuspendCrossingInfo analysis. The modification to gen2.cpp is as follows:

#include <cassert>
#include <cstdio>
#include <cstdlib>

// Emits a coroutine whose body is a loop containing n conditional branches.
int main(int argc, char **argv) {
  assert(argc == 2);
  int n = atoi(argv[1]);
  printf("#include <cstdio>\n");
  printf("#include <vector>\n");
  printf("#include \"task.h\"\n");
  printf("extern bool count(int);\n");
  printf("task t() {\n");
  printf("std::vector<int> v(%d);\n", n);
  printf("int val = 0;\n");
  printf("for (int i = 0; i < %d; ++i) {;\n", n);
  for (int i = 1; i <= n; i++) printf("if (count(v[i])) val++;\n");
  printf("}\n");
  printf("printf(\"%%d\\n\", val);\n");
  printf("co_return;\n");
  printf("}\n");
  return 0;
}

It takes 9 mins when n = 20000.

https://reviews.llvm.org/D156850
I created another NFC patch to fix this issue. The results are:

n: 20000
6.29user 0.43system 0:06.72elapsed 99%CPU (0avgtext+0avgdata 1065792maxresident)k
0inputs+16816outputs (0major+256284minor)pagefaults 0swaps

n: 40000
13.17user 1.14system 0:14.31elapsed 99%CPU (0avgtext+0avgdata 3787144maxresident)k
0inputs+33536outputs (0major+932693minor)pagefaults 0swaps

n: 60000
20.32user 2.16system 0:22.50elapsed 99%CPU (0avgtext+0avgdata 8262256maxresident)k
0inputs+50256outputs (0major+2045494minor)pagefaults 0swaps

n: 80000
29.12user 3.85system 0:32.98elapsed 99%CPU (0avgtext+0avgdata 14499404maxresident)k
8inputs+66976outputs (0major+3596571minor)pagefaults 0swaps

n: 100000
37.64user 6.01system 0:43.65elapsed 99%CPU (0avgtext+0avgdata 22475628maxresident)k
0inputs+83696outputs (0major+5582433minor)pagefaults 0swaps
MatzeB added a comment.Aug 2 2023, 1:38 PM

Curious... With a single loop (and no loop nesting) I would have expected the worklist algorithm to converge quickly enough (maybe around double the time of the loop-free program in practice). I would still be interested to learn why this goes wrong, as I think of the worklist algorithm as the standard way to solve dataflow problems...

Though I'm also happy to go with @witstorm95's change, as I probably shouldn't spend much more time on this now that our builds are restored.

MatzeB added a comment.Aug 2 2023, 3:21 PM

I checked, and the number of blocks visited only increases linearly, so that part is as expected. There are some effects leading to quadratic compile time because the Consume/Kill BitVectors also get bigger with the number of basic blocks, but we can't avoid that and it is independent of the dataflow solving order.
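
A short worked estimate of that unavoidable quadratic part (B is the number of basic blocks; the factor of two is just the two bit vectors per block):

\[
B \;\text{blocks} \times 2B \;\text{bits (Consume + Kill)} = 2B^2 \;\text{bits of state},
\]

so a single sweep that touches every bit is already \(\Omega(B^2)\), regardless of the solving order.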

MatzeB added a comment.Aug 2 2023, 4:15 PM

I finally understand what happened. It turns out the runtime of the worklist algorithm on the loop variant is dominated by one exception-handling block that has N predecessors for a genX input of size N. We updated each of those N predecessors, and each time we re-visited that block and merged the info from all N predecessors. Not sure what to make of this... it is surely partly an artifact of these artificially big inputs, but we happen to avoid the problem by just doing multiple rounds of RPO over the whole function, as in https://reviews.llvm.org/D156850, so let's go with that.
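
As a rough worked count of that effect (N predecessors of the one exception-handling block, as described above):

\[
\underbrace{N \;\text{re-visits}}_{\text{one per predecessor update}} \times \underbrace{N \;\text{predecessor sets merged per visit}}_{\text{IN recomputed from scratch}} = N^2 \;\text{merge operations},
\]

with each merge itself working on bit vectors whose size also grows with N.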

MatzeB added a comment.Aug 2 2023, 4:23 PM

On another side note: the quadratic behavior for blocks with multiple predecessors would not happen in a "by-the-book" dataflow implementation where the IN and OUT sets are stored separately. The fact that we compute the IN set on the fly here, but don't store it in the BlockData, led to the runtime explosion with the worklist algorithm...
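
A minimal sketch of that "by-the-book" variant (the struct and field names are illustrative, not the actual BlockData layout): caching the merged IN set per block means an update from one changed predecessor only ORs that predecessor's OUT set into the cache, instead of re-merging all N predecessors.

#include "llvm/ADT/BitVector.h"
using namespace llvm;

// Illustrative per-block state with the IN set stored explicitly.
struct BlockStateWithIn {
  BitVector In;  // cached union of all predecessors' Out sets
  BitVector Out; // In plus this block's local effect
};

// Called when one predecessor's Out set changed: OR it into the cached In
// set. The work is proportional to one bit vector, independent of how many
// predecessors the block has (the sets only grow, so this stays correct).
bool mergeOnePred(BlockStateWithIn &Succ, const BitVector &PredOut) {
  BitVector OldIn = Succ.In;
  Succ.In |= PredOut;
  return Succ.In != OldIn; // re-queue the successor only if anything changed
}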

witstorm95 added a comment.EditedAug 4 2023, 10:26 PM

The re-visit order is still important. I mean, it needs to keep the RPO order too. So we can use a priority queue instead of a deque for the worklist; then the previous problem is solved.

I think this advice is enough. Maybe I should create another patch to show my idea?
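
A small sketch of that idea (illustrative only; the transfer callback and helper names are assumptions): keep the worklist as a priority queue keyed by RPO number, so blocks are always re-visited in RPO order.

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include <queue>
#include <vector>
using namespace llvm;

// Process worklist entries in RPO order so a block is (re)visited after the
// predecessors queued in the same round, restoring fast convergence.
void solveWithRPOQueue(Function &F, bool (*transfer)(BasicBlock *)) {
  DenseMap<BasicBlock *, unsigned> RPONumber;
  unsigned Num = 0;
  for (BasicBlock *BB : ReversePostOrderTraversal<Function *>(&F))
    RPONumber[BB] = Num++;

  auto Greater = [&](BasicBlock *A, BasicBlock *B) {
    return RPONumber.lookup(A) > RPONumber.lookup(B);
  };
  // Min-heap on RPO number: the earliest block in RPO order comes out first.
  std::priority_queue<BasicBlock *, std::vector<BasicBlock *>,
                      decltype(Greater)>
      Worklist(Greater);

  for (BasicBlock &BB : F)
    Worklist.push(&BB);
  while (!Worklist.empty()) {
    BasicBlock *BB = Worklist.top();
    Worklist.pop();
    if (transfer(BB)) // returns true if BB's state changed
      for (BasicBlock *Succ : successors(BB))
        Worklist.push(Succ); // a real solver would also deduplicate entries
  }
}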