This is an archive of the discontinued LLVM Phabricator instance.

[Coroutines] Add an O(n) algorithm for computing the cross suspend point information.
ClosedPublic

Authored by witstorm95 on Jul 7 2023, 2:15 AM.

Details

Summary

Fixed https://github.com/llvm/llvm-project/issues/62348

Propagate cross-suspend-point information by visiting the CFG.

With at most two passes over the CFG, all of the cross-suspend-point information can be collected.

Before the patch:

n: 20000
4.31user 0.11system 0:04.44elapsed 99%CPU (0avgtext+0avgdata 552352maxresident)k
0inputs+8848outputs (0major+126254minor)pagefaults 0swaps

n: 40000
11.24user 0.40system 0:11.66elapsed 99%CPU (0avgtext+0avgdata 1788404maxresident)k
0inputs+17600outputs (0major+431105minor)pagefaults 0swaps

n: 60000
21.65user 0.96system 0:22.62elapsed 99%CPU (0avgtext+0avgdata 3809836maxresident)k
0inputs+26352outputs (0major+934749minor)pagefaults 0swaps

n: 80000
37.05user 1.53system 0:38.58elapsed 99%CPU (0avgtext+0avgdata 6602396maxresident)k
0inputs+35096outputs (0major+1622584minor)pagefaults 0swaps

n: 100000
51.87user 2.67system 0:54.54elapsed 99%CPU (0avgtext+0avgdata 10210736maxresident)k
0inputs+43848outputs (0major+2518945minor)pagefaults 0swaps

After the patch:

n: 20000
3.17user 0.16system 0:03.33elapsed 100%CPU (0avgtext+0avgdata 551736maxresident)k
0inputs+8848outputs (0major+126192minor)pagefaults 0swaps

n: 40000
6.10user 0.42system 0:06.54elapsed 99%CPU (0avgtext+0avgdata 1787848maxresident)k
0inputs+17600outputs (0major+432212minor)pagefaults 0swaps

n: 60000
9.13user 0.89system 0:10.03elapsed 99%CPU (0avgtext+0avgdata 3809108maxresident)k
0inputs+26352outputs (0major+931280minor)pagefaults 0swaps

n: 80000
12.44user 1.57system 0:14.02elapsed 99%CPU (0avgtext+0avgdata 6603432maxresident)k
0inputs+35096outputs (0major+1624635minor)pagefaults 0swaps

n: 100000
16.29user 2.28system 0:18.59elapsed 99%CPU (0avgtext+0avgdata 10212808maxresident)k
0inputs+43848outputs (0major+2522200minor)pagefaults 0swaps

Diff Detail

Event Timeline

witstorm95 created this revision.Jul 7 2023, 2:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2023, 2:15 AM
witstorm95 requested review of this revision.Jul 7 2023, 2:15 AM
witstorm95 edited the summary of this revision. (Show Details)Jul 7 2023, 2:37 AM
witstorm95 removed a project: Restricted Project.Jul 7 2023, 5:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2023, 5:16 PM

Thanks for working on this.

While I haven't looked into the details, would you like to give the new algorithm an abstract explanation in the comments? Also, I'd suggest avoiding names like IndegreeTmp, VisitingTmp, and Tmp; more informative names would be better.

witstorm95 updated this revision to Diff 538502.Jul 9 2023, 8:23 PM

Add algorithm comments.

Sorry, I still can't understand the algorithm. E.g., why is it sufficient to iterate over the graph 3 times? How do we identify loops in the algorithm? Could you elaborate on it in the code?

Thanks for your suggestions; it is clearer now.

I roughly got your point, and it makes sense to me. But the patch itself is still not very readable; it would be better to refactor it to make it more informative.

For example,

  1. VisitingSave is not really different from VisitingTmp.
  2. We can't tell the meaning of names like TopVisiting. If it is hard to give a variable a meaningful name, we should add a comment describing what it means.
  3. It is confusing that variables with names like Indegree and HasLoop change during the iteration: such names suggest they describe attributes of the graph, and those shouldn't change if the graph doesn't change.
  4. The name J and the first/second/third iterations are really confusing. We should split them into phases with proper names.
  5. Also, it is bad to put everything in one big loop. We can split it by functionality, e.g., one phase to collect the consume information and another to calculate the kill information.

Also, for correctness, it is recommended to run this against the folly library with coroutines enabled to make sure we don't get things wrong.

RKSimon resigned from this revision.Jul 10 2023, 2:55 AM

Sorry but I don't know this part of the codebase well enough to review

@ChuanqiXu I have refactored it per your suggestions. Do you think it is better now?

witstorm95 added a comment.EditedJul 10 2023, 10:13 PM

Also, for correctness, it is recommended to run this against the folly library with coroutines enabled to make sure we don't get things wrong.

I have tested the folly library. The results are the same before and after this patch, but some tests still fail; maybe it's because I'm using WSL2. Here are the failed cases:
The following tests FAILED:

245 - heap_vector_types_test.HeapVectorTypes.GrowthPolicy (Failed)
1362 - HHWheelTimerTest.HHWheelTimerTest.FireOnce (Failed)
1366 - HHWheelTimerTest.HHWheelTimerTest.CancelTimeout (Failed)
1368 - HHWheelTimerTest.HHWheelTimerTest.SlowFast (Failed)
1369 - HHWheelTimerTest.HHWheelTimerTest.ReschedTest (Failed)
1370 - HHWheelTimerTest.HHWheelTimerTest.DeleteWheelInTimeout (Failed)
1371 - HHWheelTimerTest.HHWheelTimerTest.DefaultTimeout (Failed)
1375 - HHWheelTimerTest.HHWheelTimerTest.IntrusivePtr (Failed)
1487 - lang_exception_test.ExceptionTest.terminate_with_direct (Failed)
1488 - lang_exception_test.ExceptionTest.terminate_with_variadic (Failed)

The failure reason is 'unable to determine jiffies/second: failed to parse kernel release string "5.15.90.1-microsoft-standard-WSL2"'.

To make sure coroutines are enabled, I wrote a test case to check. Here is the code (test.cpp):

#include <folly/experimental/coro/Task.h>
#include <folly/experimental/coro/BlockingWait.h>
#include <folly/futures/Future.h>
#include <folly/executors/GlobalExecutor.h>
#include <folly/init/Init.h>
#include <iostream>

folly::coro::Task<int> slow() {
  std::cout << "before sleep" << std::endl;
  co_await folly::futures::sleep(std::chrono::seconds{1});
  std::cout << "after sleep" << std::endl;
  co_return 1;
}

int main(int argc, char** argv) {
  std::cout << FOLLY_HAS_COROUTINES << std::endl;
  folly::init(&argc, &argv);
  folly::coro::blockingWait(
      slow().scheduleOn(folly::getGlobalCPUExecutor().get()));
  return 0;
}

And the build script b.sh is:

INSTALL_PATH=/tmp/fbcode_builder_getdeps-ZhomeZwitstormZgithubZfollyZbuildZfbcode_builder/installed/
clang++ test.cpp \
  -I $INSTALL_PATH/folly/include/ \
  -I $INSTALL_PATH/glog-*/include/ \
  -I $INSTALL_PATH/gflags-*/include \
  -I $INSTALL_PATH/boost-*/include/ \
  -I $INSTALL_PATH/fmt-*/include \
  -I $INSTALL_PATH/double-conversion-*/include \
  -I $INSTALL_PATH/libevent-*/include \
  -L $INSTALL_PATH/folly/lib/ \
  -L $INSTALL_PATH/glog-*/lib/ \
  -L $INSTALL_PATH/gflags-*/lib \
  -L $INSTALL_PATH/boost-*/lib/ \
  -L $INSTALL_PATH/fmt-*/lib \
  -L $INSTALL_PATH/double-conversion-*/lib \
  -L $INSTALL_PATH/libevent-*/lib \
  -std=c++20 \
  -stdlib=libc++ \
  -lpthread -lm -ldl -lfolly -lfolly_test_util -lfollybenchmark \
  -lglog -lgflags -levent -ldouble-conversion \
  -lfmt -lunwind -lboost_context

export LD_LIBRARY_PATH=$INSTALL_PATH/glog-6R0Ow7ztX3g6RnwHaeTiDLzaaoxXwt87lAWO5PSwHzU/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$INSTALL_PATH/gflags-AZEDFvV8PiCcmWJd_mLRhEQ1irNZvVX2HbvgCsFWJ_Q/lib/:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$INSTALL_PATH/libevent-nN5usJQWSTX11BbXrm3tORQxwAjvOLlXhgNZrK8zRS4/lib/:$LD_LIBRARY_PATH
./a.out

And the result is:
1
before sleep
after sleep

Thanks for running the out-of-tree tests; I know it is not easy. The failing tests look unrelated to coroutines. Another way to test this is to add an assert(false) in your code and re-run the tests; that way you can confirm the tests actually exercise your new code.

Then let's focus on readability in this review. It looks better now, but there are still confusing points, like:

if constexpr (Iteration > 1)

Also, we'd better not use the term Indegree, so that we can avoid things like:

Indegrees[I]--;

I suggest replacing the current computeBlockData with two functions: collectConsumingBlock<bool SearchBackEdges> and collectKillingBlock. Then the semantics should be much clearer.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
123

In the algorithm, I prefer the term back edge over loop. Loop has many more meanings and attributes, so back edge is more precise.

Also, with the term back edge, we can describe the algorithm much more simply:

  1. In the initial iteration, all the consume information along the forward edges is collected.
  2. (If there are any back edges,) we iterate again to get all the consume information along the back edges.
  3. We compute the kill information from the consume information.

In this way, it is much easier to understand too.

I suggest replacing the current computeBlockData with two functions: collectConsumingBlock<bool SearchBackEdges> and collectKillingBlock. Then the semantics should be much clearer.

For code reuse, I don't really think splitting computeBlockData into two functions is a good idea.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
123

@ChuanqiXu Thanks for your comment.

It took me some time to figure out what terms like back edge and forward edge mean; see DFS Edge Classification. Is it really suitable for a topological sorting algorithm?

If we use those terms, I have some questions about your description of the algorithm:

In the initial iteration, all the consume information along the forward edges is collected.

The Consume info along the tree edges and back edges (if any) is collected, and at the same time the Kill info has to be collected too, as we don't know whether back edges exist during the initial iteration.

(If there are any back edges,) we iterate again to get all the consume information along the back edges.

The Consume info along the tree edges and back edges is collected again, and the Kill info has to be collected too, as you can't compute all the Kill information in the third iteration alone.

For code reuse, I don't really think splitting computeBlockData into two functions is a good idea.

My major concern is readability. The so-called 1st/2nd/3rd iterations are confusing: they are magic numbers, and readers may have a hard time following them.

Then, out of curiosity, is the algorithm capable of handling the following case?

J -> I
I -> S
S -> I

S is a suspend block, and I.Kill[J] should be false, since any path from J to I that contains the suspend block must repeat node I, so it doesn't fit the definition of Kill. I feel the algorithm doesn't get this right, but I am not sure. It is better to give it a test case too.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
123

It took me some time to figure out what terms like back edge and forward edge mean; see DFS Edge Classification. Is it really suitable for a topological sorting algorithm?

Sorry that it is confusing. This is a suggestion rather than a requirement; different areas have different terminology conventions. For example, I hadn't seen the definitions of tree edge and cross edge before the link you gave.

In compiler engineering, the term 'loop' has many more meanings than in pure algorithms. So my instinctive reaction when I saw your patch was that you had done something wrong, since I felt that if you wanted loop information, you should query it from the existing components. But then I realized all you need is the back edges.

you can't compute all the Kill information in the third iteration alone.

It is the third step, and it is not necessarily an iteration; for example, in your current implementation, it may still visit a node multiple times if the node has many predecessors.

For code reuse, I don't really think splitting computeBlockData into two functions is a good idea.

My major concern is readability. The so-called 1st/2nd/3rd iterations are confusing: they are magic numbers, and readers may have a hard time following them.

Then, out of curiosity, is the algorithm capable of handling the following case?

J -> I
I -> S
S -> I

S is a suspend block, and I.Kill[J] should be false, since any path from J to I that contains the suspend block must repeat node I, so it doesn't fit the definition of Kill. I feel the algorithm doesn't get this right, but I am not sure. It is better to give it a test case too.

Thanks, this is a good example. But this problem exists in the previous algorithm too (before this patch). Let me illustrate it for you.

Pick a visiting order: J, I, S

Initial status:

J Consumes: J         Kills:           Suspend: false   End: false     KillLoop: false    Changed: true
I Consumes: I         Kills:           Suspend: false   End: false     KillLoop: false    Changed: true
S Consumes: S         Kills: S         Suspend: true    End: false     KillLoop: false    Changed: true

Executing computeBlockData</*Initialize=*/true>():

Visit J, J has no predecessors.           Status of J Consumes: J         Kills:           Suspend: false   End: false     KillLoop: false    Changed: true
Visit I, I has predecessors J, S. 
  So propagate J to I. After propagation, status of I Consumes: J, I         Kills:           Suspend: false   End: false     KillLoop: false    Changed: true
  So propagate S to I. After propagation, status of I Consumes: J, I, S         Kills: S         Suspend: false   End: false     KillLoop: false    Changed: true
                           After visit I, status of I Consumes: J, I, S         Kills: S         Suspend: false   End: false     KillLoop: false    Changed: true
Visit S, S has predecessors I.
   So propagate I to S. After propagation status of S Consumes: J, I, S         Kills: J, I, S         Suspend: true    End: false     KillLoop: false    Changed: true

After executing computeBlockData</*Initialize=*/true>(), the status is:

J Consumes: J               Kills:           Suspend: false   End: false     KillLoop: false    Changed: true
I Consumes: J, I, S         Kills: S         Suspend: false   End: false     KillLoop: false    Changed: true
S Consumes: J, I, S         Kills: J, I, S   Suspend: true    End: false     KillLoop: false    Changed: true

Executing while (computeBlockData());

Visit J, J has no predecessors.            Status of J Consumes: J         Kills:           Suspend: false   End: false     KillLoop: false    Changed: false
Visit I, I has predecessors J, S, and its predecessors have changed.
   So propagate J to I. After propagation, status of I Consumes: J, I, S         Kills: S          Suspend: false   End: false     KillLoop: false    Changed: true
   So propagate S to I. After propagation, status of I Consumes: J, I, S         Kills: J, S         Suspend: false   End: false     KillLoop: true    Changed: true
                            After visit I, status of I Consumes: J, I, S         Kills: J, S         Suspend: false   End: false     KillLoop: true    Changed: true
Visit S, S has predecessor I, and its predecessor has changed.
   So propagate I to S, after propagation status of S Consumes: J, I, S         Kills: J, I, S         Suspend: true    End: false     KillLoop: false    Changed: false

After the first iteration of while (computeBlockData()), the status is:

J Consumes: J               Kills:              Suspend: false   End: false     KillLoop: false    Changed: false
I Consumes: J, I, S         Kills: J, S         Suspend: false   End: false     KillLoop: true     Changed: true
S Consumes: J, I, S         Kills: J, I, S      Suspend: true    End: false     KillLoop: false    Changed: false

S is a suspend block, and I.Kill[J] should be false, since any path from J to I that contains the suspend block must repeat node I, so it doesn't fit the definition of Kill.

I think the trace above is enough to demonstrate that, unless there is something wrong with my reasoning.

@ChuanqiXu It took me some time to figure out where the cross-suspend-point information is used and what it does. It is used to determine whether a stack variable needs to be placed on the coroutine frame. So the current problem does not affect the correctness of coroutine programs.

If we stick to the definition of Kills, we may get the wrong result in some cases. For example (llvm/test/Transforms/Coroutines/ArgAddr.ll):

define nonnull ptr @f(i32 %n) presplitcoroutine {
; CHECK-LABEL: @f(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    [[ID:%.*]] = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr @f.resumers)
; CHECK-NEXT:    [[N_ADDR:%.*]] = alloca i32, align 4
; CHECK-NEXT:    store i32 [[N:%.*]], ptr [[N_ADDR]], align 4
; CHECK-NEXT:    [[CALL:%.*]] = tail call ptr @malloc(i32 24)
; CHECK-NEXT:    [[TMP0:%.*]] = tail call noalias nonnull ptr @llvm.coro.begin(token [[ID]], ptr [[CALL]])
; CHECK-NEXT:    store ptr @f.resume, ptr [[TMP0]], align 8
; CHECK-NEXT:    [[DESTROY_ADDR:%.*]] = getelementptr inbounds [[F_FRAME:%.*]], ptr [[TMP0]], i32 0, i32 1
; CHECK-NEXT:    store ptr @f.destroy, ptr [[DESTROY_ADDR]], align 8
; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds [[F_FRAME]], ptr [[TMP0]], i32 0, i32 2
; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[N_ADDR]], align 4
; CHECK-NEXT:    store i32 [[TMP2]], ptr [[TMP1]], align 4
;
entry:
  %id = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr null);
  %n.addr = alloca i32
  store i32 %n, ptr %n.addr ; this needs to go after coro.begin
  %0 = tail call i32 @llvm.coro.size.i32()
  %call = tail call ptr @malloc(i32 %0)
  %1 = tail call noalias nonnull ptr @llvm.coro.begin(token %id, ptr %call)
  call void @ctor(ptr %n.addr)
  br label %for.cond

for.cond:
  %2 = load i32, ptr %n.addr
  %dec = add nsw i32 %2, -1
  store i32 %dec, ptr %n.addr
  call void @print(i32 %2)
  %3 = call i8 @llvm.coro.suspend(token none, i1 false)
  %conv = sext i8 %3 to i32
  switch i32 %conv, label %coro_Suspend [
  i32 0, label %for.cond
  i32 1, label %coro_Cleanup
  ]

coro_Cleanup:
  %4 = call ptr @llvm.coro.free(token %id, ptr nonnull %1)
  call void @free(ptr %4)
  br label %coro_Suspend

coro_Suspend:
  call i1 @llvm.coro.end(ptr null, i1 false)
  ret ptr %1
}

Apparently, the stack variable %n.addr should be placed on the coroutine frame, as it is used in the for.cond block, which contains a suspend point. Right? But if we stick to the definition of Kills, we can't get this result. Why? Because splitAround splits for.cond into 4 blocks (for.cond, CoroSave, CoroSuspend, AfterCoroSuspend) before the cross-suspend-point information is computed, and this loses some information. Here is the CFG of function f; the code of each block is as follows:

entry:
  %id = call token @llvm.coro.id(i32 0, ptr null, ptr null, ptr null)
  %n.addr = alloca i32, align 4
  store i32 %n, ptr %n.addr, align 4
  %0 = tail call i32 @llvm.coro.size.i32()
  %call = tail call ptr @malloc(i32 %0)
  %1 = tail call noalias nonnull ptr @llvm.coro.begin(token %id, ptr %call)
  call void @ctor(ptr %n.addr)
  br label %for.cond

for.cond:                                         ; preds = %AfterCoroSuspend, %entry
  %2 = load i32, ptr %n.addr, align 4
  %dec = add nsw i32 %2, -1
  store i32 %dec, ptr %n.addr, align 4
  call void @print(i32 %2)
  br label %CoroSave

coro_Suspend:                                     ; preds = %coro_Cleanup, %AfterCoroSuspend
  br label %CoroEnd

coro_Cleanup:                                     ; preds = %AfterCoroSuspend
  %5 = call ptr @llvm.coro.free(token %id, ptr nonnull %1)
  call void @free(ptr %5)
  br label %coro_Suspend

CoroSave:                                         ; preds = %for.cond
  %3 = call token @llvm.coro.save(ptr %1)
  br label %CoroSuspend

CoroSuspend:                                      ; preds = %CoroSave
  %4 = call i8 @llvm.coro.suspend(token %3, i1 false)
  br label %AfterCoroSuspend

AfterCoroSuspend:                                 ; preds = %CoroSuspend
  %conv = sext i8 %4 to i32
  switch i32 %conv, label %coro_Suspend [
    i32 0, label %for.cond
    i32 1, label %coro_Cleanup
  ]

CoroEnd:                                          ; preds = %coro_Suspend
  %6 = call i1 @llvm.coro.end(ptr null, i1 false)
  br label %AfterCoroEnd

AfterCoroEnd:                                     ; preds = %CoroEnd
  ret ptr %1

Apparently, for.cond.Kills[Entry] is false, which means %n.addr is treated as not crossing a suspend point, so %n.addr will not be placed on the coroutine frame. But it actually should be.

If I don't understand this correctly, please point it out.

Apparently, for.cond.Kills[Entry] is false, which means %n.addr is treated as not crossing a suspend point.

IIUC, for.cond.Kills[Entry] should be true since there is a path from Entry to for.cond without repeating Entry, right?

@ChuanqiXu It took me some time to figure out where the cross-suspend-point information is used and what it does. It is used to determine whether a stack variable needs to be placed on the coroutine frame. So the current problem does not affect the correctness of coroutine programs.

Yes, it only affects optimization. So it would always be correct to mark all blocks as killed, and the analysis would be extremely fast that way.

For the patch itself, I think it would be better to improve the readability, and also to have some (maybe not so formal) proof of its correctness and precision.

@ChuanqiXu Thanks for your comments. I will improve it.

IIUC, for.cond.Kills[Entry] should be true since there is a path from Entry to for.cond without repeating Entry, right?

The definition of Kills is:

//   Kills: a bit vector which contains a set of indices of blocks that can
//          reach block 'i', but there is a path crossing a suspend point
//          not repeating 'i' (a path to 'i' without cycles containing 'i').

So for.cond.Kills[Entry] means whether there exists a path from Entry to for.cond that crosses a suspend point without repeating for.cond.

Oh, you're right. There are some problems with the definition. Let's correct it in other patches.

@ChuanqiXu I have refactored it again per your suggestions. Some redundant operations have been removed.

I have tested the folly library again with the coro tests enabled (experimental/coro/test). The results are the same before and after this patch, but some tests still fail. Here are the failed cases:

245 - heap_vector_types_test.HeapVectorTypes.GrowthPolicy (Failed)
1362 - HHWheelTimerTest.HHWheelTimerTest.FireOnce (Failed)
1366 - HHWheelTimerTest.HHWheelTimerTest.CancelTimeout (Failed)
1368 - HHWheelTimerTest.HHWheelTimerTest.SlowFast (Failed)
1369 - HHWheelTimerTest.HHWheelTimerTest.ReschedTest (Failed)
1371 - HHWheelTimerTest.HHWheelTimerTest.DefaultTimeout (Failed)
1375 - HHWheelTimerTest.HHWheelTimerTest.IntrusivePtr (Failed)
1376 - HHWheelTimerTest.HHWheelTimerTest.GetTimeRemaining (Failed)
1378 - HHWheelTimerTest.HHWheelTimerTest.Level1 (Failed)
1487 - lang_exception_test.ExceptionTest.terminate_with_direct (Failed)
1488 - lang_exception_test.ExceptionTest.terminate_with_variadic (Failed)
3016 - coro_asyncscope_test.AsyncScopeTest.DontThrowOnJoin (Failed)
3078 - coro_async_scope_test.AsyncScopeTest.DontThrowOnJoin (Failed)
3090 - coro_async_stack_test.AsyncStackTest.MixedStackWalk (SEGFAULT)

As you can see, some coro tests failed. Do you know what the problem is?

BTW, I feel the grammar of the comment reads slightly off. Could you try to improve it? I can't help much since I am not a native speaker.

As you can see, some coro tests failed. Do you know what the problem is?

I have no idea; folly is pretty complex. Maybe you can try another stable version of folly.

llvm/lib/Transforms/Coroutines/CoroFrame.cpp
43

Generally we prefer llvm::DenseSet over std::unordered_set; you can refer to the LLVM Programmer's Manual.

149

It is better to explain the meaning of the parameters too.

152

nit: we generally prefer the style.

330

nit:

331

Such a comment is meaningless.

333

So it is better to explain the meaning of *indegree*. I know the definition of in-degree, but I feel it has different semantics in this algorithm. For example, what information does an indegree of 0 represent? Why does the indegree of a node change while we don't touch the graph? So I think you'd better give it a different name and a corresponding explanation.

337–354

Same with above.

350–364

This is pretty clear. We don't need such comments.

375
389

nit

412

@ChuanqiXu Thanks a lot; these suggestions are very helpful. Sorry for the variable-name problems; I have no talent for naming.

For the folly library, I got the same result as before with llvm-project tags llvmorg-16.0.0 and llvmorg-16.0.6.

witstorm95 marked 11 inline comments as done.Jul 26 2023, 5:03 AM

Accept these suggestions.

ChuanqiXu added inline comments.Jul 26 2023, 7:22 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
151

ditto in other places.

331

Now we can understand it is a dynamic attribute instead of a static one. Then it is not odd that it may change.

393–406
407–408

@ChuanqiXu Thank you for your suggestions. I've already fixed it.

witstorm95 marked 4 inline comments as done.Jul 26 2023, 10:57 PM

Accept these suggestions.

ChuanqiXu accepted this revision.Jul 27 2023, 12:30 AM

LGTM. Thanks for your effort!

I noticed that this is your first time contributing to LLVM. If you leave your name and email address, I'll land it for you.

This revision is now accepted and ready to land.Jul 27 2023, 12:30 AM

Thanks for your review and comment.

My name and email are as follows:
Name: witstorm95
Email: witstorm@163.com

witstorm95 edited the summary of this revision. (Show Details)Jul 27 2023, 12:57 AM

I forgot to format it earlier. Sorry about that.

This revision was landed with ongoing or failed builds.Jul 27 2023, 2:29 AM
This revision was automatically updated to reflect the committed changes.
MatzeB added a subscriber: MatzeB.Jul 31 2023, 5:25 PM

We're seeing crashes in our builds where clang -flto=thin seems to produce IR that cannot be loaded anymore:

Stderr: Instruction does not dominate all uses!
  %47 = getelementptr inbounds %_ZN5folly8channels6detail22TransformProcessorBaseIN8facebook6falcon6client18FalconNotificationENS5_10BlobEntityENS5_29FalconClientEntityTransformerIS7_NS5_25FalconTransformerSettingsILb0ELb0ELb0EEEEEE13processValuesENS1_5QueueINS_3TryIS6_EEEE.Frame, ptr %0, i64 0, i32 3, i32 0, i32 1
  store i64 %70, ptr %47, align 8, !dbg !220318, !tbaa !81281
Instruction does not dominate all uses!

Unfortunately it seems to be nondeterministic (so it's hard to blame this change with 100% certainty, but given that all our errors happen with loads/stores of xxx.Frame types, it seemed likely). Unfortunately, I'm also not seeing any ASAN reports that would explain the nondeterminism.

Can't you use a reverse post-order (see PostOrderIterator.h / ReversePostOrderTraversal) to get a topological sorting of the CFG blocks? That is a simpler algorithm than the one here, and it already exists as a helper in LLVM!

With current trunk I see invalid IR produced on every 2nd or 3rd build. With this change reverted I was able to do >10 builds in a row without invalid IR. I'd like to revert this.

Unfortunately creating a reproducer is tricky given the nondeterminism, so llvm-reduce doesn't work reliably. Would it be acceptable to revert this for the time being, even without a reproducer?

I don't think I grasped the full algorithm employed here. But given that this is just a standard dataflow problem, it seems too complicated to me. I can't shake the feeling that the code would be simpler with just visiting every block once in RPOT to deal with those artificial inputs from the task, and then keep running the normal worklist algorithm until the fixpoint (= the code before this change).


So the problem arose when you applied this patch? If so, then it's a problem with this patch, and I need to find out why. Could you provide more info about this?

ChuanqiXu added a comment.EditedJul 31 2023, 6:45 PM

Would it be acceptable to revert this for the time being, even without a reproducer?

Reverted. @witstorm95, it is common to revert patches in LLVM under post-commit review (or on a crash report). Let's reland this one after we figure it out. @MatzeB, given that the patch matters for compilation speed, it would be very helpful to provide a reproducer.

So the problem arose when you applied this patch? If so, then it's a problem with this patch, and I need to find out why. Could you provide more info about this?

Unfortunately creating a reproducer will require more time, as currently I only see it happening for internal code (and reducing big codebases takes time). I should be able to give you a reproducer in the next few days.

@witstorm95, it is common to revert patches in LLVM under post-commit review (or on a crash report).

I mean that reverting is always allowed when a reproducer exists. I don't yet have a real reproducer that I can publish, so this is a judgement call / a favor to ask here. So thanks for reverting; this does help our testing! I'll see to providing a reproducer or a fix for the diff.

@MatzeB Thanks for your report. I have reproduced it by compiling the folly library with experimental/channels/test enabled. The algorithm in this patch cannot handle very complex CFGs with many loops.

Can't you use a reverse post-order (see PostOrderIterator.h / ReversePostOrderTraversal) to get a topological sorting of the CFG blocks? That is a simpler algorithm than the one here, and it already exists as a helper in LLVM!

Thanks for the reminder. I will try this.

MatzeB added a comment.Aug 1 2023, 3:28 PM

I did spend some more time on the patch yesterday. I believe the nondeterminism was triggered by the existing code in BlockToIndexMapping: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Coroutines/CoroFrame.cpp#L63 — it sorts the blocks by their memory address, and as memory addresses can fluctuate between runs (caused by ASLR etc.), iterating over the mapping results in a nondeterministic order. This doesn't matter when the algorithm always reaches the same solution regardless of order, but that seems not to have been the case after the rewrite.

I also prepared a proposal patch that uses an initial RPO pass + fixpoint worklist algorithm. I'm cleaning it up right now for submission / further discussion...

MatzeB added a comment.Aug 1 2023, 4:10 PM

Tried to show the RPO variant here: https://reviews.llvm.org/D156835

Tried to show the RPO variant here: https://reviews.llvm.org/D156835

It seems better than this patch and removes many redundant operations.