For schedule generation we assumed that the reverse post order traversal used by the domain generation is sufficient; however, it is not. Once a loop is discovered, we have to traverse it completely before we can generate the schedule for any block/region that is only reachable through a loop-exiting block. To this end, we add a "loop stack" that keeps track of loops we discovered during the traversal but have not yet traversed completely. We never visit a basic block (or region) outside the most recent (thus smallest) loop in the loop stack, but instead queue such blocks (or regions) in a waiting list. If the waiting list is not empty and might contain blocks from the most recent loop in the loop stack, the next block/region to visit is drawn from there; otherwise it is drawn from the reverse post order iterator.
Details
- Reviewers: Meinersbur, grosser
- Commits:
  - rGb6ee445f69da: Merged: https://llvm.org/svn/llvm-project/polly/trunk@259354
  - rGc2fd8b411df9: ScopInfo: Correct schedule construction
  - rPLO259354: ScopInfo: Correct schedule construction
  - rL259709: Merged: https://llvm.org/svn/llvm-project/polly/trunk@259354
  - rL259354: ScopInfo: Correct schedule construction

Diff Detail
- Repository: rL LLVM

Event Timeline
I did not see any LNT compile time changes for this patch. There were two execution time regressions, which I believe are either noise or cases where the old, wrong schedules happened to produce a better result.
Hi Johannes,
thanks for working on this. I have two in-line comments.
Tobias
lib/Analysis/ScopInfo.cpp:3472 (On Diff #43332):
This piece of code looks a little complex. At the very least it requires some in-source comments. From my point of view, this code would possibly be easier to understand if we found a way to (more explicitly) walk down the loop tree and just do the ReversePostOrderTraversal to process the statements within the loop bodies. We had such a tree walk earlier, but it was removed when you generalized the schedule construction. Would adding back such a tree walk be difficult due to the generality of the CFGs we allow?

One reason this might be difficult is if we aim to support irreducible control flow (http://llvm.org/PR25909). Because LoopInfo works with natural loops, which cannot be formed from irregular control flow, we do not correctly model irregular control flow before or after this patch, and I believe we should also not try to do so. Do you have any intention to support such control flow with your patch?

test/ScopInfo/wrong_schedule.ll:66 (On Diff #43332):
Could you drop the control flow outside the scop? Most of it seems unnecessary to reproduce the test case and makes it harder to see the actual issue when looking at the IR itself.
lib/Analysis/ScopInfo.cpp:3472 (On Diff #43332):
I could outline it and/or add a comment; is that OK? I am unsure how you want to achieve a reverse post order traversal of all blocks, while finishing loops first, more explicitly. What we had before used the Region::element_XXX() iterator, and I was not aware of any guarantees on the traversal order when you use this iterator (it does not even have a comment). If you know of any tree traversal that will guarantee such an order, we can probably use it.

I am unsure why you talk about irreducible control flow now. The patch comes with a reducible CFG test case for which we generated a plainly wrong schedule. The irreducible case will be broken as long as LoopInfo does not offer information about an irreducible loop; there is nothing I can do about that in the ScopInfo anyway (except write our own loop info ...).
Hi Johannes,
thanks for the quick reply
lib/Analysis/ScopInfo.cpp:3472 (On Diff #43332):
I do not yet have a good idea regarding the tree traversal, but would like to think about it a little. Hence, I try to understand how general the scops are that you expect to handle. Can you point me to a test case for "traverse loops (even if they are not covered by a sub-region)"?

I just talked about irreducible control flow to understand whether you plan to support it. It seems we agree that we do not want to handle it (but should somehow detect it in the scop detection). From this I conclude that we could theoretically walk over the loop info tree and then, for each loop, enumerate the basic blocks it contains (which might allow

Documentation will clearly help me to understand this issue, especially if it contains some difficult cases you thought about while implementing.
lib/Analysis/ScopInfo.cpp:3472 (On Diff #43332):
Take your time. This is a bug that is in Polly but (for some reason) does not cause too much trouble.

Generally, all CFGs should be fine iff LoopInfo and ScalarEvolution play along. As LoopInfo *breaks* for irreducible CFGs, and ScalarEvolution for some other cases (e.g., piecewise AddRecs), we should not detect them as SCoPs.

I think the example below should illustrate some interesting problems:

```
pre_header:
  store
  br loop_header

loop_header:
  store
  switch [loop_body1, loop_body2, loop_exit1, loop_exit2]

loop_body1:
  store
  br loop_header

loop_body2:
  store
  br loop_header

loop_exit1:
  store
  br after_loop

loop_exit2:
  store
  br after_loop

after_loop:
  store
  br ...
```

We have to guarantee the blocks are visited in the following order ("/" means "or"):

```
pre_header
loop_header
loop_body1/loop_body2   [not loop_exit1/loop_exit2]
loop_body1/loop_body2   [not loop_exit1/loop_exit2]
loop_exit1/loop_exit2
loop_exit1/loop_exit2
after_loop
```

While reverse post order can traverse the CFG that way, it is not unique and might choose:

```
pre_header
loop_header
loop_body1/loop_body2   [not loop_exit1/loop_exit2]
loop_exit1/loop_exit2   [BAD!]
loop_body1/loop_body2
loop_exit1/loop_exit2
after_loop
```
For reference test/ScopInfo/multiple_exiting_blocks_two_loop.ll is a test case that contains a scop with a loop that does not have a corresponding region.
Johannes, I looked through the code and I have some difficulties getting a picture of what the existing code does, both in detail and on a high level. I started to think it through, but if you could add some documentation to the current code that would clearly help.
Some questions I have:
What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?
I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following an equivalent implementation? Under which conditions can Domain become 'NULL' (I checked, and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case, and instead assert. Is this correct?
```cpp
mapToDimension(__isl_take isl_union_set *Domain, int N) {
  assert(N > 0);

  if (!Domain)
    return nullptr;
  assert(!isl_union_set_is_empty(Domain));

  struct MapToDimensionDataTy Data;
  auto *Space = isl_union_set_get_space(Domain);
  auto *PwAff = isl_union_pw_multi_aff_empty(Space);
  Data = {N, PwAff};

  auto Res = isl_union_set_foreach_set(Domain, &mapToDimension_AddSet, &Data);
  assert(Res == isl_stat_ok);

  isl_union_set_free(Domain);
  return isl_multi_union_pw_aff_from_union_pw_multi_aff(Data.Res);
}
```
How does the content of LoopSchedules change throughout the walk? Can you give some invariants? E.g., when buildSchedule returns, what does LoopSchedules contain?
It seems LSchedule can become nullptr, as the following patch asserts:
```
@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
   isl_schedule *LSchedule = LSchedulePair.first;
   unsigned NumVisited = LSchedulePair.second;
+  assert(LSchedule);
   while (L && NumVisited == L->getNumBlocks()) {
```
This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?
Maybe you can consider these questions when documenting the code. Thank you.
I can do that.
Some questions I have:
What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?
Wait. This was in there for months and is a core part of the schedule generation. Maybe we should talk about two different things here: the existing code and what I propose to change to fix it. Apparently neither is clear.
The second component of the schedule pair counts how many blocks of a loop have been visited. If this number is equal to the number of blocks in the loop, we add this loop to the schedule tree. This is similar to what we did before (in the old schedule generation) but extended to the case that the loop is not perfectly covered by regions.
I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following an equivalent implementation? Under which conditions can Domain become 'NULL' (I checked, and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case, and instead assert. Is this correct?
- Except for the first conditional, mapToDimension was implemented by you.
- In the current implementation N should always be positive.
- I am not 100% sure why the empty check was in there... maybe it is not needed anymore, maybe it occurs somewhere in LNT. If LNT is green, I'm fine with you removing that check.
```cpp
mapToDimension(__isl_take isl_union_set *Domain, int N) {
  assert(N > 0);

  if (!Domain)
    return nullptr;
  assert(!isl_union_set_is_empty(Domain));

  struct MapToDimensionDataTy Data;
  auto *Space = isl_union_set_get_space(Domain);
  auto *PwAff = isl_union_pw_multi_aff_empty(Space);
  Data = {N, PwAff};

  auto Res = isl_union_set_foreach_set(Domain, &mapToDimension_AddSet, &Data);
  assert(Res == isl_stat_ok);

  isl_union_set_free(Domain);
  return isl_multi_union_pw_aff_from_union_pw_multi_aff(Data.Res);
}
```
How does the content of LoopSchedules change throughout the walk? Can you give some invariants? E.g., when buildSchedule returns, what does LoopSchedules contain?
Initially, it is empty: nothing is in there, and LSchedulePair for every loop will be <nullptr, 0>. Then we build up isl_schedules and add the number of blocks visited, changing both components, until all blocks of a loop have been visited and its schedule is integrated into the parent loop's schedule. During the recursive construction, LoopSchedules is built almost (though not completely) "bottom up": while the schedules for the innermost loops are constructed, the schedules of outer loops can be constructed as well, but inner loops should be traversed completely before outer ones, and their schedules will then be integrated into the outer loops.
It seems LSchedule can become nullptr, as the following patch asserts:
```
@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
   isl_schedule *LSchedule = LSchedulePair.first;
   unsigned NumVisited = LSchedulePair.second;
+  assert(LSchedule);
   while (L && NumVisited == L->getNumBlocks()) {
```

This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?
Initially (when a loop is discovered) this is a nullptr, as there is no schedule for the loop yet. This nullptr is combined "in sequence" with some real schedule and then stored back in the first component of LSchedulePair.
Maybe you can consider these questions when documenting the code. Thank you.
I will add comments to the whole thing, as it is apparently not clear how schedules are actually built.
Thank you.
Some questions I have:
What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?
Wait. This was in there for months and is a core part of the schedule generation. Maybe we should talk about two different things here: the existing code and what I propose to change to fix it. Apparently neither is clear.
Right. Let's start with the old code, then the new code is most likely a lot easier to understand.
The second component of the schedule pair counts how many blocks of a loop have been visited. If this number is equal to the number of blocks in the loop, we add this loop to the schedule tree. This is similar to what we did before (in the old schedule generation) but extended to the case that the loop is not perfectly covered by regions.
I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following an equivalent implementation? Under which conditions can Domain become 'NULL' (I checked, and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case, and instead assert. Is this correct?
- Except for the first conditional, mapToDimension was implemented by you.
- In the current implementation N should always be positive.
- I am not 100% sure why the empty check was in there... maybe it is not needed anymore, maybe it occurs somewhere in LNT. If LNT is green, I'm fine with you removing that check.
LNT is green. I committed an improved version of this change in r256208.
How does the content of LoopSchedules change throughout the walk? Can you give some invariants? E.g., when buildSchedule returns, what does LoopSchedules contain?
Initially, it is empty: nothing is in there, and LSchedulePair for every loop will be <nullptr, 0>. Then we build up isl_schedules and add the number of blocks visited, changing both components, until all blocks of a loop have been visited and its schedule is integrated into the parent loop's schedule. During the recursive construction, LoopSchedules is built almost (though not completely) "bottom up": while the schedules for the innermost loops are constructed, the schedules of outer loops can be constructed as well, but inner loops should be traversed completely before outer ones, and their schedules will then be integrated into the outer loops.
OK, now I see. Adding this in the documentation would clearly be useful.
It seems LSchedule can become nullptr, as the following patch asserts:
```
@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
   isl_schedule *LSchedule = LSchedulePair.first;
   unsigned NumVisited = LSchedulePair.second;
+  assert(LSchedule);
   while (L && NumVisited == L->getNumBlocks()) {
```

This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?
Initially (when a loop is discovered) this is a nullptr, as there is no schedule for the loop yet. This nullptr is combined "in sequence" with some real schedule and then stored back in the first component of LSchedulePair.
OK. I re-factored the code slightly in r256209 to not perform any computation on the nullptr. This makes the code more readable for me.
Best,
Tobias
How does the content of LoopSchedules change throughout the walk? Can you give some invariants? E.g., when buildSchedule returns, what does LoopSchedules contain?
Initially, it is empty: nothing is in there, and LSchedulePair for every loop will be <nullptr, 0>. Then we build up isl_schedules and add the number of blocks visited, changing both components, until all blocks of a loop have been visited and its schedule is integrated into the parent loop's schedule. During the recursive construction, LoopSchedules is built almost (though not completely) "bottom up": while the schedules for the innermost loops are constructed, the schedules of outer loops can be constructed as well, but inner loops should be traversed completely before outer ones, and their schedules will then be integrated into the outer loops.
OK, now I see. Adding this in the documentation would clearly be useful.
Hi Johannes,
did you still plan to add documentation about the schedule generation that clarifies the points you explained right before Christmas?
All the best and a happy new year,
Tobias
Hi Johannes,
did you still plan to add documentation about the schedule generation that clarifies the points you explained right before Christmas?
Done in the newest version of the patch.
Hi Johannes,
thanks for the detailed comments. They clearly help to both review the new code and to understand more of the existing code. I have a couple of follow up questions inline.
Best,
Tobias
include/polly/ScopInfo.h:1410 (On Diff #44034):
Could you add one more sentence to clarify for the reader how the region tree traversal works on an abstract level? To my understanding, for each call of buildSchedule we walk over a given region @R. @R consists of RegionNodes that are either BBs or themselves (non-trivial) Regions, which are all connected through control flow possibly containing loops. By walking over the AST in reverse post-order we ensure which condition, precisely? Also, can we state something about the loops processed in a region? Do we assume reducible control flow? Can/do we exploit the additional properties this gives us?

include/polly/ScopInfo.h:1426 (On Diff #44034):
Nice. This comment clearly helps me to better understand the algorithm used. Now I need to relate the actual implementation to the above description. One thing I am not fully sure about is how the LoopSchedules map (and the new LoopStack) works. Can we give an invariant on the state of these when a recursive call to buildSchedule returns? Specifically, why do we need to return a full map of schedules? Will there possibly always be exactly one schedule in the map, such that we could change

```cpp
void buildSchedule(
    Region *R,
    DenseMap<Loop *, std::pair<isl_schedule *, unsigned>> &LoopSchedules) {
}
```

to

```cpp
isl_schedule *buildSchedule(Region *R) {
  DenseMap<Loop *, std::pair<isl_schedule *, unsigned>> LocalLoopSchedules;
  ...
}
```
lib/Analysis/ScopInfo.cpp:3448 (On Diff #44034):
This comment clarifies a question I was wondering about before. For me this would become even more clear if we split off the non-recursive from the recursive part of the code. To try if this actually works, I implemented such a patch. If you agree this is useful, I could commit it ahead of time.

lib/Analysis/ScopInfo.cpp:3458 (On Diff #44034):
queued

lib/Analysis/ScopInfo.cpp:3493 (On Diff #44034):
These different cases, iterators, and lists look confusing to me. I wonder if this code would look clearer if we just added all RNs into a work list, to which we append RNs that need to be delayed. Moving all RNs to a work list clearly slows down the common case, but I have doubts we would see this in any profile.

test/ScopInfo/wrong_schedule.ll:67 (On Diff #44034):
This patch still needs some simplifications (either manually or you could use my bugpoint patch).
Hi Johannes,
I played around with this code a little bit more and it seems we can pull out the tree traversal to remove the single-non-affine-region special case and generally simplify the code. This helps me to better understand that we _either_ walk the tree in case we have an affine-region-node or we build schedule information for a ScopStmt.
I committed a refactoring in r256931. It is different from yours, but I think it makes your point even more clear. I hope you agree.
include/polly/ScopInfo.h:1414 (On Diff #44107):
- I do not follow completely. What abstract level except "recursively in reverse post-order wrt. sub-regions" is there?
- Yes.
- That we visit the predecessors of a block before the block itself (same as for the domain generation). This is necessary since we can then add the schedule for a loop or block in sequence to what we already have and do not need to find a place to insert it.
- I do not understand. What should I state about the loops?
- LoopInfo does not work on irreducible control flow => we will currently crash or do undefined things on irreducible control flow. If LoopInfo worked (in some sense), this code would probably work too (but I do not guarantee it, as that heavily depends on the way irreducible control loops are modeled).
- None crossed my mind so far, but I do not know if there are none.

include/polly/ScopInfo.h:1430 (On Diff #44107):
Mh, invariants I can think of:
I can add them if it helps.

There is always "one active loop", thus one could rewrite it somehow to only pass the information for that one. However, we still need to pass the current schedule for that active loop, as well as the number of visited blocks, to the recursive calls, and we need to update them during the recursion. One local map plus a std::pair<isl_schedule *, unsigned> & argument should do the trick, but I do not see how this is more efficient or easier.
lib/Analysis/ScopInfo.cpp:3438 (On Diff #44107):
Done.

lib/Analysis/ScopInfo.cpp:3498 (On Diff #44107):
But a work list doesn't magically remove the different cases and iterators. The same logic is needed (if it is even that simple to decide whether we need to delay a region node in case we have not yet built the schedule).
Hi Johannes,
for our discussion, here is a code snippet that explains my worklist approach. It basically replaces the deque you already added anyway.
lib/Analysis/ScopInfo.cpp:3463 (On Diff #44107):
queued

lib/Analysis/ScopInfo.cpp:3498 (On Diff #44107):
What do you think of the following piece of code? It passes all tests for me and at least avoids the need for LastRNWaiting as well as explicit iterators.

```cpp
ReversePostOrderTraversal<Region *> RTraversal(R);
std::deque<RegionNode *> WorkList(RTraversal.begin(), RTraversal.end());

while (!WorkList.empty()) {
  RegionNode *RN = WorkList.front();
  WorkList.pop_front();

  Loop *L = getRegionNodeLoop(RN, LI);
  if (!getRegion().contains(L))
    L = OuterScopLoop;

  Loop *LastLoop = LoopStack.back();
  if (LastLoop != L) {
    if (!LastLoop->contains(L)) {
      WorkList.push_back(RN);
      continue;
    }
    LoopStack.push_back(L);
  }
```
Hi Johannes,
I just wanted to put down my latest findings here:
- The simplified code I proposed is incorrect, as the elements we push back are pushed to the end of the worklist, which will result in an invalid work list when leaving the loop.
- I believe your patch makes the LoopSchedules map unnecessary, as we now always process a stack of loops. Consequently, we could use a single stack of type:

```cpp
struct BuildScheduleInfo {
  Loop *L;
  isl_schedule *Schedule;
  unsigned NumBBsProcessed;
};

SmallVectorImpl<struct BuildScheduleInfo *> &LoopStack
```

instead of the LoopSchedules map. I think using such a stack is preferable, as it indicates nicely that there is only one active loop at a time.
After having thought about this code now for a while I think I understood all the corner cases. Thanks for your patience!