Page MenuHomePhabricator

[FIX] Schedule generation PR25879
ClosedPublic

Authored by jdoerfert on Dec 20 2015, 9:16 AM.

Details

Summary
For schedule generation we assumed that the reverse post order
traversal used by the domain generation is sufficient, however it is
not. Once a loop is discovered we have to completely traverse it
before we can generate the schedule for any block/region that is only
reachable through a loop exiting block.

To this end, we add a "loop stack" that will keep track of loops we
discovered during the traversal but have not yet traversed completely.
We will never visit a basic block (or region) outside the most recent
(thus smallest) loop in the loop stack but instead queue such blocks
(or regions) in a waiting list. If the waiting list is not empty and
(might) contain blocks from the most recent loop in the loop stack the
next block/region to visit is drawn from there, otherwise from the
reverse post order iterator.

Diff Detail

Repository
rL LLVM

Event Timeline

jdoerfert updated this revision to Diff 43332.Dec 20 2015, 9:16 AM
jdoerfert retitled this revision from to [FIX] Schedule generation PR25879.
jdoerfert added reviewers: grosser, Meinersbur.
jdoerfert updated this object.
jdoerfert added a subscriber: Restricted Project.

I did not see any lnt compile time changes for this patch. There were two execution time regression which I believe are either noise or wrong schedules that caused a better result.

grosser edited edge metadata.Dec 21 2015, 12:10 AM

Hi Johannes,

thanks for working on this. I have some two in-line comments.

Tobias

lib/Analysis/ScopInfo.cpp
3472 ↗(On Diff #43332)

This piece of code looks a little complex. At the very least it requires some in-source comments.

From my point of view, this code would possibly be easier to understand if we find a way to somehow (more explicitly) walk down the loop tree and just do the ReversePostOrderTraversal to process the statements within the loop bodies. We had such a tree walk already earlier, but it was removed when you generalized the schedule construction. Would adding back such a tree walk be difficult due to the generality of the CFGs we allow?

One of the reasons this might be difficult is if we aim to support irreducible control flow (http://llvm.org/PR25909). Due to LoopInfo working with natural loops which can not be formed from irregular control flow, we do not correctly model irregular control flow before and after this patch and I believe we should also not try to do so. Do you have any intentions to support such control flow with your patch?

test/ScopInfo/wrong_schedule.ll
66 ↗(On Diff #43332)

Could you drop the control flow outside the scop? Most of it seems unnecessary to reproduce the test case and makes it harder to see the actual issue when looking at the IR itself.

jdoerfert added inline comments.Dec 21 2015, 3:07 AM
lib/Analysis/ScopInfo.cpp
3472 ↗(On Diff #43332)

I could outline it and/or add a comment, is that OK? I am unsure how your want to achieve a reverse post order traversal of all blocks while staying in loops first more explicitly. What we had used the Region::element_XXX() iterator. I was not aware of any guarantees on the traversal order when you use this iterator (it does not even have a comment). If you know of any tree traversal that will

  1. traverse blocks in reverse post order
  2. traverse loops (even if they are not covered by a sub-region) first once the header was visisted.

we probably use it.

I am unsure why you talk about irreducible control flow now. The patches comes with an reducible CFG test case for which we generated a plain wrong schedule. The irreducible case will be broken as long as LoopInfo does not offer information about a irreducible loop, there is nothing I can do about that in the ScopInfo anyway (except write an own loop info ...).

Hi Johannes,

thanks for the quick reply

lib/Analysis/ScopInfo.cpp
3472 ↗(On Diff #43332)

I do not yet have a good idea regarding the tree traversal, but would like to think about it a little.

Hence, I try to understand how general the scops are that you expect to handle. Can you point me to a test case for "traverse loops (even if they are not covered by a sub-region)"?

I just talked about irreducible control flow to understand if you plan to support it. It seems we agree that we do not want to handle it (but somehow detect it in the scop detection). From this I conclude that we could theoretically walk over the loop info tree and then for each loop enumerate the basic block it contains (which might allow
us to structure this without introducing a worker list). Now, the issue you point out is that we need to iterate over the blocks in a given order. I have no solution yet, but will think about it later today.

Documentation will clearly help me to understand this issue. Especially, if it would contain some difficult cases you thought about while implementing.

jdoerfert added inline comments.Dec 21 2015, 4:47 AM
lib/Analysis/ScopInfo.cpp
3472 ↗(On Diff #43332)

Take your time. This is a bug that is in Polly but (for some reason) does not cause to much trouble.

Generally all CFGs should be fine iff LoopInfo and ScalarEvolution play along. As LoopInfo *breaks* for irreducible CFGs and ScalarEvolution for some other cases (e.g., piecewiese AddRecs) we should not detetect them as SCoPs.

I think the example below should illustrate some interstring problems:

pre_header:
  store
  br loop_header

loop_header:
  store
  switch [loop_body1, loop_body2, loop_exit1, loop_exit2]

loop_body1:
  store
  br loop_header

loop_body2:
  store
  br loop_header

loop_exit1:
  store
  br after_loop

loop_exit2:
  store
  br after_loop

after_loop
  store
  br ...

We have to guarantee the blocks are visited in the following order (/ means "or"):

pre_header
loop_header
loop_body1/loop_body2         [not loop_exit1/loop_exit2]
loop_body1/loop_body2         [not loop_exit1/loop_exit2]   
loop_exit1/loop_exit2
loop_exit1/loop_exit2
after_loop

While reverse post order can traverse the CFG that way, it is not unique and might choose:

pre_header
loop_header
loop_body1/loop_body2         [not loop_exit1/loop_exit2]
loop_exit1/loop_exit2         [BAD!]
loop_body1/loop_body2
loop_exit1/loop_exit2
after_loop

For reference test/ScopInfo/multiple_exiting_blocks_two_loop.ll is a test case that contains a scop with a loop that does not have a corresponding region.

Johannes, I looked through the code and I have some difficulties to get a picture of what the exiting code does both in detail but also on a high-level. I started to think through it, but if you could add some documentation to the current code that would clearly help.

Some questions I have:

What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?

I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following a equivalent implementation? Under which conditions can Domain become 'NULL' (I checked and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case and instead assert. Is this correct?

mapToDimension(__isl_take isl_union_set *Domain, int N) {                        
  assert(N > 0);                                                                 
                                                                                 
  if (!Domain)                                                                   
    return nullptr;                                                              
                                                                                 
  assert(!isl_union_set_is_empty(Domain));                                       
                                                                                 
  struct MapToDimensionDataTy Data;                                              
                                                                                 
  auto *Space = isl_union_set_get_space(Domain);                                 
  auto *PwAff = isl_union_pw_multi_aff_empty(Space);                             
                                                                                 
  Data = {N, PwAff};                                                             
                                                                                 
  auto Res = isl_union_set_foreach_set(Domain, &mapToDimension_AddSet, &Data);   
                                                                                 
  assert(Res == isl_stat_ok);                                                    
                                                                                 
  isl_union_set_free(Domain);                                                    
  return isl_multi_union_pw_aff_from_union_pw_multi_aff(Data.Res);               
}

How does the content of LoopSchedules change throught the walk? Can you give some invariant. E.g. when buildSchedule returns, what does LoopSchedules contain?

It seems LSchedule can become nullptr, as the following patch asserts:

@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
 
     isl_schedule *LSchedule = LSchedulePair.first;
     unsigned NumVisited = LSchedulePair.second;
+    assert(LSchedule);
     while (L && NumVisited == L->getNumBlocks()) {

This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?

Maybe you can consider these questions when documenting the code. Thank you.

For reference test/ScopInfo/multiple_exiting_blocks_two_loop.ll is a test case that contains a scop with a loop that does not have a corresponding region.

Johannes, I looked through the code and I have some difficulties to get a picture of what the exiting code does both in detail but also on a high-level. I started to think through it, but if you could add some documentation to the current code that would clearly help.

I can do that.

Some questions I have:

What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?

Wait. This was in there for months and is core part of the schedule generation. Maybe We should talk about two different things here. The existing code and what I propose to change to fix it. Apparently neither is clear.

The second component of the schedule pair counts how many blocks of a loop have been visited. If this number is equal to the number of blocks in the loop we add this loop to the schedule tree. This is similar to what we did before (in the old schedule generation) but extended to the case that the loop is not perfectely covered by regions.

I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following a equivalent implementation? Under which conditions can Domain become 'NULL' (I checked and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case and instead assert. Is this correct?

  1. Except the first conditional the mapToDimensions was implemented by you.
  2. In the current implementation N should always be positive.
  3. I am not 100% sure why the empty check was in there... maybe it is not needed anymore, maybe it occures somewhere in lnt. If lnt is green i'm fine with you removing that check.
mapToDimension(__isl_take isl_union_set *Domain, int N) {                        
  assert(N > 0);                                                                 
                                                                                 
  if (!Domain)                                                                   
    return nullptr;                                                              
                                                                                 
  assert(!isl_union_set_is_empty(Domain));                                       
                                                                                 
  struct MapToDimensionDataTy Data;                                              
                                                                                 
  auto *Space = isl_union_set_get_space(Domain);                                 
  auto *PwAff = isl_union_pw_multi_aff_empty(Space);                             
                                                                                 
  Data = {N, PwAff};                                                             
                                                                                 
  auto Res = isl_union_set_foreach_set(Domain, &mapToDimension_AddSet, &Data);   
                                                                                 
  assert(Res == isl_stat_ok);                                                    
                                                                                 
  isl_union_set_free(Domain);                                                    
  return isl_multi_union_pw_aff_from_union_pw_multi_aff(Data.Res);               
}

How does the content of LoopSchedules change throught the walk? Can you give some invariant. E.g. when buildSchedule returns, what does LoopSchedules contain?

Initially, it is empty. Nothing is in there and LSchedulePair for every loop will be <nullptr, 0>. Then we will build up isl_schedules and add the number of blocks visited to change both components until all blocks of a loop have been visited and its schedule is integrated into the parent loops schedule. During the recursive construction LoopSchedules is constructed not completely but almost "bottom up". While the schedules for the inner-most loops are constructed the schedule of outer loops can be constructed as well but inner loops should be traversed completely before outer ones and their schedule will then be integrated into the outer loop.

It seems LSchedule can become nullptr, as the following patch asserts:

@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
 
     isl_schedule *LSchedule = LSchedulePair.first;
     unsigned NumVisited = LSchedulePair.second;
+    assert(LSchedule);
     while (L && NumVisited == L->getNumBlocks()) {

This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?

Initially (when a loop is discovered) this is a nullptr as there is no schedule for the loop yet. This nullptr is combined "in sequence" with some real schedule and then stored back in the first component of LSchedule.

Maybe you can consider these questions when documenting the code. Thank you.

I will add comments to the whole thing as it is apperently not clear how schedules are actually build.

For reference test/ScopInfo/multiple_exiting_blocks_two_loop.ll is a test case that contains a scop with a loop that does not have a corresponding region.

Johannes, I looked through the code and I have some difficulties to get a picture of what the exiting code does both in detail but also on a high-level. I started to think through it, but if you could add some documentation to the current code that would clearly help.

I can do that.

Thank you.

Some questions I have:

What is LSchedulePair.second/NumVisited used for? It seems when the condition NumVisited == L->getNumBlocks() holds you assume all basic blocks of the loop have been processed. Then, you can start to walk up the loop tree and add the other loop dimensions?

Wait. This was in there for months and is core part of the schedule generation. Maybe We should talk about two different things here. The existing code and what I propose to change to fix it. Apparently neither is clear.

Right. Let's start with the old code, then the new code is most likely a lot easier to understand.

The second component of the schedule pair counts how many blocks of a loop have been visited. If this number is equal to the number of blocks in the loop we add this loop to the schedule tree. This is similar to what we did before (in the old schedule generation) but extended to the case that the loop is not perfectely covered by regions.

I tried to understand the implementation of mapToDomain, which is simple in general but seems to do some additional stuff for certain corner cases that do not occur in our tests. Is the following a equivalent implementation? Under which conditions can Domain become 'NULL' (I checked and it indeed becomes NULL)? I removed the check for isl_union_set_is_empty, as this never happens in any test case and instead assert. Is this correct?

  1. Except the first conditional the mapToDimensions was implemented by you.
  2. In the current implementation N should always be positive.
  3. I am not 100% sure why the empty check was in there... maybe it is not needed anymore, maybe it occures somewhere in lnt. If lnt is green i'm fine with you removing that check.

LNT is green. I committed an improved version of this change in r256208.

How does the content of LoopSchedules change throught the walk? Can you give some invariant. E.g. when buildSchedule returns, what does LoopSchedules contain?

Initially, it is empty. Nothing is in there and LSchedulePair for every loop will be <nullptr, 0>. Then we will build up isl_schedules and add the number of blocks visited to change both components until all blocks of a loop have been visited and its schedule is integrated into the parent loops schedule. During the recursive construction LoopSchedules is constructed not completely but almost "bottom up". While the schedules for the inner-most loops are constructed the schedule of outer loops can be constructed as well but inner loops should be traversed completely before outer ones and their schedule will then be integrated into the outer loop.

OK, now I see. Adding this in the documentation would clearly be useful.

It seems LSchedule can become nullptr, as the following patch asserts:

@@ -3464,6 +3468,7 @@ void Scop::buildSchedule(
 
     isl_schedule *LSchedule = LSchedulePair.first;
     unsigned NumVisited = LSchedulePair.second;
+    assert(LSchedule);
     while (L && NumVisited == L->getNumBlocks()) {

This is a little surprising. What does it mean if LSchedule is nullptr? When does this happen?

Initially (when a loop is discovered) this is a nullptr as there is no schedule for the loop yet. This nullptr is combined "in sequence" with some real schedule and then stored back in the first component of LSchedule.

OK. I re-factored the code slightly in r256209 to not perform any computation on the nullptr. This makes the code more
readable for me.

Best,
Tobias

How does the content of LoopSchedules change throught the walk? Can you give some invariant. E.g. when buildSchedule returns, what does LoopSchedules contain?

Initially, it is empty. Nothing is in there and LSchedulePair for every loop will be <nullptr, 0>. Then we will build up isl_schedules and add the number of blocks visited to change both components until all blocks of a loop have been visited and its schedule is integrated into the parent loops schedule. During the recursive construction LoopSchedules is constructed not completely but almost "bottom up". While the schedules for the inner-most loops are constructed the schedule of outer loops can be constructed as well but inner loops should be traversed completely before outer ones and their schedule will then be integrated into the outer loop.

OK, now I see. Adding this in the documentation would clearly be useful.

Hi Johannes,

did you still plan to add documentation about the schedule generation that clarifies the points you explained right before Christmas.

All the best and a happy new year,
Tobias

jdoerfert updated this revision to Diff 44034.Jan 5 2016, 11:43 AM
jdoerfert edited edge metadata.

Added documentation

Hi Johannes,

did you still plan to add documentation about the schedule generation that clarifies the points you explained right before Christmas.

Done in the newest version of the patch.

Hi Johannes,

thanks for the detailed comments. They clearly help to both review the new code and to understand more of the existing code. I have a couple of follow up questions inline.

Best,
Tobias

include/polly/ScopInfo.h
1410 ↗(On Diff #44034)

Could you add one more sentence to clarify for the reader how the region tree traversal works on an abstract level?

To my understanding for each call of buildSchedule we walk over a given region @R. @R consists of RegionNodes that are either BBs or itself (non-trivial) Regions, which are all connected through control flow possibly containing loops. By walking over the AST in reverse-post-order we ensure which condition precisely?

Also, can we state something about the loops processed in a region. Do we assume reducible control flow? Can/do we exploit the additional properties this gives us?

1426 ↗(On Diff #44034)

Nice. This comment clearly helps me to better understand the algorithm used. Now I need to relate the actual implementation to the above description.

One thing I am not fully sure how it works is the LoopSchedules map (and the new LoopStack). Can we give an invariant on the state of these when a recursive call to buildSchedule returns? Specifically, why do we need to return a full map of schedules? Will there possibly be always exactly one schedule in the map
that is returned, e.g. the one for the current region R? Or is there an inherent reason (besides irreducible control flow) that prevents us from using function local state.

void buildSchedule(
      Region *R, 
      DenseMap<Loop *, std::pair<isl_schedule *, unsigned>> &LoopSchedules) {

}

to

isl_schedule *buildSchedule(Region *R) {
  DenseMap<Loop *, std::pair<isl_schedule *, unsigned>>
    &LocalLoopSchedules; ...
}
lib/Analysis/ScopInfo.cpp
3448 ↗(On Diff #44034)

This comment clarifies a question I was wondering about before. For me this would become even more clear if we split off the non-recursive from the recursive part of the code. To try how if this actually works, I implemented such a patch (

). If you agree this is useful I could commit it ahead of time.

3458 ↗(On Diff #44034)

queued

3493 ↗(On Diff #44034)

These different cases, iterators and lists look confusing to me. I wonder if this code would look more clear, if we would just add all RNs into a work list, to which we append RNs that need to be delayed? Moving all RNs to a work-list clearly slows down the common case, but I have doubts we see this in any profile.

test/ScopInfo/wrong_schedule.ll
67 ↗(On Diff #44034)

This patch still needs some simplifications (either manually or you could use my bugpoint patch).

Hi Johannes,

I played around with this code a little bit more and it seems we can pull out the tree traversal to remove the single-non-affine-region special case and generally simplify the code. This helps me to better understand that we _either_ walk the tree in case we have an affine-region-node or we build schedule information for a ScopStmt.

jdoerfert updated this revision to Diff 44107.Jan 6 2016, 5:04 AM

Rebase after refactoring

Hi Johannes,

I played around with this code a little bit more and it seems we can pull out the tree traversal to remove the single-non-affine-region special case and generally simplify the code. This helps me to better understand that we _either_ walk the tree in case we have an affine-region-node or we build schedule information for a ScopStmt.

I commited a refactoring in r256931. It is different to your's but I think it makes your point even more clear. I hope you agree.

jdoerfert marked 2 inline comments as done.Jan 6 2016, 5:30 AM
jdoerfert added inline comments.
include/polly/ScopInfo.h
1414 ↗(On Diff #44107)

Could you add one more sentence to clarify for the reader how the region tree traversal works on an abstract level?

I do not follow completely. What abstract level except "recursively in reverse post-order wrt. sub-regions" is there?

To my understanding for each call of buildSchedule we walk over a given region @R. @R consists of RegionNodes that are either BBs or itself non-trivial) Regions, which are all connected through control flow possibly containing loops.

Yes.

By walking over the AST in reverse-post-order we ensure which condition precisely?

That we visit predecessors of a block before the block itself (same as for domain generation). This is necessary since we can then add the schedule for a loop or block in sequence to what we already have and do not need to find a place where to insert it.

Also, can we state something about the loops processed in a region.

I do not understand. What should I state about the loops?

Do we assume reducible control flow?

LoopInfo does not work on irreducible control => we will currently crash or do undefined things on irreducible control. If LoopInfo would work (in some sence) this code would probably work too (but I do not give a gurarantee as it heavily depends on the way irreducible control loops are modeled.).

Can/do we exploit the additional properties this gives us?

None crossed my mind so far but I do not know if there are none.

1430 ↗(On Diff #44107)

One thing I am not fully sure how it works is the LoopSchedules map (and the new LoopStack). Can we give an invariant on the state of these when a recursive call to buildSchedule returns?

Mh, invariants I can think of:

  • LoopStack, contains loops that are contained in each other, hence LoopStack[i]->getParentLoop() == LoopStack[i+1] (if there are at least i+1 elements).
  • If a loop was never in the LoopStack it is not mapped in the LoopSchedules map (except the loop that surrounds the SCoP, or nullptr if none).
  • If a loop is in the LoopStack it isl_schedule in the LoopSchedules is valid and the number of visited blocks in that loop is less than the number of blocks.
  • If a loop leaves the LoopStack its mapping in the LoopSchedules becomes invalid, thus the isl_schedule ptr is dangling and the second component (number of visited blocks) is equal to the number of blocks in the loop.

I can add them if it helps.

Specifically, why do we need to return a full map of schedules? Will there possibly be always exactly one schedule in the map that is returned, e.g. the one for the current region R? Or is there an inherent reason (besides irreducible control flow) that prevents us from using function local state.

There is always "one active loop", thus one could rewrite it somehow to only pass the information for that one. However, we still need to pass the current schedule for that active loop as well as the number of visited blocks to the recursive calls and we need to update them during the recursion. One local map + std::pair<isl_schedule, unsigned>& argument should do the trick but I do not see how this is more efficent or easier.

lib/Analysis/ScopInfo.cpp
3438 ↗(On Diff #44107)

Done.

3498 ↗(On Diff #44107)

But a workl list doesn't magically remove the different cases and iterators. The same logic is needed (if it is even that simple to decide if we need to delay a region node in case we do not yet build the schedule).

Hi Johannes,

for our discussion, here a code snippet that explains my worklist approach. It basically replaces the deque you anyway already added.

lib/Analysis/ScopInfo.cpp
3463 ↗(On Diff #44107)

queued

3498 ↗(On Diff #44107)

What do you think of the following piece of code? It passes for me all tests and at least avoids the need for LastRNWaiting as well as explicit iterators.

ReversePostOrderTraversal<Region *> RTraversal(R);                             
std::deque<RegionNode *> WorkList(RTraversal.begin(), RTraversal.end());       
                                                                               
while (!WorkList.empty()) {                                                    
  RegionNode *RN = WorkList.front();                                           
  WorkList.pop_front();                                                        
                                                                               
  Loop *L = getRegionNodeLoop(RN, LI);                                         
  if (!getRegion().contains(L))                                                
    L = OuterScopLoop;                                                         
                                                                               
  Loop *LastLoop = LoopStack.back();                                           
  if (LastLoop != L) {                                                         
    if (!LastLoop->contains(L)) {                                              
      WorkList.push_back(RN);                                                  
      continue;                                                                
    }                                                                          
    LoopStack.push_back(L);                                                    
  }

Hi Johannes,

I just wanted to put down my latest findings here:

  1. The simplified code I proposed is incorrect, as the elements we push back are pushed back to the end of the worklist which will result in an invalid work list when leaving the loop.
  1. I believe your patch makes the LoopSchedules map unnecessary. As we now always process a stack of loops. Consequently, we could use a single stack of type:
struct BuildScheduleInfo {
  Loop *L;
  isl_schedule *Schedule;
  unsigned *NumBBsProcessed;
}
SmallVectorImpl<struct BuildScheduleInfo*> &LoopStack

instead of the LoopSchedules map. I think using such a stack is preferable, as as it indicates nicely that there is only one active loop at a time.

After having thought about this code now for a while I think I understood all the corner cases. Thanks for your patience!

This revision was automatically updated to reflect the committed changes.