This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Allow PHI nodes in exit blocks
AbandonedPublic

Authored by Meinersbur on Aug 8 2015, 11:29 AM.

Details

Reviewers
grosser
Summary

In case there are dependences from multiple scalars inside the region to a single PHI node in the exit block of the region, it is unclear which value needs to be stored in the PHI node.

Example:

bb:
  br %cond, stmtA, stmt B,

stmtA:
  br merge

stmtB:
  br merge

merge:
  %val = phi [%valA, %stmtA], [%valB, stmtB]
  ret %val

For the region (bb -> merge), the value that needs to be stored in %val is %valA, if the PHI node is reached through %stmtA, and %valB, if the PHI node is reached through %stmtB. Hence, at code generation time we can not just store the value of a specific scalar in the PHI node, but need to "choose" between the values of multiple scalars. For this we use the same approach as we use for PHI nodes in general, we add virtual writes to a PHI-node location and then read back the original value in the exiting block.

This patch removes the check in ScopDetection for PHI nodes in the exit node and teaches TempScopInfo to generate access information for them. The PHI nodes are later found and handled by CodeGeneration to reload the writes into a scalar.

Modification of the IR before CodeGeneration is avoided because we do not want the IR changed if there are Scop-level optimizations to be applied or we do analysis only.

Diff Detail

Event Timeline

Meinersbur updated this revision to Diff 31589.Aug 8 2015, 11:29 AM
Meinersbur retitled this revision from to [Polly] Make TempScopInfo handle PHI nodes in exit block.
Meinersbur updated this object.
Meinersbur added a reviewer: grosser.
Meinersbur added a project: Restricted Project.
Meinersbur added a subscriber: pollydev.

This all seems like a really huge hack in every level of Polly just to allow these PHI nodes. If we really want them in the region why not split the basic block first and change the region tree? That transformation is easily reversable and would remove code from all main Polly parts

This all seems like a really huge hack in every level of Polly just to allow these PHI nodes. If we really want them in the region why not split the basic block first and change the region tree? That transformation is easily reversable and would remove code from all main Polly parts

This is how I already tried to argue with Tobias. There is actually code in IndependentBlock that does this already, but never called.

According Tobias we are actively working towards making no changes to the IR before code generation, in case Polly is used for analysis only.

This all seems like a really huge hack in every level of Polly just to allow these PHI nodes. If we really want them in the region why not split the basic block first and change the region tree? That transformation is easily reversable and would remove code from all main Polly parts

This is how I already tried to argue with Tobias. There is actually code in IndependentBlock that does this already, but never called.

According Tobias we are actively working towards making no changes to the IR before code generation, in case Polly is used for analysis only.

Ok, some thoughts from me:

  1. I would argue that we can modify the CFG if it is trivially reversable.
  2. I am still not sure why we need the PHI nodes in the region. Why not make code gen able to handle PHI nodes in the exit? Is there a benefit in this way I do not see or a major problem in the other one?
  3. I would like good reasons to change 4-5 layers of Polly code and add special cases all over the place.

Can we discuss this on wednesday or in a thread?

grosser edited edge metadata.Aug 9 2015, 1:50 PM

First, thank you Michael for pushing all these changes out. I did not yet manage to
look into all of them yet, but will do so tomorrow (evening?).

Hi Johannes,

thanks for joining the discussion. Michael and me discussed this topic already on the phone, but it would probably have better to either do this via email or to send out meeting notes at least. Sorry for having missed this.

Let me first give you the missing context.

Michael found two weeks ago that Polly does not detect a lot of SCoPs if executed late in the pass pipeline and that in many cases Polly bails out due to PHI nodes in the exit basic blocks. The reason why there are so many PHI nodes in exit basic blocks is that LLVM's -lcssa pass introduces PHI nodes after loops for each value that is defined inside a loop and used after.

For example:

define i64 @foo() {                                                              
start:                                                                           
  br label %loop                                                                 
                                                                                 
loop:                                                                            
  %indvar = phi i64 [0, %start], [%indvar.next, %loop]                           
  %indvar.next = add i64 %indvar, 1                                              
  %val = add i64 %indvar, %indvar                                                
  %cmp = icmp eq i64 %indvar.next, 100                                           
  br i1 %cmp, label %loop, label %end                                            
                                                                                 
end:                                                                             
  ret i64 %val                                                                   
}

opt -lcssa

define i64 @foo() {
start:
  br label %loop

loop:                                             ; preds = %loop, %start
  %indvar = phi i64 [ 0, %start ], [ %indvar.next, %loop ]
  %indvar.next = add i64 %indvar, 1
  %val = add i64 %indvar, %indvar
  %cmp = icmp eq i64 %indvar.next, 100
  br i1 %cmp, label %loop, label %end

end:                                              ; preds = %loop
  %val.lcssa = phi i64 [ %val, %loop ]
  ret i64 %val.lcssa
}

As you see we now have a new %val.lcssa phi node in the code, that prevents us from detecting the SCoP.

This is case 1) of PHI-node in exit node. It has the important property that only a _single_ edge from inside the region goes into the PHI node into the exit node.

However, there _may_ also be cases where more than one edge goes from inside
the scop to the exit block.

define @foo() {

entry:
   %X
   br label %scop_start

scop_start
   br i1 %cond, label %stmtA, label %stmtB

stmtA:
  %A
  br label %exit

stmtB
  %B
  br label %exit

exit:
  %val = phi i64 [%A, %stmtA], [%B, %stmtB], [%X, %entry]
  ret i64 %val

}

This case 2) is more complicated as the value that needs to be written into %val, can be coming from multiple places from inside the SCoP and to my understanding the main reason for the changes TempScop and ScopInfo is the need to model certain data-dependences to support this second case.

Now, are there alternatives

A) Split the block ``before'' Polly

A) Do not handle case 2), but just focus on case 1)

  1. I would argue that we can modify the CFG if it is trivially reversable.

Both of you already suggested option A). For option A) we need to realize that we can actually not split before Polly, but only in between ScopDetection and TempScop/ScopInfo. At an earlier point we had a large set of transformations in between the two passes and unfortunately some of them triggered surprising bugs due to them invalidating ScopDetection unexpectedly. (I remember that some seemingly unrelated IR changes invalided the BasicBlock alias analysis). To be fully save, we now _always_ rerun ScopDetection before TempScopInfo to ensure the SCoP we look at is still valid. At some point, I would like to get rid of this additional verification pass and the easiest way to get confidence that nothing breaks is to just _not_ touch anything in between. (There are also issues with multiple scops that affect each other, but this may possibly be resolved with an updated detection->modeling->transformation order when we are moving to the new pass manager). Another reason for avoiding transformations is that I would like Polly to be usable as an analysis only. Many of the concepts in the pass manager rely on analysis to not touch the IR and I am sure chandler is going to exploit a lot more of this freedom in the new pass manager. Hence, I would like to not introduce a hack that may cause trouble (or debugging costs) later on.

I am still not sure why we need the PHI nodes in the region. Why not make code gen able to handle PHI nodes in the exit? Is there a benefit in this way I do not see or a major problem in the other one?

Support for case 2) is why this can not be handled in codegen only. I did not find the time to look through all the
patches, but this does not seem to be clear from the proposed commit messages at least.

I would like good reasons to change 4-5 layers of Polly code and add special cases all over the place.

Right. When Michael and me discussed this two weeks ago, I was also concerned about the amount of changes and the additional code complexity. My suggestion was at this point to just focus on case 1) which seems to be the common case and only to worry about case 2) if we find cases where supporting it actually matters. To
my understanding we did not yet find such cases, but Michael was convinced he can handle both cases without introducing too much complexity such that even
without such supporting cases it is better to just implement it once and for all
rather than to wait for use cases that make it actually necessary.

I did not yet manage to look at the patches in detail, but the change seems to be at least non-trivial now. Michael, did you happen to come across a motivating example
that would justify adding support for case 2)?

AFAIU Johannes would have preferred to have this conversation at the weekly phone conference, but here we go...

...
As you see we now have a new %val.lcssa phi node in the code, that prevents us from detecting the SCoP.

This is case 1) of PHI-node in exit node. It has the important property that only a _single_ edge from inside the region goes into the PHI node into the exit node.

This is because there is an LCSSA pass before and LICM tries to preserve that property. I.e. There will nearly always be PHI nodes in the exit loop if the top-most element is a loop.

However, there _may_ also be cases where more than one edge goes from inside
the scop to the exit block.

During my first experiments with LICM I saw them very often. They nearly always occur if there is a reduction-to-scalar in the scop. In my first designs on de-LICM I did not consider that a loop body may read and write the same value, which typically occurs in reductions. During a Wednesday's phone call Tobias presented me several code snippets which I should consider and show exactly this behavior.

A typical example is a simple reduction:

for (size_t i = 0; i<n; i+=1)
  sum += A[i];
use(sum);

Polly has functionality for detecting reductions. If we say that we do not support such kind of loops because of multi-incoming-edge in the exit node, reduction detection becomes somewhat useless.

Additionally, it is very difficult to users of Polly to explain why Polly does not recognize their loop.

This case 2) is more complicated as the value that needs to be written into %val, can be coming from multiple places from inside the SCoP and to my understanding the main reason for the changes TempScop and ScopInfo is the need to model certain data-dependences to support this second case.

Now, are there alternatives

A) Split the block ``before'' Polly

A) Do not handle case 2), but just focus on case 1)

  1. I would argue that we can modify the CFG if it is trivially reversable.

[...]. At an earlier point we had a large set of transformations in between the two passes and unfortunately some of them triggered surprising bugs due to them invalidating ScopDetection unexpectedly. (I remember that some seemingly unrelated IR changes invalided the BasicBlock alias analysis). To be fully save, we now _always_ rerun ScopDetection before TempScopInfo to ensure the SCoP we look at is still valid. At some point, I would like to get rid of this additional verification pass and the easiest way to get confidence that nothing breaks is to just _not_ touch anything in between. [...]

Isn't CodeGeneration the larger problem? We cannot make it not modify the code and multiple CodeGenerations have to be serialized between each other, so one of them may touch the other scop.

I am still not sure why we need the PHI nodes in the region. Why not make code gen able to handle PHI nodes in the exit? Is there a benefit in this way I do not see or a major problem in the other one?

Support for case 2) is why this can not be handled in codegen only. I did not find the time to look through all the
patches, but this does not seem to be clear from the proposed commit messages at least.

A CodeGen-only solution cannot handle the situation with two incoming edges. For the passes before the use of the PHI node is just an exposed value, since used outside the region. To select which of values the PHI node will have we also need to know which exiting edge it was using. This information is not preserved in the scop.

I would like good reasons to change 4-5 layers of Polly code and add special cases all over the place.

Right. When Michael and me discussed this two weeks ago, I was also concerned about the amount of changes and the additional code complexity. My suggestion was at this point to just focus on case 1) which seems to be the common case and only to worry about case 2) if we find cases where supporting it actually matters. To
my understanding we did not yet find such cases, but Michael was convinced he can handle both cases without introducing too much complexity such that even
without such supporting cases it is better to just implement it once and for all
rather than to wait for use cases that make it actually necessary.

I did not yet manage to look at the patches in detail, but the change seems to be at least non-trivial now. Michael, did you happen to come across a motivating example
that would justify adding support for case 2)?

I have multiple arguments to this point:

  • CodeGeneration is already complicated enough. This approach does not change _anything_ in BlockGenerator. Changes there would have to consider all the generators (Scalar, Vector, (GPU?)), making such a solution even harder.
  • The largest change (http://reviews.llvm.org/D11867) is because simplifyRegion and executeScopConditionally did not properly update RegionInfo. They did not care for setRegionFor sometimes did not always set the correct region entries/exits or made assumptions that do not always hold (There can be PHI nodes in blocks with a single incoming edge). Sorry, I have no test cases; those were usually only triggered in preliminary modifications, e.g. I added more invocations of RegionInfo::verifyAnalysis. I requested adding a check of RegionInfo::getRegionInfo to the verification, so these changes are necessary in every case.
  • I would not argue in number of "layers" touched to evaluate the complexity a patch. Renaming llvm::BasicBlock to something else would touch all the layers but does not change the complexity at all. Unless the solution we come up with is to reinstate IndependentBlock::splitExitBlock, there is no way around modifying ScopDetection. It is the very same preliminary patch from Johannes he sent me to look at and he proposes to change CodeGeneration himself. Leaves 2 passes that this patch set modifies: ScopInfo and TempScopInfo (Which will be just one because I am already working on merging them).
  • I never argued that any change would be "trivial". AFAI remember it was something like "not too complicated".
  • I'd also argue that this change is not too complicated. All it does is "redefining" the PHI nodes of the exit block to be inside the regions. Since there was no abstraction for that yet, the individual places had to be changed. There are only 5 of them. All of them follow this very same idea. In this sense there are no "special cases" added, there is just a redefinition which is easy to understand.

Adding handling for PHI nodes in the exit block in BlockGenerator would be a special case only for such PHI nodes. In this patch we make the existing code handle the new situation.

  • IMHO supporting only case 1) but not case 2) would be an ugly workaround of the symptom. I'd very much prefer to make the problem go away altogether.
  • The examples (all -O3 -polly-position=before-vectorizer)

:

for (size_t i = 0; i<n; i+=1)
  sum += A[i];
use(sum)

or

for (long long j = 0; j<n; j+=1) {
  A[j] = B[j] + 1;
  t = 7;
}
use(t);

or

  float x = 0;
  if (n > 2) {
	  for (int i = 0; i<n; i+=1) {
		A[i] = B[i] + 1;
	  }
	  x = 3;
  }
  return x;

Not all have any parallelism in the classic sense (had to add at least one loop or otherwise ScopDetection would bail out).

Just one of these patterns have to appear and the loop would not be recognized anymore. Maybe inner loops, but those could again make use of one of these patterns) e.g.:

for (long long i = 0; i<n; i+=1) {
  for (long long j = 0; j<n; j+=1) {
    A[i] =+ B[j];
  }
  t = 7;
}
use(t);

(I should add some of these to test cases)

Motivated enough?

Some comments on the actual patch.

include/polly/TempScopInfo.h
272

We use references for Regions here if they cannot be null.

284

We use references for Regions here if they cannot be null.

lib/Analysis/TempScopInfo.cpp
179

I do not understand this condition? Why is it needed and what is the implication?

300

Can we remove the `buildExitPHIAccessFunctions` part if we add here something like:

if (&R == &SR)
  buidAccessFunctions(R, R->getExit(), /* only PHI's */  true);

and an `onlyPHIs` parameter in the buildAccessFunctions method?

355

typo

AFAIU Johannes would have preferred to have this conversation at the weekly phone conference, but here we go...

I think it does not hurt to get started. Also, like this we actually document our discussion for other interested people.

However, there _may_ also be cases where more than one edge goes from inside
the scop to the exit block.

During my first experiments with LICM I saw them very often. They nearly always occur if there is a reduction-to-scalar in the scop. In my first designs on de-LICM I did not consider that a loop body may read and write the same value, which typically occurs in reductions. During a Wednesday's phone call Tobias presented me several code snippets which I should consider and show exactly this behavior.

A typical example is a simple reduction:

for (size_t i = 0; i<n; i+=1)
  sum += A[i];
use(sum);

Here is the CFG I get when translating this into a function that actually compiles and running it with "polly-clang /tmp/test.c -O3 -mllvm -polly -mllvm -polly-position=before-vectorizer -mllvm -polly-show"

Polly has functionality for detecting reductions. If we say that we do not support such kind of loops because of multi-incoming-edge in the exit node, reduction detection becomes somewhat useless.

Additionally, it is very difficult to users of Polly to explain why Polly does not recognize their loop.

You are right, in the above case we get this multi-exit edges from region that go into a PHI node. Now, even if we would not support them, Polly would still detect the scop 'for.body -> for.cond.cleanup', but it would miss the conditional branch. My current feeling is that such conditional branches around a single loop are probably not so important, as most optimizations I can think of will be done on the loop itself. Even without support for 2) we would still detect the loop.

Now, maybe including the branch allows us to derive some context information about the loop? E.g. that it is executed at least once?

Looking for an example where we would actually loose loops, I came up with the following:

float foo(long n, float *A) {

float sum = 0;                                                                 
  for (long i = 0; i<n; i+=1)                                                  
  sum += A[i];                                                                 
  for (long i = 0; i<n; i+=1)                                                  
  sum += A[i];                                                                 
return sum;

}

If we do not support the multi-from-scop-edge PHI nodes, we will not be able to form a single scop and consequently will not be able to fuse the loops. I think this is a good motivating example, no?

This case 2) is more complicated as the value that needs to be written into %val, can be coming from multiple places from inside the SCoP and to my understanding the main reason for the changes TempScop and ScopInfo is the need to model certain data-dependences to support this second case.

Now, are there alternatives

A) Split the block ``before'' Polly

A) Do not handle case 2), but just focus on case 1)

  1. I would argue that we can modify the CFG if it is trivially reversable.

[...]. At an earlier point we had a large set of transformations in between the two passes and unfortunately some of them triggered surprising bugs due to them invalidating ScopDetection unexpectedly. (I remember that some seemingly unrelated IR changes invalided the BasicBlock alias analysis). To be fully save, we now _always_ rerun ScopDetection before TempScopInfo to ensure the SCoP we look at is still valid. At some point, I would like to get rid of this additional verification pass and the easiest way to get confidence that nothing breaks is to just _not_ touch anything in between. [...]

Isn't CodeGeneration the larger problem? We cannot make it not modify the code and multiple CodeGenerations have to be serialized between each other, so one of them may touch the other scop.

Sure it is. Still, this does not make the other problem go away.

I am still not sure why we need the PHI nodes in the region. Why not make code gen able to handle PHI nodes in the exit? Is there a benefit in this way I do not see or a major problem in the other one?

Support for case 2) is why this can not be handled in codegen only. I did not find the time to look through all the
patches, but this does not seem to be clear from the proposed commit messages at least.

A CodeGen-only solution cannot handle the situation with two incoming edges. For the passes before the use of the PHI node is just an exposed value, since used outside the region. To select which of values the PHI node will have we also need to know which exiting edge it was using. This information is not preserved in the scop.

Right, that is what you explained me earlier and I do understand this now. However, this was not clear from the proposed commit messages. I think it will become clear to Johannes as well.

I would like good reasons to change 4-5 layers of Polly code and add special cases all over the place.

Right. When Michael and me discussed this two weeks ago, I was also concerned about the amount of changes and the additional code complexity. My suggestion was at this point to just focus on case 1) which seems to be the common case and only to worry about case 2) if we find cases where supporting it actually matters. To
my understanding we did not yet find such cases, but Michael was convinced he can handle both cases without introducing too much complexity such that even
without such supporting cases it is better to just implement it once and for all
rather than to wait for use cases that make it actually necessary.

I did not yet manage to look at the patches in detail, but the change seems to be at least non-trivial now. Michael, did you happen to come across a motivating example
that would justify adding support for case 2)?

I have multiple arguments to this point:

  • CodeGeneration is already complicated enough. This approach does not change _anything_ in BlockGenerator. Changes there would have to consider all the generators (Scalar, Vector, (GPU?)), making such a solution even harder.

Sorry that this has not become clear. I think the solution you have chosen is the best possible given that I asked to not touch the IR beforehand. I know you put quite some thoughts into why you choose the approach you did. It might be worth sharing these considerations in your commit messages.

  • The largest change (http://reviews.llvm.org/D11867) is because simplifyRegion and executeScopConditionally did not properly update RegionInfo. They did not care for setRegionFor sometimes did not always set the correct region entries/exits or made assumptions that do not always hold (There can be PHI nodes in blocks with a single incoming edge). Sorry, I have no test cases; those were usually only triggered in preliminary modifications, e.g. I added more invocations of RegionInfo::verifyAnalysis. I requested adding a check of RegionInfo::getRegionInfo to the verification, so these changes are necessary in every case.

Yes, this is a very good point. I did not look at all patches in detail yet, but it seems you did a lot of cleaning/bug-fixing on the way. So probably many of the changes you propose are not only needed for the PHI node handling, but are by themselves beneficial and don't really increase code complexity just for PHI nodes. I already started to go through them, such that we can get them committed quickly.

  • I never argued that any change would be "trivial". AFAI remember it was something like "not too complicated".

That's what I wrote ;):

"Michael was convinced he can handle both cases without introducing too much complexity"

The solution not being trivial, just requires some thoughts/motivations.

  • I'd also argue that this change is not too complicated. All it does is "redefining" the PHI nodes of the exit block to be inside the regions. Since there was no abstraction for that yet, the individual places had to be changed. There are only 5 of them. All of them follow this very same idea. In this sense there are no "special cases" added, there is just a redefinition which is easy to understand.

Adding handling for PHI nodes in the exit block in BlockGenerator would be a special case only for such PHI nodes. In this patch we make the existing code handle the new situation.

I did not look into the details of your patches. I obviously dislike any complexity increase, but my current feeling is that it will be hard to go with less complexity and that the complexity added is in the end not too bad.

I think the first point to address is to clarify the motivation of these patches and the implementation choices. (Which we are doing here). I think we are on a good way.

  • IMHO supporting only case 1) but not case 2) would be an ugly workaround of the symptom. I'd very much prefer to make the problem go away altogether.

OK.

  • The examples (all -O3 -polly-position=before-vectorizer)

:

for (size_t i = 0; i<n; i+=1)
  sum += A[i];
use(sum)

or

for (long long j = 0; j<n; j+=1) {
  A[j] = B[j] + 1;
  t = 7;
}
use(t);

or

  float x = 0;
  if (n > 2) {
	  for (int i = 0; i<n; i+=1) {
		A[i] = B[i] + 1;
	  }
	  x = 3;
  }
  return x;

Not all have any parallelism in the classic sense (had to add at least one loop or otherwise ScopDetection would bail out).

Just one of these patterns have to appear and the loop would not be recognized anymore. Maybe inner loops, but those could again make use of one of these patterns) e.g.:

for (long long i = 0; i<n; i+=1) {
  for (long long j = 0; j<n; j+=1) {
    A[i] =+ B[j];
  }
  t = 7;
}
use(t);

(I should add some of these to test cases)

Motivated enough?

All these examples will just loose the outermost condition if we do not support 2). As long as we do not know any optimization/transformation why including this outer condition is indeed useful, just detecting the inner loops seems still OK to me.

Now, I think the loop fusion example I gave above is a good motivation.

Best,
Tobias

Just a test if my reply gets added to phabricator.

Tobias

Another phabricator test (This time with inline comments)

Another phabricator test II (This time with inline comments)

Hallo

Test

Another phabricator test III (This time with inline comments)

Hallo

Test

grosser added a comment.

In http://reviews.llvm.org/D11870#220094, @Meinersbur wrote:

AFAIU Johannes would have preferred to have this conversation at the weekly phone conference, but here we go...

This time with inline comments.

Tobias

A typical example is a simple reduction:

for (size_t i = 0; i<n; i+=1)
  sum += A[i];
use(sum);

Here is the CFG I get when translating this into a function that actually compiles and running it with "polly-clang /tmp/test.c -O3 -mllvm -polly -mllvm -polly-position=before-vectorizer -mllvm -polly-show"

Thanks for the nice graphs.

Polly has functionality for detecting reductions. If we say that we do not support such kind of loops because of multi-incoming-edge in the exit node, reduction detection becomes somewhat useless.

Additionally, it is very difficult to users of Polly to explain why Polly does not recognize their loop.

You are right, in the above case we get this multi-exit edges from region that go into a PHI node. Now, even if we would not support them, Polly would still detect the scop 'for.body -> for.cond.cleanup', but it would miss the conditional branch. My current feeling is that such conditional branches around a single loop are probably not so important, as most optimizations I can think of will be done on the loop itself. Even without support for 2) we would still detect the loop.

Mmh, you are right I only looked whether the PHI nodes have more than one edge and whether it the region could be made smaller, not where the edges come from.

I think Polly could support such cases easily if ScopDetection::isValidExit would not just look for the existence of a PHI node, but instead count the number of edges from the region. If it is exactly one, we are good to go. I'd still expect some problems in how CodeGeneration currently simplifies loops.

Looking for an example where we would actually loose loops, I came up with the following:

float foo(long n, float *A) {

float sum = 0;                                                                 
  for (long i = 0; i<n; i+=1)                                                  
  sum += A[i];                                                                 
  for (long i = 0; i<n; i+=1)                                                  
  sum += A[i];                                                                 
return sum;

}

If we do not support the multi-from-scop-edge PHI nodes, we will not be able to form a single scop and consequently will not be able to fuse the loops. I think this is a good motivating example, no?

Yes. Thank you for coming up with a better example than me.

I am still not sure why we need the PHI nodes in the region. Why not make code gen able to handle PHI nodes in the exit? Is there a benefit in this way I do not see or a major problem in the other one?

Support for case 2) is why this can not be handled in codegen only. I did not find the time to look through all the
patches, but this does not seem to be clear from the proposed commit messages at least.

A CodeGen-only solution cannot handle the situation with two incoming edges. For the passes before the use of the PHI node is just an exposed value, since used outside the region. To select which of values the PHI node will have we also need to know which exiting edge it was using. This information is not preserved in the scop.

Right, that is what you explained me earlier and I do understand this now. However, this was not clear from the proposed commit messages. I think it will become clear to Johannes as well.

An addendum: The handling of scalars gets around this by adding writes to the scop statement. This creates output dependencies (WAW) between them and ensures that the last write is the value we look for. Without these output dependencies, isl is free to reorder the statements such that the wrong definition might be the last one.

I would like good reasons to change 4-5 layers of Polly code and add special cases all over the place.

Right. When Michael and me discussed this two weeks ago, I was also concerned about the amount of changes and the additional code complexity. My suggestion was at this point to just focus on case 1) which seems to be the common case and only to worry about case 2) if we find cases where supporting it actually matters. To
my understanding we did not yet find such cases, but Michael was convinced he can handle both cases without introducing too much complexity such that even
without such supporting cases it is better to just implement it once and for all
rather than to wait for use cases that make it actually necessary.

I did not yet manage to look at the patches in detail, but the change seems to be at least non-trivial now. Michael, did you happen to come across a motivating example
that would justify adding support for case 2)?

I have multiple arguments to this point:

  • CodeGeneration is already complicated enough. This approach does not change _anything_ in BlockGenerator. Changes there would have to consider all the generators (Scalar, Vector, (GPU?)), making such a solution even harder.

Sorry that this has not become clear. I think the solution you have chosen is the best possible given that I asked to not touch the IR beforehand. I know you put quite some thoughts into why you choose the approach you did. It might be worth sharing these considerations in your commit messages.

We still have to convince Johannes.

  • I never argued that any change would be "trivial". AFAI remember it was something like "not too complicated".

That's what I wrote ;):

"Michael was convinced he can handle both cases without introducing too much complexity"

The solution not being trivial, just requires some thoughts/motivations.

Sorry for the strawman.

I did not look into the details of your patches. I obviously dislike any complexity increase, but my current feeling is that it will be hard to go with less complexity and that the complexity added is in the end not too bad.

It's probably me as implementer which should dislike complexity increase the most. The least complex solution would be modifying the IR. This one should be the second least complex that does the job.

Now, I think the loop fusion example I gave above is a good motivation.

Thanks again for it :-)

Meinersbur added a comment.

In http://reviews.llvm.org/D11870#220194, @grosser wrote:

In http://reviews.llvm.org/D11870#220094, @Meinersbur wrote:

You are right, in the above case we get this multi-exit edges from region that go into a PHI node. Now, even if we would not support them, Polly would still detect the scop 'for.body -> for.cond.cleanup', but it would miss the conditional branch. My current feeling is that such conditional branches around a single loop are probably not so important, as most optimizations I can think of will be done on the loop itself. Even without support for 2) we would still detect the loop.

Mmh, you are right I only looked whether the PHI nodes have more than one edge and whether it the region could be made smaller, not where the edges come from.

I think Polly could support such cases easily if ScopDetection::isValidExit would not just look for the existence of a PHI node, but instead count the number of edges from the region. If it is exactly one, we are good to go. I'd still expect some problems in how CodeGeneration currently simplifies loops.

This is precisely what I was trying to suggest earlier. I assume it requires some changes, but
those should be mostly local to the region simplification in the code generation.

Actually, if this change is simple, it might be worth adding this (and a test case as an intermediate step).

Right, that is what you explained me earlier and I do understand this now. However, this was not clear from the proposed commit messages. I think it will become clear to Johannes as well.

An addendum: The handling of scalars gets around this by adding writes to the scop statement. This creates output dependencies (WAW) between them and ensures that the last write is the value we look for. Without these output dependencies, isl is free to reorder the statements such that the wrong definition might be the last one.

Johannes had some comments here. I let you guys figure this one out.

Best,
Tobias

Hi Michael,

i think we should add a comment and an example that explains why we need to model exiting PHI nodes. Here another
one that might be useful:

Here some possible text:

In case there are dependences from multiple scalars inside the region to a single PHI node in the exit block of the region, it is unclear which value needs to be stored in the PHI node.

Example:

bb:
  br %cond, stmtA, stmt B,

stmtA:
  br merge

stmtB:
  br merge

merge:
  %val = phi [%valA, %stmtA], [%valB, stmtB]
  ret %val

For the region (bb -> merge), the value that needs to be stored in %val is %valA, if the PHI node is reached through %stmtA, and %valB, if the PHI node is reached through %stmtB. Hence, at code generation time we can not just store the value of a specific scalar in the PHI node, but need to "choose" between the values of multiple scalars. For this we use the same approach as we use for PHI nodes in general, we add virtual writes to a PHI-node location and then read back the original value in the exiting block.

Now thinking more about this patch, I wonder if we are not modeling slightly too much. Would it not be sufficient to model the _writes_ to the PHI node location, but to not emit a ScopStmt for the ExitBlock node, the reads in this PHI node nor the scalar write in this PHI node. Instead, we just iterate during code generation through all exit PHI nodes, add a read from the scalar location that belongs to them and replace the multiple exiting edges in this PHI node with the scalar we just read?

Also, I think it makes sense to merge all patches and just track the entire patch in this phabricator review. This review has the most discussion happening and none of these patches can be committed (or even reasoned about in isolation).

Best,
Tobias

lib/Analysis/TempScopInfo.cpp
304

they belong

314

This does not seem to be needed, if we only model the writes to the exit PHI nodes, but not their reads.

352

This does not seem to be needed, if we only model the writes to the exit PHI nodes, but not their reads.

356

point

lib/Support/SCEVValidator.cpp
494

Is this really needed? To my understanding, there is no way a SCEV expression that is used
in the scop will reference any of the PHI nodes in the exit block.

Meinersbur marked 7 inline comments as done.Aug 12 2015, 3:16 PM

i think we should add a comment and an example that explains why we need to model exiting PHI nodes. Here another
one that might be useful:

Thanks for the support.

Now thinking more about this patch, I wonder if we are not modeling slightly too much. Would it not be sufficient to model the _writes_ to the PHI node location, but to not emit a ScopStmt for the ExitBlock node, the reads in this PHI node nor the scalar write in this PHI node. Instead, we just iterate during code generation through all exit PHI nodes, add a read from the scalar location that belongs to them and replace the multiple exiting edges in this PHI node with the scalar we just read?

Yes, I think it would be sufficient to model only the writes s.t. the write dependencies cause output dependencies (WAW) and that the correct value is written last. We already discussed this and Johannes was not agreeing with me.

However, implementing this would add a many special, potentially buggy, code just for this PHis while we can handle such PHIs like any other with already working and tested code. Also, what would be the advantage of omitting a single read MemoryAccess at the end?

lib/Analysis/TempScopInfo.cpp
300

I refactored this part, has less specialized code now.

314

refactored

lib/Support/SCEVValidator.cpp
494

I am not sure enough to remove this condition. Are you?

Could it be the induction variable of a parent loop?

i think we should add a comment and an example that explains why we need to model exiting PHI nodes. Here another
one that might be useful:

Thanks for the support.

Now thinking more about this patch, I wonder if we are not modeling slightly too much. Would it not be sufficient to model the _writes_ to the PHI node location, but to not emit a ScopStmt for the ExitBlock node, the reads in this PHI node nor the scalar write in this PHI node. Instead, we just iterate during code generation through all exit PHI nodes, add a read from the scalar location that belongs to them and replace the multiple exiting edges in this PHI node with the scalar we just read?

Yes, I think it would be sufficient to model only the writes s.t. the write dependencies cause output dependencies (WAW) and that the correct value is written last. We already discussed this and Johannes was not agreeing with me.

However, implementing this would add a many special, potentially buggy, code just for this PHis while we can handle such PHIs like any other with already working and tested code. Also, what would be the advantage of omitting a single read MemoryAccess at the end?

My hope was that the code actually does not need to change so much, but that we can just drop some of what is proposed for inclusion today. We would need no changes to ScopInfo, less changes to TempScopInfo and hopefully no copying of PHIs in CodeGeneration. This idea was meant as a hint of how this code could possibly simplified. If this does not notably reduce complexity, then this idea is probably not so useful.

Best,
Tobias

PS. In your comments you said you addressed some of the comments. Can you possibly upload a revised version of this patch for me to look (and possibly play around)?

lib/Support/SCEVValidator.cpp
494

I am pretty certain none of these conditions is needed. I propose to drop them if we can not find a test case which requires them.

My reasoning. Any value that is part of a SCEV needs to dominate the location at which this SCEV is evaluated. The values in the exit block
of a scop do not dominate any of the values inside the scop. There is one very funny case which I did not think fully throuhg, which is a scop in
the backedge of a larger loop, where the exit of this scop is the header of the larger loop and where some of the PHI nodes in this header are
again used in the scop. However, if such a case actually can be created, I am not yet convinced that what you do is right here. Before we
put some random instructions in, I propose to create a test-case first or, if this fails, to just place an assert that warns us in case we encounter
such piece of code.

grosser added a comment.
Now thinking more about this patch, I wonder if we are not modeling slightly too much. Would it not be sufficient to model the _writes_ to the PHI node location, but to not emit a ScopStmt for the ExitBlock node, the reads in this PHI node nor the scalar write in this PHI node. Instead, we just iterate during code generation through all exit PHI nodes, add a read from the scalar location that belongs to them and replace the multiple exiting edges in this PHI node with the scalar we just read?

That would mean 3 changes:

  1. Remove the ScopDetection exit phi stuff (as before)
  2. Create write accesses to the PHI location at the predecessors blocks of the exit and link them to the operand of the PHI for that predecessor.

Right. To my understanding this is already available in Michael's changes.
The only change to TempScopInfo that needs to remain is the following:

void TempScopInfo::buildExitPHIAccessFunctions(Region *R) {
​ AccFuncSetType Functions;
​ BasicBlock *ExitBB = R->getExit();
​ assert(!R->contains(ExitBB));

​ for (auto I = ExitBB->begin(); isa<PHINode>(I); ++I) {
​ // TODO: Maybe we can ignore trivial (one predecessor) Phi nodes
​ auto PHI = cast<PHINode>(I);
​ buildPHIAccesses(PHI, *R, Functions, nullptr);
​ }
​}

  1. After code generation (which should already allocate and write the operands to unique locations per exit PHI) iterate over all exit PHI nodes and replace the incoming edges that are in the SCoP with the reloaded the value.

Interesting. So you say the writes to unique values will already happen?
Then the question is really only how the PHI node survives the splitting
and how we can replace its operands. That sounds too simple to be true?

Tobias

Meinersbur retitled this revision from [Polly] Make TempScopInfo handle PHI nodes in exit block to [Polly] Allow PHI nodes in exit blocks.Aug 13 2015, 10:48 AM
Meinersbur updated this object.
Meinersbur edited edge metadata.
Meinersbur marked 2 inline comments as done.

Merged all 4 patches, fixed updating the RegionInfo in case the exit block is also the entry of a different region.

Added Tobias 2nd example as test case.

Without D12014 and fixing inner_scev.ll, this breaks some test-suite programs. Will try to fix the latter and possibly undo the change in SCEVValidator before committing.

My hope was that the code actually does not need to change so much, but that we can just drop some of what is proposed for inclusion today. We would need no changes to ScopInfo,

It's just 3 lines w/o comments/assertions!

less changes to TempScopInfo

I don't think so.

and hopefully no copying of PHIs in CodeGeneration.

But a lot of special handling. Why is copying PHIs bad? We copy a lot of instructions in CodeGeneration.

This idea was meant as a hint of how this code could possibly simplified. If this does not notably reduce complexity, then this idea is probably not so useful.

I have a solution ready now, and think this is more complicated. Is it worth investigating? If yes, why?

PS. In your comments you said you addressed some of the comments. Can you possibly upload a revised version of this patch for me to look (and possibly play around)?

Sorry, was searching bugs in test-suite cases that I wanted to fix in the new upload. Was harder than expected.

lib/Analysis/ScopInfo.cpp
862 ↗(On Diff #32074)

Not going to commit this

  1. Create write accesses to the PHI location at the predecessors blocks of the exit and link them to the operand of the PHI for that predecessor.

Right. To my understanding this is already available in Michael's changes.
The only change to TempScopInfo that needs to remain is the following:

void TempScopInfo::buildExitPHIAccessFunctions(Region *R) {
​ AccFuncSetType Functions;
​ BasicBlock *ExitBB = R->getExit();
​ assert(!R->contains(ExitBB));

​ for (auto I = ExitBB->begin(); isa<PHINode>(I); ++I) {
​ // TODO: Maybe we can ignore trivial (one predecessor) Phi nodes
​ auto PHI = cast<PHINode>(I);
​ buildPHIAccesses(PHI, *R, Functions, nullptr);
​ }
​}

I removed that function in the most recent patch.

  1. After code generation (which should already allocate and write the operands to unique locations per exit PHI) iterate over all exit PHI nodes and replace the incoming edges that are in the SCoP with the reloaded the value.

Interesting. So you say the writes to unique values will already happen?

By TempScopInfo::buildPHIAccesses, invoked on the PHI node.

Then the question is really only how the PHI node survives the splitting
and how we can replace its operands. That sounds too simple to be true?

The previous exit node by simplifyRegion has been moved behind polly.merge_new_and_old. Its not really relevant since we just have to look at the exposed value list.

One can make this work by just "behaving" as if the last read for the PHI node was there. I still think its more complicated because the PHI node is referenced in IRAccess (e.g. as "BaseAddress") and may assume it is in the region.

Meinersbur updated this revision to Diff 32156.Aug 14 2015, 8:42 AM
Meinersbur updated this object.

Do not create a ScopStmt for the exit block anymore. IRAccesses for it are still generated but not used.

This passes the regression tests, but not the test-suite (maybe for the same reasons as inner_scev.ll)

grosser accepted this revision.Aug 15 2015, 1:40 AM
grosser edited edge metadata.

Hi Michael,

very nice! This patch has now become very small and easy to understand. Thank you for refactoring it once again. I added just a few minor comments, but nothing serious. Feel free to incorporate them following your own judgement.

Best,
Tobias

include/polly/CodeGen/BlockGenerators.h
395

space after 'by'

lib/Analysis/TempScopInfo.cpp
319

To avoid spreading the knowledge about how to handle the
exit nodes, you could give this function a parameter "OnlyPHINodes" and then set it to true when calling it for the
exit node. Like this this piece of code does not need to know
about exit-node specialities and the comment above
buildAccessFunctions can explain everything.

362

Would this be a good place for the large comment that explains why we need to model access functions in the exit
node?

lib/CodeGen/BlockGenerators.cpp
455

I do not think this change is needed. If I drop this no test case
fails and we also will never iterate over the instructions of an
exit node, hence 'Base' will always be contained in the region.

458

It seems you are also running this code on ExitingBlocks that do _not_ result from region simplification. PHI nodes in these BasicBlocks never need this code to be run, no? If we keep a note that we simplified the region (and that the exiting node is now not modeled) and only then run this code, this code below could possiby be a little bit shorter and we would also not need to reason about if this code actually does the right thing
for ExitNodes that do not result from simplification.

test/Isl/CodeGen/OpenMP/single_loop.ll
46 ↗(On Diff #32156)

This change is unrelated. It seems to revert 244954 which was needed to adjust this test case to a recent change in upstream LLVM.

This revision is now accepted and ready to land.Aug 15 2015, 1:40 AM

I looked into the solution Tobias and I talked about. It is entirely feasible to let the regular Scalar and Escape magic in the codegen handle these PHI nodes.

My patch is available here (http://reviews.llvm.org/D12051) but it does not yet pass all unit tests (I have to change them as they are changed here I guess) and 3 lnt tests. 2 of the lnt tests are unrelated errors, the third is unspecified yet. The one error was actually discovered a few days ago (the sdiv prameter problem) and for the other one I will write a tests case soon. Anyway, my point is that we do not need that much special code in all Polly passes after all.

jdoerfert added a comment.

I looked into the solution Tobias and I talked about. It is entirely feasible to let the regular Scalar and Escape magic in the codegen handle these PHI nodes.

My patch is available here (http://reviews.llvm.org/D12051) but it does not yet pass all unit tests (I have to change them as they are changed here I guess) and 3 lnt tests. 2 of the lnt tests are unrelated errors, the third is unspecified yet. The one error was actually discovered a few days ago (the sdiv prameter problem) and for the other one I will write a tests case soon. Anyway, my point is that we do not need that much special code in all Polly passes after all.

Johannes, did you have a look at Michael's latest patch? He implemented a codegen-only approach,
which indeed seems to be optimal in terms of code-complexity. I would be interested to read
your opinion (in a review?).

Best,
Tobias

grosser added a comment.

jdoerfert added a comment.

I looked into the solution Tobias and I talked about. It is entirely feasible to let the regular Scalar and Escape magic in the codegen handle these PHI nodes.

My patch is available here (http://reviews.llvm.org/D12051) but it does not yet pass all unit tests (I have to change them as they are changed here I guess) and 3 lnt tests. 2 of the lnt tests are unrelated errors, the third is unspecified yet. The one error was actually discovered a few days ago (the sdiv prameter problem) and for the other one I will write a tests case soon. Anyway, my point is that we do not need that much special code in all Polly passes after all.

Johannes, did you have a look at Michael's latest patch? He implemented a codegen-only approach,
which indeed seems to be optimal in terms of code-complexity. I would be interested to read
your opinion (in a review?).

Briefly, but I can tonight. I saw that it is codegen-only but it still
treats exit PHI nodes different from everything else while they are the
same. Anyway, I'll write something more tonight.

I added some comments to this commit. Additionally, I would like to include at least the cases from http://reviews.llvm.org/D12051 .

Because we actually have 2 "codegen only" solutions to this problem now, there are a two new questions I will only ask but not answer:

  1. Should we model the exit PHI node operands always or only if we actually need to (hence split the region exit later and work on the PHI nodes)
  2. Should we use the general scalar codegen and escaping scalar system for these PHI nodes or should we handle them separatly? There is actually a middle ground where the operands are stored by the ordinary codegen into a PHI operand alloca automatically and only the merging needs to be done explicitly later on or delegated to the scalar escape system.
include/polly/CodeGen/BlockGenerators.h
395

The comment is not what I would have expected as it does not describe the function but somehow tries to justify it's existens.

Maybe somethin like:

PHI nodes in the exiting block have been in the region exit block before region simplification (a pre transformation run before code generation). As the region exit is not part of the region they have not been modeled in the SCoP, hence general code generarion did not consider these PHI nodes. However, they have to be "replicated" in the optimized version of the SCoP and these results need to be merged with the original PHI nodes in order to get the correct live out scalar values.

lib/CodeGen/BlockGenerators.cpp
455

I think we have to options here. Either we treat the exit PHI nodes as "almost regular" PHI nodes than we need some special handling here or we add the special handling later in the pipeline and skip them here. As this patch choose the latter option I am not so sure that we can drop the above code, however we probably could if we later use the location created for the phi operands here in the special handling of the exit PHI nodes.

458

I think Tobias idea is good but only one possibility. What comes to mind are at least three possibilites:

  1. We could track if we simplified a region.
  2. We could only model the exit PHI nodes if we will simplify the region.
  3. We could always run the code and distinguish here what to do.

I am voting for 1) or 2) while I slightly favour 2).

Regarding the code here, it looks like the handleOutsideUses code somewhere and does basically the same thing, hence we should probably try to reuse a maybe generalized version of the handleOutsideUses here instead of copying it.

test/ScopDetect/multidim_indirect_access.ll
2–3

We do not have independent blocks anymore. This is not for this commit to change but we should just remove the one run of this test case and the comment here.

7–8

Is this comment still valid afterwards?

Briefly, but I can tonight. I saw that it is codegen-only but it still
treats exit PHI nodes different

Only of my goals was to _not_ trat such PHis differently, which I weakened since Tobias think it was modeling too much.

It's not codegen-only. There are two changes in TempScopInfo.

D12051 has even more special handling for exit nodes PHIs in TempScopInfo.

from everything else while they are the
same.

There are big differences in codegen. D12051 tries to handle exit node PHIs while creating block. This one identifies the PHIs afterwards as escaping uses.

lib/Analysis/TempScopInfo.cpp
319

I don't see any advantage of that. The unit of cohesion is the class, not the method.

lib/CodeGen/BlockGenerators.cpp
455

"Base" is not the instruction to copy (which is "Inst"), but the virtual address accessed. In case of scalars the PHI node itself is abused as address and can well be the one in the exit node. I know for certain that this condition hits.

However, it might be still a mistake. The code below is responsible to write the value to the "virtual PHI address" for this incoming block. Not writing it means undef value.

458

polly::simplifyRegion is not about "modeling exit nodes". It is just to ensure that there is a single exiting edge. Adding such thing would take away the general purpose of that function and I still think it belongs closer to RegionInfo than to polly.

Having robust functions that do work in general cases is a good thing.

test/Isl/CodeGen/OpenMP/single_loop.ll
46 ↗(On Diff #32156)

I just noticed something breaks and I fixed it, assuming of of my changes induced some metadata reordering. Presumbly I updated Polly, but not LLVM.

Only of my goals was to _not_ trat such PHis differently, which I weakened since Tobias think it was modeling too much.

Question now is if we want to treat them with the machinery we have or with handleExitingPHIs afterwards.

It's not codegen-only. There are two changes in TempScopInfo.

Agreed, but that is only relevant if these changes could be avoided somehow.

D12051 has even more special handling for exit nodes PHIs in TempScopInfo.

You mean the part that only exit PHI nodes are modelt we actually need to model?

from everything else while they are the same.

There are big differences in codegen. D12051 tries to handle exit node PHIs while creating block. This one identifies the PHIs afterwards as escaping uses.

I would phrase it differently:

D12051 handles exit nodes PHIs the same way other escape users are handled. Same as all other escaping values it demotes the escaping alue when it occures in the new SCoP and waits for the SCoP finalization to reload and merge the values correctly.

When I think about it, I am actually a little puzzled how/when/where the values are stored in the PHI operand alloca here. Where do we generate the store instructions for the exit PHI operands?

Btw. How many lnt tests fail (some are unrelated I know but it would be good to know.)?

lib/CodeGen/BlockGenerators.cpp
455

The code below is responsible to write the value to the "virtual PHI address" for this incoming block.

That's what I was hinting at with:

we probably could [remove this code] if we later use the location created for the phi operands here in the special handling of the exit PHI nodes.

Dear all,

I think this patch is already very close to optimal. Johannes seems to have some last ideas how the exit PHI-node generation could be improved which I don't yet fully understand. As he has written the PHI node generation he probably knows this better as me anyhow. To resolve the deadlock of multiple patch submissions, i propose to continue with this review and to only use Johannes' patch to explain/illustrate ideas. It would be great if you guys could work out together how the final submission should look like.

The blocking srem issue seems to be close to be resolved, based on the work both of you did, so it would be nice to get this in quickly, too.

Best,
Tobias

include/polly/CodeGen/BlockGenerators.h
395

The comment johannes suggested here seems indeed useful.

lib/Analysis/TempScopInfo.cpp
319

Sure, but if I want to learn about how exit-node PHI nodes are
code generated it is easier if I can look at one place and one comment, rather than having to look through the whole class
to find the places they need special handling.

Anyhow, this is just a minor stylistic comment. If you feel strong about this, feel free to leave it as it is.

lib/CodeGen/BlockGenerators.cpp
458

@Meinsburg: I was just proposing that polly::simplifyRegion reports if it simplified the exit node (or we detect it otherwise). This still seems to be general purpose.

From this information we can then derive that we only need to run this code if the simplification actually happened and can get rid of the code
that makes sure we do not do anything if we hit PHI-nodes that are modeled as part of the region.

Again, this is getting very detailed. If you feel strong, I am fine leaving the code as it is and possibly improve it later.

Meinersbur updated this revision to Diff 32398.Aug 18 2015, 3:35 AM
Meinersbur edited edge metadata.

The last diff was from a while ago so though I update it.

This is based on Johannes' patch D12066. It passes regression tests and all lnt tests except those that fail already on trunk:
http://lab.llvm.org:8011/builders/perf-x86_64-penryn-O3-polly/builds/2719/
Note that for some reason when I run lnt locally on trunk, I get 4 failed tests. I will try running on the buildbots.

There are some unnecessary changes, out-to-date comments and one fix that I will commit separately. I suggest to wait for the next diff before commenting on individual lines.

Only of my goals was to _not_ trat such PHis differently, which I weakened since Tobias think it was modeling too much.

Question now is if we want to treat them with the machinery we have or with handleExitingPHIs afterwards.

IMHO it is a good idea to keep the complexity out of the general case when those only occur at the end.

D12051 has even more special handling for exit nodes PHIs in TempScopInfo.

You mean the part that only exit PHI nodes are modelt we actually need to model?

Never mind, I see they are about equivalent. I think changing buildPHIAccesses might be preferred over changing SCEVInRegionDependences.

When I think about it, I am actually a little puzzled how/when/where the values are stored in the PHI operand alloca here. Where do we generate the store instructions for the exit PHI operands?

In generateScalarStores when it encounters a PHI MemoryAccess.

Meinersbur updated this revision to Diff 32398.
Meinersbur added a comment.

The last diff was from a while ago so though I update it.

This is based on Johannes' patch http://reviews.llvm.org/D12066. It passes regression tests and all lnt tests except those that fail already on trunk:
http://lab.llvm.org:8011/builders/perf-x86_64-penryn-O3-polly/builds/2719/
Note that for some reason when I run lnt locally on trunk, I get 4 failed tests. I will try running on the buildbots.

Do not worry about these 4 crashes. They are issues in LLVM and should hopefully be fixed quickly.

Best,
Tobias

Meinersbur added inline comments.Aug 18 2015, 4:17 AM
lib/CodeGen/BlockGenerators.cpp
452–453

The change has been dropped as it is not necessary anymore.

458

I ensured that there is a distinct block just for such PHIs. That block will be empty if there is nothing to be done. I may have dropped that part prematurely in the last patch.

lib/CodeGen/Utils.cpp
24 ↗(On Diff #32398)

I made this function public because I was using it in handleExitingPHIs. That use is optional, hence this doesn't need to become public.

lib/Support/SCEVValidator.cpp
494

It actually is needed or lnt test will fail. However, I can instead pre-check whether the PHI is in the exit block as in D12051.

test/ScopDetect/multidim_indirect_access.ll
2–3

OK, I will update this test case.

Do not worry about these 4 crashes. They are issues in LLVM and should hopefully be fixed quickly.

OK, thanks for letting me know. Do you know why?

Meinersbur added a comment.

In http://reviews.llvm.org/D11870#226538, @grosser wrote:

Do not worry about these 4 crashes. They are issues in LLVM and should hopefully be fixed quickly.

OK, thanks for letting me know. Do you know why?

I attached you one of the problematic commits. The other was meanwhile already reverted.

Tobias

Rebase to trunk

Still not the clean version I want to have.

Fails one lnt test:
FAIL: MultiSource/Benchmarks/mafft/pairlocalalign.execution_time (499 of 1494)

I would suggest to either simplify this a lot or use http://reviews.llvm.org/D12051 as it e.g., does not change the CFG at all (that is the major difference I see atm plus a few smaller design choices). If you have good reasons we should pursue with this patch please tell me.

The problem with pairalign share both so we need to fix that anyway.

include/polly/CodeGen/BlockGenerators.h
426

This change is unrelated and adds a lot of diff lines.

include/polly/Support/ScopHelper.h
93

This change is unrelated and adds a lot of diff lines.

150

I think this and the other big splitting change are actually not necessary. Maybe you could compare the output with
http://reviews.llvm.org/D12051 ?

lib/CodeGen/BlockGenerators.cpp
486

I wouldn't bet on it. Textual order != execution order, I had to fix a bug like this the other week.

Meinersbur updated this revision to Diff 33727.Sep 1 2015, 2:29 PM

Clean-up and rebasing to trunk which includes the patch for scalar dependencies (r246414)

All 1494 LNT tests passed.

Meinersbur updated this revision to Diff 33742.Sep 1 2015, 3:15 PM
Meinersbur marked 26 inline comments as done.

Addressed comments

There is some more room for generating less code:

  • If the original region has just a single exiting node most of the mechanism here is not necessary and can take a shortcut path.
  • If the escaping value is in a .phiops alloca, it is reloaded and then stored again into a .s2a alloca to satisfy the existing code.

I suggest to work on these post-commit, if requested.

The function createExitingBlockForExitNodePHIs is relatively central to my idea of an implementation as it preserves already stored references to the exit node and its PHI instructions and therefore avoids surprises. In the code generation part of Polly we can normalize the IR to avoid additional condition-checking in the code generation itself. Otherwise, we should go for D12051 instead.

include/polly/CodeGen/BlockGenerators.h
426

It is required because getNewValue gets called in a context without ScopStmt available. All that's needed of ScopStmt in the body is the region. However, to address your concern, I changed it to a Scop object to reduce the impact in other parts. On request I can also commit it separately.

include/polly/Support/ScopHelper.h
150

IMHO it is the cleaner approach than to expect PHI nodes outside of the region to have a specific form. Also, there are existing references to the PHI nodes on the llvm::Scop structure which we should preserve.

lib/Analysis/TempScopInfo.cpp
166

This condition has switched location with

if (UseParent == ParentBB)
  continue;

to ensure that if the exit block has a use of a PHI (i.e. intra-block use-def), that AnyCrossStmtUse is set and a scalar access is generated.

352

I don't understand this one

lib/CodeGen/BlockGenerators.cpp
486

Can you elaborate on the bug?
Do you mean the iteration order? There is a lot of other code in LLVM's core (e.g. BasicBlock::getFirstNonPHI or SimplifyCFG) that assumes that instruction lists are iterated in execution order, so I guess we can assume this as well.

lib/Support/SCEVValidator.cpp
494

Addendum: I tried this and it changed the output (test case non-affine-loop-condition-dependent-access_3.ll) because the pre-check would be overly pessimistic in case the PHI results in a constant. Hence I prefer this solution. LNT passes both variants, though.

test/ScopDetect/multidim_indirect_access.ll
6

The error message changes with this patch

7–8

I reformulated this in a previous commit such that it is not implementation-specific anymore.

Meinersbur updated this revision to Diff 34106.Sep 5 2015, 6:24 AM

rebase to r246926

Hi Michael,

nice that this patch now works flawless!

I have still a couple of things/items that I would like to check in more detail, but am a little low on time ATM. As none of these is likely to affect the general structure of the patch, I propose that after cross-checking with Johannes, we commit this as this is and then enhance this in-tree, if needed.

Some smaller comments inline.

Best,
Tobias

include/polly/CodeGen/BlockGenerators.h
426

Committing it separately would be nice.

include/polly/Support/ScopHelper.h
150

Similar to Johannes, I would also prefer to not split here. However, as splitting seems to cause other issues, I am fine leaving the splitting in for now
and taking the larger but known correct patch.

After this patch is committed, we can always improve on what we have at the moment. Maybe this also untangles the current two-patch review, as we
can get a first version in and then discuss additional improvements individually.

lib/Analysis/ScopDetection.cpp
972

Nice!

1019

Nice!

lib/Analysis/TempScopInfo.cpp
166

OK, but please ensure that at least one test case covers this case.

lib/Support/SCEVValidator.cpp
494

Interesting. It would be good to add a test case to ensure we do not regress here. With the current patch, I can uncomment this code and no test starts failing.

jdoerfert added inline comments.Sep 7 2015, 8:30 AM
include/polly/Support/ScopHelper.h
150

@grosser I don't get your comment. If splitting causes issues then we take the other patch that does not split. If we agree splitting is not what we want we also take the other patch. What is the reason we should go with a larger patch for now?

Meinersbur marked 2 inline comments as done.Sep 8 2015, 5:15 AM
Meinersbur added inline comments.
include/polly/CodeGen/BlockGenerators.h
426

kk, will commit this before the main commit

include/polly/Support/ScopHelper.h
150

If not done using this function. simplifyRegion at latest will also split the exit block, i.e. D12051 also "splits".
However, createExitingBlockForExitNodePHIs does this in a more predictable fashion:

  • It splits unconditionally, i.e. follow-up code has less cases to handle.
  • It keeps the original PHI node references inside the scop. The polly::Scop data structure already stores references to these PHIs. D12051 has (IMHO) fragile code to search for where the PHIs have been moved to.

If you do not share my opinion of a more structured approach, I'd agree to let Johannes do it his way.

lib/Analysis/TempScopInfo.cpp
166

I'll think of one

lib/Support/SCEVValidator.cpp
494

I'll think of one

Meinersbur marked 2 inline comments as done.

@grosser I don't get your comment. If splitting causes issues then we take the other patch that does not split. If we agree splitting is not what we want we also take the other patch. What is the reason we should go with a larger patch for now?

If not done using this function. simplifyRegion at latest will also split the exit block, i.e. D12051 also "splits".

We currently split some blocks but in D12051 there is no additional
splitting going on, hence I do not agree with the sentence above.
Especially, since we talk about the ~250 lines of additional
splitting code here.

However, createExitingBlockForExitNodePHIs does this in a more predictable fashion:

  • It splits unconditionally, i.e. follow-up code has less cases to handle.

True, but it also adds yet another function in another source file that
does some region "simplification".

  • It keeps the original PHI node references inside the scop. The polly::Scop data structure already stores references to these PHIs.

SCoPs always "somehow" reference these PHIs, in your commit as well as
in D12051. In the latter they are handled by representing the operands
as scalars that escape into the PHI node (which isn't a part of the
SCoP).

D12051 has (IMHO) fragile code to search for where the PHIs have been moved to.

What part are your referring too? There is no search going on so I
again disagree with the sentence above. Please be more specific if you
believe there is something fragile or a search.

If you do not share my opinion of a more structured approach, I'd agree to let Johannes do it his way.

This is just condescending. While you believe that your approach is more
structured I think it should be up for discussion. I presented in the
review of this patch arguments against some of the decisions and a
counter proposal in D12051. However, except the claims above I haven't
seen any reasons agains D12051 and for this patch.

Meinersbur abandoned this revision.Sep 8 2015, 7:36 AM

Meinersbur marked 2 inline comments as done.

@grosser I don't get your comment. If splitting causes issues then we take the other patch that does not split. If we agree splitting is not what we want we also take the other patch. What is the reason we should go with a larger patch for now?

If not done using this function. simplifyRegion at latest will also split the exit block, i.e. D12051 also "splits".

We currently split some blocks but in D12051 there is no additional
splitting going on, hence I do not agree with the sentence above.

The block is split, by either simplifyRegionExit or createExitingBlockForExitNodePHIs. but still at most once. Hence, no additional splitting.

Especially, since we talk about the ~250 lines of additional
splitting code here.

Most of the these lines are comments.

However, of course any code needs to justify its existence and I elaborated why I think it is. Now, simplifyRegionExit becomes dead code (i.e. it is replaced) and maybe we should remove it and call simplifyRegionEntry instead directly (I still think the function simplifyRegion is more general purpose and belongs into RegionInfo).

However, createExitingBlockForExitNodePHIs does this in a more predictable fashion:

  • It splits unconditionally, i.e. follow-up code has less cases to handle.

True, but it also adds yet another function in another source file that
does some region "simplification".

If it is better then I don't see the amount of code as an argument against it.

  • It keeps the original PHI node references inside the scop. The polly::Scop data structure already stores references to these PHIs.

SCoPs always "somehow" reference these PHIs, in your commit as well as
in D12051. In the latter they are handled by representing the operands
as scalars that escape into the PHI node (which isn't a part of the
SCoP).

It matters where these references are pointing to. TempScopInfo created them as belonging to a block or subregion, but in order to simplify the region, the PHIs are scattered into and out of the region. That's what I tried to avoid.

D12051 has (IMHO) fragile code to search for where the PHIs have been moved to.

What part are your referring too? There is no search going on so I
again disagree with the sentence above. Please be more specific if you
believe there is something fragile or a search.

When I wrote this I thought about the function getSingleInRegionPHIOperandPHI which I cannot find anymore in the latest update. Some description about major changes would have been nice. I am just today back from my holidays and still tired from traveling.

If you do not share my opinion of a more structured approach, I'd agree to let Johannes do it his way.

This is just condescending. While you believe that your approach is more
structured I think it should be up for discussion.

I am sorry for not having made clear that it is just my opinion about how "structured" a solution is, of course heavily influenced by the fact that I wrote this one and put more thoughts into it.

I presented in the
review of this patch arguments against some of the decisions and a
counter proposal in D12051. However, except the claims above I haven't
seen any reasons agains D12051 and for this patch.

I also haven't seen arguments for D12051 and against D11870 that convinced me. We should just accept that we can have different opinions.

As you seem to strongly prefer D12051 which also has become much better, and it is more important to me to have a working solution than to discuss which one, I hereby retract this diff.

If you are interested in a discussion about your and my arguments, I invite you to a discussion in or after the Wednesday's phone call.