This is an archive of the discontinued LLVM Phabricator instance.

[WIP][Polly] SSA Codegen
Needs ReviewPublic

Authored by jdoerfert on Dec 22 2015, 11:34 AM.

Details

Summary

TODO


Performance Regressions - Compile Time     Δ        Previous  Current  σ
S/B/P/l/k/3mm/3mm                          15.28%   0.8641    0.9961   0.0048

Performance Improvements - Compile Time    Δ        Previous  Current  σ
S/R/C/matrixTranspose                      -17.39%  0.0920    0.0760   0.0061
M/B/A/C/CrystalMk                          -6.69%   1.0760    1.0040   0.0141
S/B/P/l/k/cholesky/cholesky                -5.95%   0.3360    0.3160   0.0056

Performance Improvements - Execution Time  Δ        Previous  Current  σ
M/B/F/m/mason                              -5.56%   0.2160    0.2040   0.0037
M/B/A/C/CrystalMk                          -2.56%   7.6645    7.4685   0.0031
M/A/v/viterbi                              -2.02%   2.7722    2.7162   0.0060


Diff Detail

Event Timeline

jdoerfert updated this revision to Diff 43467.Dec 22 2015, 11:34 AM
jdoerfert retitled this revision from to [WIP][Polly] SSA Codegen.
jdoerfert added reviewers: grosser, Meinersbur.
jdoerfert updated this object.
jdoerfert added a subscriber: Restricted Project.
jdoerfert updated this revision to Diff 44102.Jan 6 2016, 2:31 AM

Allow changes in textual order for scalar dependences + 2 test cases

jdoerfert updated this revision to Diff 44103.Jan 6 2016, 2:35 AM

Actually add test cases and remove left-over debug messages

jdoerfert updated this revision to Diff 45834.Jan 24 2016, 1:14 PM

Fix last lnt compile time errors, running performance numbers on clean build now

The idea behind this code generation is that there are only 3 distinct points where we need to insert PHI nodes in order to merge (possibly) different definitions of a scalar value.

  1. After a conditional.
  2. After a loop when it was guarded by a conditional.
  3. In the loop header.

To be able to place these PHI nodes we keep scalar mappings that might be needed after a statement. Hence, we copy the mappings from the BBMap to the current ScalarMap. These ScalarMaps are stacked according to the nesting of the AST. Whenever we hit a merge point we check the two incoming ScalarMaps (e.g., from the two conditional branches, or the mappings prior to and after the loop) to see if a scalar was mapped and, if so, whether it was mapped to different values. In such a case we place a PHI. However, in loops we need the PHI even before we finish building the loop. To this end, scalars that are used but not defined in a loop, as well as originally loop-carried scalars, are mapped to a new loop-carried scalar in the optimized code. To make the scalars usable in a statement we use the current ScalarMap to initialize the BBMap and reuse the existing lookup logic to find a suitable mapping for a needed scalar.
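The merge step at such a join point can be sketched as follows. This is a minimal toy model in C++, using plain std::map and strings in place of Polly's actual ScalarMap and llvm::Value types; all names here are illustrative, not the patch's real API:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for llvm::Value and Polly's ScalarMap.
using Value = std::string;
using ScalarMap = std::map<std::string, Value>;

// Merge the ScalarMaps of the two branches of a conditional: whenever a
// scalar is mapped to different values on the two paths, record a (named)
// PHI placeholder in the merged map; if both paths agree, no PHI is needed.
ScalarMap mergeAtJoin(const ScalarMap &ThenMap, const ScalarMap &ElseMap) {
  ScalarMap Merged;
  for (const auto &KV : ThenMap) {
    auto It = ElseMap.find(KV.first);
    if (It == ElseMap.end())
      Merged[KV.first] = "phi(" + KV.second + ", undef)"; // defined on one path only
    else if (It->second == KV.second)
      Merged[KV.first] = KV.second; // same value on both paths: no PHI needed
    else
      Merged[KV.first] = "phi(" + KV.second + ", " + It->second + ")";
  }
  for (const auto &KV : ElseMap)
    if (!ThenMap.count(KV.first))
      Merged[KV.first] = "phi(undef, " + KV.second + ")";
  return Merged;
}
```

The same comparison applies at the other two merge points (after a guarded loop, and in the loop header), with the "prior to loop" and "after loop body" maps as the two inputs.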

lib/CodeGen/BlockGenerators.cpp
159

Debug left-over

grosser edited edge metadata.Jan 25 2016, 4:32 AM

Hi Johannes,

I did not get along to go through this in detail, but wanted to provide some first feedback.

First, thanks a lot for looking into ways to make Polly generate good scalar code without further preprocessing!

Here some first comments on the general approach:

  1. (you already know about this) I am concerned that this approach might be more complex and/or may have more corner cases we need to worry about. Aditya mentioned that in their implementation in Graphite, they fail to generate code in various situations. My understanding is that you are not aware of any such limitations in your code and you already checked several of the examples Aditya provided. It would probably be good for Aditya and Sebastian to have a look at this patch, as they have experience with SSA code generation and might be able to provide you with some test cases.
  2. It is interesting to see that this patch actually removes a lot of code. So maybe I am wrong about point 1).
  3. An alternative approach might be to just call PromoteMemToRegisters on the scalar allocas we introduced. This would allow us to keep the same mental model, but still generate good code.

Some questions I have before I get into reviewing this code:

a) Does your patch need to make any effort to eliminate dead PHI nodes? Will we have a lot of PHI nodes that might need to be dead-code eliminated after your patch? More precisely, would we now need to run a phase of DCE after Polly instead of Mem2Reg?

b) Are there any remaining issues or inherent limitations you are aware of in comparison to the old approach?

c) I know you are still running performance numbers, which will be interesting to look at. Otherwise, is this patch ready to be reviewed?

Sebastian, Aditya (and Zino as well),

this is Johannes' new SSA based code generation. You might have useful input on this change. This change is especially important as it affects larger parts of the back-end code generation.

Performance Improvements - Compile Time Δ Previous Current σ
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky -4.40% 0.3640 0.3480 0.0047
MultiSource/Benchmarks/Prolangs-C/football/football -1.26% 1.9161 1.8920 0.0089

  1. (you already know about this) I am concerned that this approach might be more complex and/or may have more corner cases we need to worry about. Aditya mentioned that in their implementation in Graphite, they fail to generate code in various situations. My understanding is that you are not aware of any such limitations in your code and you already checked several of the examples Aditya provided. It would probably be good for Aditya and Sebastian to have a look at this patch, as they have experience with SSA code generation and might be able to provide you with some test cases.

Test cases would be good, yes. And yes, I am not aware of general limitations or bugs.

  2. It is interesting to see that this patch actually removes a lot of code. So maybe I am wrong about point 1).

In my opinion it is not (so much) about the amount of code but the complexity. And most of the new code is rather straightforward (e.g., everything in ScopHelper.cpp, the merging after conditionals and after the loop, and the code that is produced by Polly).

  3. An alternative approach might be to just call PromoteMemToRegisters on the scalar allocas we introduced. This would allow us to keep the same mental model, but still generate good code.

At some point you want to run the vectorizer after Polly. Why require a post-transformation pass if good code generation is not harder than what we have at the moment?

Some questions I have before I get into reviewing this code:

a) Does your patch need to make any effort to eliminate dead PHI nodes? Will we have a lot of PHI nodes that might need to be dead-code eliminated after your patch? More precisely, would we now need to run a phase of DCE after Polly instead of Mem2Reg?

It does produce dead PHI nodes, though, judging from my experiments, not many. Additionally, these dead nodes should not harm vectorization or anything else and will be removed eventually. I did not implement an AST SSA generation that does not generate dead PHIs (e.g., http://pp.info.uni-karlsruhe.de/uploads/publikationen/braun13cc.pdf) since I wanted it to be simple. In our special case one could easily remember all placed PHIs and use the number of users plus the operands to remove all of them after code generation in an efficient way (without another pass).
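The cleanup sketched in that last sentence (remember all placed PHIs, then delete the ones without users, revisiting operands whose last user just disappeared) could look roughly like this. This is a simplified, self-contained model with hypothetical structures, not Polly's actual classes:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy representation of a generated PHI: the names of the values it uses
// and a count of how many users it currently has.
struct Phi {
  std::vector<std::string> Operands;
  int NumUsers;
};

// Worklist-driven removal of dead PHIs among the remembered ones. Erasing a
// dead PHI drops one use from each of its operands, which may in turn make
// another remembered PHI dead.
void removeDeadPhis(std::map<std::string, Phi> &Phis) {
  std::vector<std::string> Worklist;
  for (auto &KV : Phis)
    if (KV.second.NumUsers == 0)
      Worklist.push_back(KV.first);
  while (!Worklist.empty()) {
    std::string Name = Worklist.back();
    Worklist.pop_back();
    auto It = Phis.find(Name);
    if (It == Phis.end() || It->second.NumUsers != 0)
      continue; // already erased, or revived by a concurrent update
    for (const std::string &Op : It->second.Operands) {
      auto OpIt = Phis.find(Op);
      if (OpIt != Phis.end() && --OpIt->second.NumUsers == 0)
        Worklist.push_back(Op); // operand PHI just lost its last user
    }
    Phis.erase(It);
  }
}
```

This runs in time linear in the number of remembered PHIs and their operands, so no extra IR pass would be needed.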

b) Are there any remaining issues or inherent limitations you are aware of in comparison to the old approach?

Not that I am aware of.

c) I know you are still running performance numbers, which will be interesting to look at. Otherwise, is this patch ready to be reviewed?

Performance numbers are back. Both on parkas2, almost identical except 2 small compile time improvements.

sebpop edited edge metadata.Jan 27 2016, 2:27 PM

In general I like the change, though I do not understand the use of undef values in several places in the patch.

include/polly/CodeGen/BlockGenerators.h
308

What is a loop carried value?

313

Could we call these "loop phi" nodes instead of "loop carried phi" nodes?

There is possibly confusion in reading the abbreviated LCPHI: the existing convention is to read it as Loop Closed PHI, as in LCSSA.

lib/CodeGen/BlockGenerators.cpp
136

This comment is confusing: either remove it, or move it one line down and fix it to say "outside the scop".

1055

Unfinished comment?

sebpop added inline comments.Jan 27 2016, 2:27 PM
lib/CodeGen/BlockGenerators.cpp
425

Why are we forcing copies of instructions that can be synthesized?
Aren't these synthesizable instructions discarded in the first place by not being added to the scalar memory accesses of the stmt?

lib/CodeGen/IslNodeBuilder.cpp
549

I don't understand how these undef values are supposed to work.
Are you cleaning them up in a later pass?
I see in some of the testcases that they are still left in the IR after Polly.

Here is what I think are the two major improvements of this patch over the existing codegen:

  • the use of the scalar references for the stmt being translated: these are actually the only scalars we care about for cross BB dependences, so they need renaming and phi nodes at CFG junctions,
  • the use of the structure of the generated code to place the new phi nodes.

The more I think about this algorithm, the more I like it ;-)
Thanks Johannes!

Meinersbur edited edge metadata.Jan 28 2016, 9:22 AM

There is also an advantage in that we can verify the generated IR statically, such that wrong code (e.g., forgetting to initialize a value) is noticed much earlier by the IR verifier, before generating a wrong program that only gets noticed at runtime.

sebpop added inline comments.Jan 29 2016, 11:16 AM
lib/Support/ScopHelper.cpp
481

You could also check whether all arguments of the MergePHI node are the same, in which case you do not need the phi node, and just return the first arg.


There is also an advantage in that we can verify the generated IR statically, such that wrong code (e.g., forgetting to initialize a value) is noticed much earlier by the IR verifier, before generating a wrong program that only gets noticed at runtime.

I fully agree. There were almost only compile-time problems when I wrote this. As I remember, when we fixed the scalar codegen the first time we saw a lot of runtime problems due to missing or wrong initialization.

include/polly/CodeGen/BlockGenerators.h
308

Any scalar value that is defined and used in different iterations of the same loop. Prior to the transformation this means it is used by a PHI; after the transformation it might also refer to a scalar that is used in a loop textually before it is defined.

An example would be the following original code:

for (i = 0 ... N) {
  a = A[i];
  A[i+1] = a;
}

with the following "optimized" version:

for (i = 0 ... N+1) {
  if (i > 0)
    A[i] = a;
  if (i < N + 1)
    a = A[i];
}

Here the former non-loop carried scalar a is now loop-carried (thus needs a PHI).

313

Sure.

lib/CodeGen/BlockGenerators.cpp
136

I'll repair this comment.

425

Because we (might) need these values as operands of PHI nodes we haven't constructed yet, and we need them to be computed/placed at the correct location. Later on, when we create the PHI, we would no longer know where to place the operand code, or even which operand to create.

In the following example, which could be the result after scheduling, we would not know where to place which version of x (x1 or x2) when we create code for the join point after the conditional. It would require a backward search on the AST to find the location where the PHI x was last written by each predecessor in order to get the operand and its location. Forcing them to be copied in the first place solves this quite nicely, as we always know the values we will later need are actually present and mapped.

if (...) {
  x1 = 2*i;
S: /* lots of code */
} else {
  x2 = 3*i;
P: /* lots of code */
}
x3 = phi(x1, x2)

1055

Yes, I'll fix this.

lib/CodeGen/IslNodeBuilder.cpp
549

Undef values are already "used" in the current code generation but simply hidden well enough. When a scalar is used before it is initialized we basically get an undef; currently this means a load from an uninitialized alloca is basically an undef. As we promoted these allocas, loads, and stores, we now see the undefs in the IR; however, they should only exist where the alloca content was undefined before. An example:

int x;
for (i = 0 ... N) {
  if (i > 0)
    A[i-1] = x;
  x = A[i];
}

Here the initial value of x is undefined and therefore the PHI in the loop header as well as the PHI after the loop will have one undef operand.

In the following example there were no undefs prior to Polly in the IR but right after code generation there are.

if (a) {
S: x = A[i];

/* split BB */

P: A[i] = x;
}

Now we split the conditional for whatever reason and Polly will generate a CFG that kinda looks like this:

if (a)
S: x1 = A[i];
x = phi(undef, x1)

if (a)
P: A[i] = x;

We need the undef for the path that does not define x (or x1). Does that make sense?

lib/Support/ScopHelper.cpp
481

That is what this code is doing, or at least it was supposed to do that.
Note that the set "Values" contains all operands of the MergePHI, and if it contains only one value we know all operands are the same; otherwise we know they are not.

Do you see a problematic case?
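For illustration, the check being discussed can be modeled like this. This is a toy stand-in for the ScopHelper.cpp logic with hypothetical names, using strings in place of llvm::Value:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Collect the looked-up incoming values into a set; if only one distinct
// value remains, all predecessors agree and no merge PHI is required.
// Otherwise a PHI placeholder over the incoming values is produced.
std::string mergeIncoming(const std::vector<std::string> &Incoming) {
  std::set<std::string> Values(Incoming.begin(), Incoming.end());
  if (Values.size() == 1)
    return *Values.begin(); // all operands identical: return the value directly
  std::string Phi = "phi(";
  for (const std::string &V : Incoming)
    Phi += V + ",";
  Phi.back() = ')'; // replace the trailing comma (Incoming assumed non-empty)
  return Phi;
}
```

The set is what "collapses" equal incoming values, which is the detail discussed below.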

Any progress?

Thank you for the awesome patch. AFAIU the algorithm is "follow each generated MK_PHIKind through to the end of the generated CFG while adding PHIs to make them usable", ignoring whether the defs are used or how the PHIs were placed in the original code. Could you explain this somewhere, preferably not only in the commit message?

Is it correct that this gives us O(n*m) generated PHIs (n=number of MK_PHIKind, m=number of statements)?

I still don't really understand the "ScalarMap". What is the difference to BBMaps? Do we need both? It uses original defs as well as merge PHIs as keys. I see the test cases motivating this, but could this be avoided by, e.g., adding the incoming values in a separate pass instead of on-the-fly?

I had 17 LNT test fails with -polly-position=before-vectorizer (none without it). Interestingly I also got 4 unit test fails under Windows but not Linux. Is there maybe some nondeterministic ordering? The failing tests are:

Polly :: Isl/CodeGen/phi_loop_carried_float.ll
Polly :: Isl/CodeGen/phi_loop_carried_float_4.ll
Polly :: Isl/CodeGen/phi_loop_carried_float_escape.ll
Polly :: Isl/CodeGen/phi_scalar_simple_1.ll

If you rebase and the problem is still there, I'd debug this for you under Windows.

include/polly/CodeGen/BlockGenerators.h
125

I generally don't like the idea of a private/protected field being sneakily modified from outside the object (here: IslNodeBuilder).

308

Referring to "(possibly)": What is the function supposed to do if the value is not loop-carried (in what loop)?

313

Idea: Maybe define in some comment what exactly a "loop-carried phi" is? Eg. a PHI in a loop header.

574

ScalarMaps should have its own comment. AFAIK Doxygen treats this as an uncommented field.

include/polly/CodeGen/IslNodeBuilder.h
69

Can you explain "floating"?

As Sebastian also noted, on seeing "LCPHI" I was first thinking about the PHIs for LCSSA.

71

Could you explain a bit more about "ScalarMaps"? What is it mapping from/to? How/where is it used? What are valid keys?

82

Why moving this definition?

include/polly/Support/ScopHelper.h
170

This would be an excellent place to explain how a "merge PHI" is different from other PHIs.

176

This function returns void

lib/Analysis/ScopDetection.cpp
90

I assume this is a leftover from debugging.

lib/CodeGen/BlockGenerators.cpp
232

This change is surprising to me. Aren't the hoisted loads stored in GlobalMap anymore?

265

What is the reason to not insert the PHI into the generated BB at this point (but to keep them "floating")?

277–281

Could you add comments to explain these cases?

339

block-intern_al_?
mapping_s_

442

"textually" doesn't seem the right word; we are not doing text processing here. Should be something about processing order.

1165

Unrelated?

lib/CodeGen/IslNodeBuilder.cpp
61

Array->getNumberOfDimensions()!=0 seems redundant. MK_PHI always has 0 dimensions.

522

Naive question: Is there some way we could have a nested instance of IslNodeBuilder/function call with fresh values instead of temporarily storing LCPHIs, LoopDepth, PreMap away and restoring it afterwards?

523–524

std::move

549

Could you explain this in the code as a comment?

798

std::move

lib/Support/RegisterPasses.cpp
45

Assuming debugging leftover

lib/Support/ScopHelper.cpp
468

If the incoming value is not found in ScalarMap, I can think of these reasons:

  1. It's defined before the Scop.
  2. We forgot to put it in there.
  3. It was defined only in one incoming branch.

In none of these cases would "undef" make sense. In 3) it cannot be used without an explicit PHI in the original code. What other case is there?

476

Wouldn't it be better to name the PHI after the original value instead one of the incoming values?

479

Because Values.size() will be equal to BBs.size(), this condition can be tested before creating an unused MergePHI.

Thank you for the awesome patch.

I am glad somebody else likes the idea too and I got some more feedback!

AFAIU the algorithm is "follow each generated MK_PHIKind through to the end of the generated CFG while adding PHIs to make them usable", ignoring whether the defs are used or how the PHIs were placed in the original code. Could you explain this somewhere, preferably not only in the commit message?

It is a bit more general but the idea is correct. Track all generated MK_Value and MK_PHI accesses through the CFG and add new PHIs if different versions for the same base address join in a block with multiple predecessors. [This is a short description I can build on and place somewhere].

Is it correct that this gives us O(n*m) generated PHIs (n=number of MK_PHIKind, m=number of statements)?

Mh, it is unfortunately not that simple. First, "n" should also include MK_Value but that is not even the problem. The problem is with "m", which is not clearly defined. "m" has to be the number of "join points" in the generated CFG. This number can grow exponentially with the number of statements/blocks in the generated CFG (which can be arbitrarily higher than the number of statements in the SCoP). Thus, one can achieve O(n*2^m) PHIs, but that also means the optimized version has 2^m different paths through the region, which will cause explosions in various other places too. That said, I am not even sure this kind of analysis makes sense because:

We will now only generate a PHI (for one particular SSA-value) if it is "needed" to join different versions of the original "value". While these PHIs can be dead, the total number of PHIs is linear in the number of paths. Additionally, if a path does not introduce a new version of the SSA-value there won't be a PHI once it is joined and if it does contain a new version of an SSA-value we would have placed a store before and mem2reg would have placed the PHI later. The main difference in the complexity comes from dead PHIs that are (as mentioned) limited by the number of paths and could be removed easily after code generation by us.

I still don't really understand the "ScalarMap". What is the difference to BBMaps? Do we need both?

In short: BBMaps are "intra-statement" maps for all local ssa-values but ScalarMaps are (control-)scope-aware "inter-statement" maps for some ssa-values.

Summarized:

  1. ScalarMaps are used to remember mappings between statements
  2. ScalarMaps are (control-)scope-aware, thus each (control-)scope (here loop or conditional) has its own ScalarMap that is initialized with the ScalarMap of the parent (control-)scope.
  3. ScalarMaps do not contain all ssa-value mappings but only those that are interesting, thus have a SAI object.
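A toy model of such stacked, scope-aware maps (illustrative only, not Polly's actual API; plain strings stand in for llvm::Value):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Entering a (control-)scope pushes a copy of the parent's map, so the new
// scope starts with the parent's mappings; lookups and updates hit the
// innermost scope; leaving a scope discards its updates and exposes the
// parent's mappings again.
class ScopedScalarMaps {
  std::vector<std::map<std::string, std::string>> Stack;

public:
  ScopedScalarMaps() { Stack.emplace_back(); } // outermost scope
  void enterScope() { Stack.push_back(Stack.back()); }
  void leaveScope() { Stack.pop_back(); }
  void set(const std::string &K, const std::string &V) { Stack.back()[K] = V; }
  std::string lookup(const std::string &K) const {
    auto It = Stack.back().find(K);
    return It == Stack.back().end() ? "" : It->second;
  }
};
```

Merging at a join point then amounts to comparing the child scope's map against the parent's before the child scope is popped.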

It uses original defs as well merge PHIs used as key. I see the test cases motivating this, but could this be avoided by e.g. adding the incoming values in a separate pass instead on-the-fly?

It basically unifies the old ScalarMap and the old PHIOpMap using this trick. One could have two (control-)scope-aware maps again (or maybe do something else) but I do not think it is worth it.

I had 17 LNT test fails with -polly-position=before-vectorizer (none without it). Interestingly I also got 4 unit test fails under Windows but not Linux. Is there maybe some nondeterministic ordering? The failing tests are:

Polly :: Isl/CodeGen/phi_loop_carried_float.ll
Polly :: Isl/CodeGen/phi_loop_carried_float_4.ll
Polly :: Isl/CodeGen/phi_loop_carried_float_escape.ll
Polly :: Isl/CodeGen/phi_scalar_simple_1.ll

If you rebase and the problem is still there, I'd debug this for you under Windows.

I will rebase it soon and push it here. Then I will also check LNT again.

include/polly/CodeGen/BlockGenerators.h
125

Agreed, but to be fair, we had those reference maps before too. The only alternative that I see right now is a "ConstructionContext" (similar to the DetectionContext in ScopDetection) that is handed to the functions and can be modified. While this might be a good idea I do think we can postpone it a bit.

308

Good question. The "(possibly)" hints at the fact that one does not know whether a value will be loop-carried or not, but if it is, one needs the PHI. To this end we can generate the PHI even though we might not need it.

313

Can do.

574

This is true, thanks for the catch.

include/polly/CodeGen/IslNodeBuilder.h
69

I can change the name here too, and "floating" means not placed in a basic block [can be added to the comment].

71

Sure

82

Because the constructor/initializer code looks nicer when we order the members this way. I do not care too much about it, and if you feel we should not move this I can undo it.

include/polly/Support/ScopHelper.h
170

I am not sure what you mean by other PHIs, and therefore not what exactly you want to see here, but I agree, I can add some comment here.

176

I know, hence the comment in the @returns clause.

lib/Analysis/ScopDetection.cpp
90

Indeed.

lib/CodeGen/BlockGenerators.cpp
232

Yes and no. We do not need to look up in the GlobalMap here because the hoisted loads are propagated through the scalar maps to the BBMap. One could just as easily not do this and I am not sure which is better.

265

This function and all callers are in the BlockGenerator, but the block where these PHIs are supposed to reside [the loop header of a Polly-generated loop] is created in the IslNodeBuilder (or, more precisely, the LoopGenerator). Thus, we either keep them floating or remember loop header blocks somewhere in the BlockGenerator. I chose the first as we can easily place them in the IslNodeBuilder if they are actually needed or delete them otherwise.

277–281

Sure.

339

Thx.

442

AFAIK this is a standard term that we also used in our discussions and emails before. Processing order would not meet the requirements anyway, as the actual dynamically executed order is not what this is about. "Textually" means the order in which you statically write/see/discover things; in other words, the order in which you first see the statements in a linearized/dumped AST if you read it like a page (from top to bottom).

1165

Maybe, I do not recall.

lib/CodeGen/IslNodeBuilder.cpp
61

Probably true, I do again not recall why I did it this way.

522

Possibly, but that is not an easy task and it would not even help much. We would need to copy the ScalarMap to the new "instance" anyway. If we think of a good solution we can implement it in the parallel code generation too (we currently copy the maps there too).

523–524

That would "destroy" LCPHIs here, wouldn't it? AFAIK you shall not access something that you passed to std::move. We still need the container, just without its content.

549

Sure, but I will have to rephrase it a bit.

798

[See above for more comments] I would like to get it in like this and we can optimize it later, especially since std::move might trigger hard to debug problems I want to avoid for now.

lib/Support/RegisterPasses.cpp
45

Indeed, again.

lib/Support/ScopHelper.cpp
468
1) and 2) cannot/should never happen, or would at least be bugs: 2) is clearly a bug and 1) should always be present.

The reason for the undef here is the same as above, thus 3).

Starting with:

   if (a) {
S:   x = A[i];
P:   A[i] = x;
   }

the scheduler can generate:

   if (a) {
S:   x = A[i];
   }

   if (a) {
P:   A[i] = x;
   }

which needs a PHI (for x after the conditional containing S) even though the original code did not contain/need any. Additionally, there is only one path to that PHI that defines a value for x, but we need an operand for the other path reaching the PHI. As I explained to Sebastian, these undefs were basically present before, as "content" of the alloca slots we generated. If you look at the code we generate at the moment for the example above and run mem2reg, you will see the same undefs again.

476

Maybe, I do not recall if this was a conscious decision or not.

479

It is not necessarily equal to BBs.size() as the mappings in several BBs can be the same. If that happens to be the case for all BBs we can remove the PHI immediately, but we do not know that until we have looked up all mappings.

Is it correct that this gives us O(n*m) generated PHIs (n=number of MK_PHIKind, m=number of statements)?

Mh, it is unfortunately not that simple. First, "n" should also include MK_Value but that is not even the problem.

Right; didn't think about MK_Value at that moment.

The problem is with "m" which is not clearly defined. "m" has to be the number of "join-points" in the generated CFG. This number can grow exponentially with the number of statements/blocks in the generated CFG (which can be arbitrarily higher than the number of statements in the SCoP).

I was thinking about statements in the isl_ast, but meant the number of nodes in there. For each isl_ast_node (isl_ast_node_for and isl_ast_node_if), can there be at most one "join-point"?

Thus, one can achieve O(n*2^m) PHIs, but that also means the optimized version has 2^m different paths through the region which will cause explosions in various other places too. That said, I am not even sure this kind of analysis makes sense because:

We will now only generate a PHI (for one particular SSA-value) if it is "needed" to join different versions of the original "value". While these PHIs can be dead, the total number of PHIs is linear in the number of paths.

The number of paths is already 2^m? Per join point we can also have as many PHIs as scalars, no?

Additionally, if a path does not introduce a new version of the SSA-value there won't be a PHI once it is joined and if it does contain a new version of an SSA-value we would have placed a store before and mem2reg would have placed the PHI later. The main difference in the complexity comes from dead PHIs that are (as mentioned) limited by the number of paths and could be removed easily after code generation by us.

Just exploring whether we can have a case where the number of unused PHIs grows excessively such that it would introduce another scaling problem.

In short: BBMaps are "intra-statement" maps for all local ssa-values but ScalarMaps are (control-)scope-aware "inter-statement" maps for some ssa-values.

Summarized:

  1. ScalarMaps are used to remember mappings between statements
  2. ScalarMaps are (control-)scope-aware, thus each (control-) scope (here loop or conditional) has it's own ScalarMap that is initialized with the ScalarMap of the parent (control-) scope.
  3. ScalarMaps do not contain all ssa-value mappings but only those that are interesting, thus have a SAI object.

Thank you for the explanation.

include/polly/Support/ScopHelper.h
176

@return(s) is intended to document the function's return value, but this function doesn't return anything. MergeMap should be documented using eg.

/// @param MergeMap Receives the merged mappings in @p MergeMap.
lib/CodeGen/BlockGenerators.cpp
265

OK

442

OK, but never seen this personally.

lib/CodeGen/IslNodeBuilder.cpp
523–524

std::swap then?

ValueMapT PreLCPHIs;
LCPHIs.swap(PreLCPHIs);

Although I'd agree this doesn't really make the intent clearer; it's up to you.

lib/Support/ScopHelper.cpp
468

Thank you for the explanation with example

479

Thank you; I missed that Values is a set that 'collapses' equal values.

I think I could write this patch a little differently and we would never build any dead PHIs. However, that would require backtracking, which might or might not turn out to be faster.

Backtracking sounds like a function to get the generated value on demand, and following the control flow upstream by calling itself. Sounds like a reasonable idea to avoid generating too many unused PHIs.
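Such an on-demand backtracking lookup might be sketched as below. This is a toy model for an acyclic CFG in the spirit of Braun et al.'s SSA construction, not Polly's actual code (loop headers would need extra handling to avoid infinite recursion); all names are hypothetical:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy CFG block: local scalar definitions and predecessor links.
struct Block {
  std::map<std::string, std::string> Defs; // scalar name -> generated value
  std::vector<Block *> Preds;
};

// Get the generated value for Scalar at the end of B, following the control
// flow upstream on demand. A PHI is materialized only when the predecessors
// actually disagree; otherwise the unique incoming value is cached and used.
std::string lookup(Block *B, const std::string &Scalar) {
  auto It = B->Defs.find(Scalar);
  if (It != B->Defs.end())
    return It->second; // locally defined (or already cached)
  if (B->Preds.empty())
    return "undef"; // reached the entry without a definition
  std::string First = lookup(B->Preds[0], Scalar);
  for (size_t I = 1; I < B->Preds.size(); ++I)
    if (lookup(B->Preds[I], Scalar) != First)
      return B->Defs[Scalar] = "phi." + Scalar; // disagreement: place a PHI
  return B->Defs[Scalar] = First; // all predecessors agree: no PHI needed
}
```

The trade-off discussed below is visible here: no dead PHIs are ever created, but each lookup may walk a large part of the CFG.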

But I don't know whether it could even introduce a scaling problem and is therefore worth the increased implementation complexity (or maybe it's even simpler?).

What do you think?

Backtracking sounds like a function to get the generated value on
demand, and following the control flow upstream by calling itself.
Sounds like a reasonable idea to avoid generating too many unused
PHIs.

Yes, but there is little evidence so far that we generate "too many"
unused PHIs with this patch for interesting code.

But I don't know whether it could even introduce a scaling problem and is therefore worth the increased implementation complexity (or maybe it's even simpler?).

I think we can do it with reasonable implementation complexity; however,
you have to consider that we currently (without SSA codegen) also have a
trade-off between the "number of PHIs" and the "complexity to place them",
even though it is only implicit. The patch as presented is very good in
terms of the complexity to place PHIs but consequently not too smart in
placing them. Alternatively, backtracking would place PHIs smartly, but
it can cost a lot more to place a PHI in the first place. The overall
complexity of the code is in the worst case always the same [now and with
either SSA-codegen approach], but the complexity to get to the code is
what differs.

To conclude: I haven't made up my mind what we want, and I don't think
you can say "what is best" in general.

Cheers,

Johannes

- {F1785844, layout=link}

Yes, but there is little evidence so far that we generate "too many"
unused PHIs with this patch for interesting code.

With 'it' in my second paragraph I was referring to D15722 (without backtracking), i.e. I was trying to express exactly this. Sorry for my ambiguous writing.

Generally we should be looking for arguments/proof that some code does not introduce scaling issues instead of waiting for evidence that there is one. Otherwise the evidence will come in the form of users complaining about excessive compilation time with their sources after a release.

However, in this case I am not too concerned. The worst case would be to have many instructions at the beginning of the scop and a PHI node for each of them at each join point. mem2reg might not perform better than the current implementation in that case. Therefore I suggest leaving it with the current implementation (i.e. D15722).

Hi Johannes,

thank you again for having looked into the SSA codegen. I just spent the day looking through this patch. It indeed contains some very nice ideas and shows that it is probably possible to directly generate SSA without going through memory.

While playing with your patch, I implemented the earlier proposed approach of using LLVM's Mem2Reg utility from inside our code generation (

). With an almost trivial change it seems to achieve the same goals while additionally guaranteeing a minimal set of PHI nodes and coming with a minimal risk of regressions.

I tried to find an argument for why we want to go for this 500-line patch, which still needs polishing, a review of the polished version, and also brings the risk of introducing regressions:

Going through the discussion I found two arguments for this patch:

  1. The new patch exposed some bugs at compile time, that we missed earlier on
  2. The new patch is smaller and easier to understand than what we have today

I agree that 1) may hold for some cases.

Regarding 2), at least in terms of code size there is not yet too much difference. Only looking at include/ and lib/, I see 497 insertions(+) and 634 deletions(-), which is 137 lines fewer for the new solution. On the other hand, this patch drops a large 120-line comment describing the previous approach in detail. In addition, the removal of getNewScalarValue that was proposed in this patch has already been incorporated in trunk. Hence, I expect that a rebased version of this patch that adds additional documentation and possibly even code to remove unnecessary PHI nodes is at least not shorter in code size.

Regarding the complexity of the approach, I personally believe that the current approach is as easy to understand as the new approach. In part that may be because I have written a lot of documentation for the old code and miss a concise description of the new approach with examples that illustrate the corner cases. Some areas such as non-affine regions are probably easier to understand with the new code, as finding the right spot for the scalar stores is not easy. However, the new approach possibly becomes more difficult when we want to guarantee that a minimal set of PHI nodes is generated (which we probably want, as we have a solution that can do this automatically).

To get a better understanding of this patch, I would really need a clean version of this patch where existing review comments are addressed, the remaining Windows/SIMD bugs resolved, the PHI node minimization implemented and some overview documentation added. In the optimal case we can make a convincing statement why the new approach is linear in compile time and results in a minimal set of PHI nodes as we get it when using Mem2Reg.

Considering this trivial patch that I posted, do you (and the other reviewers) see sufficient benefits in this larger patch that it would be worth spending several days on rebasing and reviewing it and risking the introduction of regressions?

I personally would prefer to currently focus more on your latest assumption-tracking changes and give you feedback on those. Hence, I would suggest starting off with (

) and using it to introduce all the interesting test case changes that come with this patch. They should match almost 1:1 and would be non-controversial and clearly beneficial.

From this baseline we can then -- when time allows and the need arises -- pick up this patch. If we only introduce it for stylistic reasons, the already changed test cases will ensure that IR changes such as the vectorization issue that has been overlooked stand out nicely. However, as I personally could see additional benefits coming from this patch (e.g. because we want to run cost models on partially generated IR), it might even make sense to introduce this change when, in the future, non-memory scalars become necessary even during code generation.

If you guys strongly believe we should go for the new approach now, I can probably take another day to check it thoroughly for other corner cases that may cause trouble before you start rebasing. (This is not because I don't trust you, but this is a major change to a piece of code which we spent a significant amount of time getting stable, so I believe we need to be very careful not to regress here.)

lib/CodeGen/BlockGenerators.cpp
458

This patch nicely pointed out that getNewScalarValue is unnecessary and its uses can be replaced with getNewValue. This simplification was performed on Jan 26 in https://llvm.org/svn/llvm-project/polly/trunk@258799, such that future versions of this patch will not need to perform this simplification any more.

1007

Why is the last function dropped? This is a sanity check that works on the ScopInfo and is unrelated to how we generate scalar code. It verifies that no scalar write statements are in this ScopStmt, which is likely to cause trouble both for the to-memory scalar codegen and for the SSA codegen. Consequently, I think it makes sense to keep this check.

1038

By dropping generateScalarVectorLoads this patch changes behavior. Originally the code splatted all scalar values into a vector-value. After this change, no vector values are generated. This is likely the reason test/Isl/CodeGen/simple_vec_stride_one.ll changes.

test/Isl/CodeGen/phi_loop_carried_float_5.ll
5

This test case is incomplete. Without a comment it is unclear what exactly is tested here. (The test case seems to be interesting, though)

test/Isl/CodeGen/phi_scalar_simple_1.ll
61

These check lines are misleading as the current patch is generating more (unnecessary) PHI instructions here. In the optimal case we would not even generate them. As long as we generate them, we should list all of them with CHECK-NEXT and add a TODO that some of these PHIs are unnecessary and expected to be deleted later.

test/Isl/CodeGen/phi_scalar_simple_2.ll
16

If we are dropping here almost all CHECK lines, this test case becomes useless. Either we should make an argument that it does indeed not test anything interesting and just delete it or we should add CHECK lines similar to the ones added in phi_scalar_simple_1.ll.

If we make the argument this test case is redundant, it should be deleted in a separate commit.

test/Isl/CodeGen/scalar-store-from-same-bb.ll
14

Nice!

test/Isl/CodeGen/simple_vec_call.ll
35

This is a trivial change, but mostly unrelated to the proposed patch. I would prefer to introduce REGEXP matches ahead of time to keep the review and patch focused. I did this in r267875 so this change won't show up in a possible future version of this patch.

test/Isl/CodeGen/simple_vec_stride_one.ll
8

Instead of a vector store, we now generate four scalar stores. This is a regression that seems to be introduced accidentally in this patch.

test/Isl/CodeGen/srem-in-other-bb.ll
13

Trivial, but nice!

test/ScopInfo/out-of-scop-use-in-region-entry-phi-node-nonaffine-subregion.ll
1

The file mode change is unrelated and should be a separate commit. This has already been fixed in r266473.

For me the main argument for this patch is that it catches potential problems earlier (Your argument 1).

My secondary argument is that it reduces the overall complexity of the system. If currently and with F1855864 we'd do:
MemoryAccesses -> Load/Stores -> Registers
this patch would just do
MemoryAccesses -> Registers
That is, there is one step less, less concern about unpredictable interactions, and maybe even faster compilation.

I don't think there is a meaningful difference between F1855864 and running a mem2reg/SROA pass afterwards, as we do now, e.g. in CodegenCleanup.

For me the size of a patch is no argument unless we are maintaining a legacy code base. While new code may introduce new issues that are to be fixed, this will eventually be compensated in the long term, even if the advantages are small. The number of lines is also not a useful metric for complexity/understandability.

My arguments against this patch would be:

  • I actually find it harder to understand than the current one. MemoryAccesses are modeled as READ/WRITE actions, so it is more straightforward to also just generate LoadInst/StoreInst.
  • I am less than 100% sure that cases where it generates superlinearly more PHIs than necessary never happen in practical code.

Hence, this comes down to what our priorities are. I'd vote in favor of SSA codegen.

@Tobias @Michael:

First, thanks for your thoughts on this patch! Second, I agree on
basically all the key facts, both positive and negative. Nevertheless,
I'd like to go forward with the general idea as I think the positive
aspects outweigh the negative ones.

@Tobias: I can come up with an improved patch that is less complex and places only needed PHI nodes. Would that be enough to get it in if we do not show regressions? If not, I will not invest any more time.


sebpop resigned from this revision.Sep 19 2016, 1:21 PM
sebpop removed a reviewer: sebpop.
grosser resigned from this revision.Sep 30 2016, 12:23 PM
grosser removed a reviewer: grosser.

I resign from this patch as it is outdated and I am also not yet really convinced that direct SSA generation is the way to go. It is a pretty complicated change which is hard to get right in all corner cases, makes the code more difficult to understand, and does not really improve things either. There is a reason clang does not generate SSA, but relies on -mem2reg ;).
I admit that there may certainly be some benefits in this, but I would suggest starting a discussion on the mailing list before anybody looks into this again.