This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Create virtual independent blocks
AbandonedPublic

Authored by jdoerfert on Oct 9 2015, 5:28 PM.

Details

Summary
Even with the scalar/PHI modeling and code generation we still
utilized the independent blocks pass to move scalars around in order to
decrease the dependences caused by them. However, this has two major
drawbacks:
  o Polly could not be used as analysis only pass by default and
    without the independent blocks pass the result suffered.
  o The independent blocks pass had limited knowledge about domains
    as well as accesses. Both limited the pass to simple
    movement/re-computation of scalars that can be trivially moved.

This patch is a drop-in replacement for the independent blocks pass
that first virtually moves/re-computes scalars in the SCoP description.
Only in the code generation re-computation will actually take place.

The logic is split in three traversals of the scalars in the SCoP:
  1) Collect interesting operands for each scalar definition that
     escapes the block it resides in.
  2) For each scalar use, check if the definition can be recomputed
     in its statement based on the interesting operands of the
     definition. If so, remove the access and remember that the
     definition has to be recomputed later.
  3) For each scalar definition, check if all uses have been
     eliminated through virtual re-computation. If so, remove the
     access as it does not need to be communicated anymore.

Subsequent patches can increase the capabilities easily as the domain
and access information is already available.

  --------------------------  REVIEW NOTES  --------------------------

    - lnt & unit tests are green
    - We had 245 scalar access in the test suite before, now 236. In
      total this logic eliminated 21 trivially recomputable scalars
      and allowed to eliminate 54 instead of 49 statements.
    - The buildbots show the following compile/execution time changes for
      a similar but prior version of this patch:

      Compile time:
        Improvements           : Δ       : Previous : Current : σ
        .../bmm                : -71.43% : 0.6440   : 0.1840  : 0.0054
        .../paq8p              : -12.48% : 5.8644   : 5.1323  : 0.0181
        .../consumer-lame      : -10.13% : 10.3444  : 9.2961  : 0.0376
        .../sieve              : -5.61%  : 0.4280   : 0.4040  : 0.0093
        .../espresso           : -1.35%  : 10.3841  : 10.2441 : 0.0395
        ..//Recurrences-dbl    : -1.24%  : 4.1923   : 4.1403  : 0.0150
        .../471_omnetpp        : -1.08%  : 42.1606  : 41.7046 : 0.1013

      Execution time:
        Regressions            : Δ       : Previous : Current : σ
        .../salsa20            : 4.32%   : 6.6684   : 6.9564  : 0.0059
        .../GlobalDataFlow-dbl : 1.53%   : 4.6963   : 4.7683  : 0.0047

        Improvements           : Δ       : Previous : Current : σ
        ..//viterbi            : -1.61%  : 2.7402   : 2.6962  : 0.0072

Diff Detail

Event Timeline

jdoerfert updated this revision to Diff 37007.Oct 9 2015, 5:28 PM
jdoerfert retitled this revision from to [Polly] Create virtual independent blocks.
jdoerfert added reviewers: grosser, Meinersbur.
jdoerfert updated this object.
jdoerfert added a subscriber: Restricted Project.
jdoerfert updated this object.Oct 9 2015, 7:32 PM
grosser edited edge metadata.Oct 15 2015, 3:40 PM

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

  1. Post-processing statements vs. instruction-list ScopStmts

Your current approach of first building all ScopStmts and then post-process them to eliminate scalar dependences is interesting. However, it requires a special post-processing phase in ScopInfo and additional special-purpose code in ScopStmt and BlockGenerator. I wonder if it may make sense to build ScopStmts from the beginning just as a list of instructions (instead of a basic block) which could then be uniformly processed in the BlockGenerator.

Best,
Tobias

include/polly/ScopInfo.h
970

that needs to be recomputed

lib/Analysis/ScopInfo.cpp
1409–1410

use is / uses are

2794

"The latter to add" does not seem to make sense grammatically.

2919

grammar

2924

I am not yet convinced an exit ScopStmt is something we would like to have. Maybe leave this for another patch review.

Meinersbur edited edge metadata.Oct 15 2015, 4:17 PM

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

To add some details:

D12975 tracks which values flow between ScopStmts which this patch modifies at a very late phase. There should be fewer such implicit flows, but some of those flows might have already be created with origin MAPPED. The array element occupied by this MAPPED flow might be used for some other implicit flow, meaning De-LICM before virtual blocks is ineffective. It is also more difficult to modify MAPPEDs because lifetime issues have to be considered and there is no distinction between SCALAR and PHI origins (not sure how much of a problem this is though).

One of my rationale to do De-LICM before even creating MemoryAccesses is that it is easier to implement (no moving, copying, deleting of MemoryAccesses) and reduced overhead because fewer MemoryAccesses are created in the first place.

So we have the possibilities to either do "simplifyScalarAccesses" before creating MemoryAccesses (e.g. by marking Instructions to where they should be copied) or "De-LICM" after simplifyScalarAccesses.

lib/Analysis/ScopInfo.cpp
1420

I do not understand this code. Why did you even remove any comment about what it is doing?

2782

God function antipattern

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

Yeah, at the time this was written it was fine but as far as I understand it the MAPPED accesses (and possibly more) implicitly change the semantics of the SCoP description without it beeing actually changed (except the "MAPPED" flag). One could argue that this should look for MAPPED accesses but I don't, instead I will argue that "MAPPED" should not be an access property that is determined on the CFG and only a flag in the access description that turns on some alternate path in the code generation. An access that is not actually happening (no code is generated for) should not be in the SCoP description and the ones in there should happen where they are and read/write the location they define. If that is the case this patch should work properly.

  1. Post-processing statements vs. instruction-list ScopStmts

Your current approach of first building all ScopStmts and then post-process them to eliminate scalar dependences is interesting. However, it requires a special post-processing phase in ScopInfo and additional special-purpose code in ScopStmt and BlockGenerator. I wonder if it may make sense to build ScopStmts from the beginning just as a list of instructions (instead of a basic block) which could then be uniformly processed in the BlockGenerator.

Processing the SCoP allows general solutions which are not (easily) reproduceable on the CFG, e.g., determining if a load is overwritten is easy and precise but hard on the CFG. The instruction list is a different issue and it would remove not that much code/logic anyway. With the pending patches by Michael we will reload scalars at the beginning of a statmeent and as I pointed out on that review, alsmost the same is added here. While I did not get a valuable response on the idea (and implementation sketch) to merge the recomputation and the scalar reload at the beginning of a statement, I think it is what we ultimately want.

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

To add some details:

D12975 tracks which values flow between ScopStmts which this patch modifies at a very late phase. There should be fewer such implicit flows, but some of those flows might have already be created with origin MAPPED. The array element occupied by this MAPPED flow might be used for some other implicit flow, meaning De-LICM before virtual blocks is ineffective. It is also more difficult to modify MAPPEDs because lifetime issues have to be considered and there is no distinction between SCALAR and PHI origins (not sure how much of a problem this is though).

See the comment on

One of my rationale to do De-LICM before even creating MemoryAccesses is that it is easier to implement (no moving, copying, deleting of MemoryAccesses) and reduced overhead because fewer MemoryAccesses are created in the first place.

While moving, copying and deleting MemoryAccesses sound like a lot, one has to remember the number of times this can happen and the cost of each operation.
I think the fact that we get generally better results with this patch than without (see commit message) shows that the "overhead" is low enough.

The implementation difficulty is another issue. While this patch currently only forwards/copies scalars its general structure allows more, e.g., reload values instead of communicate them (see subsequent patch) or remove conditional PHI nodes and replace them by selects. While e.g., reloading is generally hard and costly on the CFG level (one has to determine if the value has been overwritten somewhere in between) the SCoP abstraction allows to reason about it in a simple and consise way.

So we have the possibilities to either do "simplifyScalarAccesses" before creating MemoryAccesses (e.g. by marking Instructions to where they should be copied) or "De-LICM" after simplifyScalarAccesses.

I already proposed the latter in a review last week but I don't recall a positive answer to that.

lib/Analysis/ScopInfo.cpp
1420

This code was basically moved and the original comment was left behind, hence the answer to your question is simple:

The comment has to be moved too.
2782

I take this comment should read as:

Please split the function in three parts.

Sure, can do.

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

Yeah, at the time this was written it was fine but as far as I understand it the MAPPED accesses (and possibly more) implicitly change the semantics of the SCoP description without it beeing actually changed (except the "MAPPED" flag). One could argue that this should look for MAPPED accesses but I don't, instead I will argue that "MAPPED" should not be an access property that is determined on the CFG and only a flag in the access description that turns on some alternate path in the code generation. An access that is not actually happening (no code is generated for) should not be in the SCoP description and the ones in there should happen where they are and read/write the location they define. If that is the case this patch should work properly.

The semantics of a Scop does not change. What does change is where implicits are stored. Like AccessInstruction and AccessValue and non-affine subregions, it is no information required by the polyhedral model itself, but required to associate polyhedral statments to the IR behind it.

To avoid modifying IR directly, we "virtualize" the relevant metainformation such that the Scop description and IR code do not map directly. Having to redirect loads/stores becomes unavoidable.

  1. Post-processing statements vs. instruction-list ScopStmts

Your current approach of first building all ScopStmts and then post-process them to eliminate scalar dependences is interesting. However, it requires a special post-processing phase in ScopInfo and additional special-purpose code in ScopStmt and BlockGenerator. I wonder if it may make sense to build ScopStmts from the beginning just as a list of instructions (instead of a basic block) which could then be uniformly processed in the BlockGenerator.

Processing the SCoP allows general solutions which are not (easily) reproduceable on the CFG, e.g., determining if a load is overwritten is easy and precise but hard on the CFG. The instruction list is a different issue and it would remove not that much code/logic anyway. With the pending patches by Michael we will reload scalars at the beginning of a statmeent and as I pointed out on that review, alsmost the same is added here.

I can't find such code in this patch. Where is it?

While I did not get a valuable response on the idea (and implementation sketch) to merge the recomputation and the scalar reload at the beginning of a statement, I think it is what we ultimately want.

What proposal are you refereing to? What scalar reload?

Hi Johannes,

great to see we will finally get rid of the IndependentBlocks pass. I just had a very first skim over this patch, but did not manage to do a full review yet.

The two main items I would still like to look into:

  1. This seems to conflict with http://reviews.llvm.org/D12975 (Michael's DE-LICM)

I need to get a better understanding in which ways these patches possibly conflict (as well as the overall design).

To add some details:

D12975 tracks which values flow between ScopStmts which this patch modifies at a very late phase. There should be fewer such implicit flows, but some of those flows might have already be created with origin MAPPED. The array element occupied by this MAPPED flow might be used for some other implicit flow, meaning De-LICM before virtual blocks is ineffective. It is also more difficult to modify MAPPEDs because lifetime issues have to be considered and there is no distinction between SCALAR and PHI origins (not sure how much of a problem this is though).

See the comment on

Which coomment?

One of my rationale to do De-LICM before even creating MemoryAccesses is that it is easier to implement (no moving, copying, deleting of MemoryAccesses) and reduced overhead because fewer MemoryAccesses are created in the first place.

While moving, copying and deleting MemoryAccesses sound like a lot, one has to remember the number of times this can happen and the cost of each operation.
I think the fact that we get generally better results with this patch than without (see commit message) shows that the "overhead" is low enough.

I have no general objection to it. However, you occasionally raise such concerns yourself (e.g. D13341) so I try to take them into account.

The implementation difficulty is another issue. While this patch currently only forwards/copies scalars its general structure allows more, e.g., reload values instead of communicate them (see subsequent patch) or remove conditional PHI nodes and replace them by selects. While e.g., reloading is generally hard and costly on the CFG level (one has to determine if the value has been overwritten somewhere in between) the SCoP abstraction allows to reason about it in a simple and consise way.

Such analysis on the polyhedral model requires expensive ISL operations to compute the dependences. One of our intents of simplifying the scop description is to make its analysis run faster. We'd do still to the complicated analysis on the large scop description to get a simpler on and redo the same analysis. I am not saying this that this cannot make sense, but the simplifications we want to perform (IndependentBlocks, DeLICM) are both on the IR level on which reasoning is much cheaper.

So we have the possibilities to either do "simplifyScalarAccesses" before creating MemoryAccesses (e.g. by marking Instructions to where they should be copied) or "De-LICM" after simplifyScalarAccesses.

I already proposed the latter in a review last week but I don't recall a positive answer to that.

AFAIU we about to discuss this.

jdoerfert updated this revision to Diff 39225.Nov 4 2015, 11:24 AM
jdoerfert edited edge metadata.

Rebased + Changed according to first comments

jdoerfert updated this revision to Diff 39231.Nov 4 2015, 11:34 AM

Fix minor errors

I have the impression that this very much relies on that accesses (mostly SCALAR READ, but also PHI writes in non-affine subregions) are computed per instruction, whereas in D13762 I tried to make them per-ScopStmt. The difference is that e.g. if there are two reads in the same BB, two loads would be generated. In D13762 I tried to ensure that there is only one load.

This patch seems to skip PHI accesses completely. In San Joe you mentioned that only loop-carried phis would be skipped. Where is that part? Where would DeLICM tell what memory can be reused?

I got the impression we could generalize this with in SCEV expander. A SCEV-like object (we already support SDiv/SRem) could store an entire operand tree of movable instructions and CodeGen would synthesize/materialize it on request. This looks like more streamlined with current mechanisms instead of introducing a new one. What do you think?

include/polly/CodeGen/BlockGenerators.h
422

"all" does not match the description below

426

... where they are used.

428

... thus ensures that all ...

483

llvm_unreachable is not defined to trigger a trap.

488

Missing documentation?

546

Flag to indicate that the instruction ...

include/polly/LinkAllPasses.h
36 ↗(On Diff #37007)

Separate patch to remove IndependentBlocks?

include/polly/ScopInfo.h
516

We can get the AccList from getStatement()->getParent()->AccFuncMap. Not necessary to pass it as parameter.

Rename to "copyTo"?

541

Unrelated?

932

I think we should not depend on the order of accesses. ISL does not understand such an order and depending on it will likely cause some bugs (Think of a load after storing to the same location; do we even handle that case yet?).

AFAIK ISL assumes that all loads happen before the stores.

940

OnlyMA=false removes multiple accesses, so move to a "removeAccessesCausedBy" function?

974

It would be better if the class would not expose its implementation as an API. It could return iterators or an ArrayRef.

1275

It would be more idomatic if each call would check only a single Instruction instead of passing a set implementation.

1291

I think you should clarify that you actually mean a flattened operand _tree_. I think LLVM uses the term operands as the direct arguments/parameters of an instruction.

1295

AFAICS it is only used during the execution of simplifyScalarAccesses. You could make it local and pass it to the functions that need it.

1330

The name "simplifyScalarAccesses" is very general. How about "recomputableDependentScalars"?

lib/Analysis/ScopInfo.cpp
667

What if it has fewer dimensions? If it is impossible, could you add an assertion to make it clear?

673

Doesn't this update InstructionToAccesses?

1420

I'd still ask you to refactor this code. It is hard to understand and looks fragile as it depends on order. Maybe you could use some functions from <algorithm>

2778

What are you plans here? When can instructions with side effects be recomputed?

2783

What is there to see?

2788

Why exit node PHIs processed, but not standard phis?

2800

Why for each access separately? Shouldn't this be per-ScopStmt because it only matters for cross-stmt operands?

2815

auto -> Use ?

2817

This probably will have issues with PHIs in the scop's entry just is in PR25394, ie. if InstOpInst is a PHI with an incoming block from outside the scop.

2824

and continue ?

2876

Wouldn't it be possible to recompute some of the dependent values, even if some operands have side effects?

2880

Why only instructions that have operands from outside? For movable instructions (DependentScalar) without such operands, but implicit accesses, don't we need to remove its memory accesses as well?

2885

This is using the list of accesses this function modifies on the fly ("InstructionToAccess"). As such, it seems to me to depend on the order we iterate over all statements. E.g. the previous iteration already removed the read access of "OutsideUser", hence will not be added to AdditionalAccesses anymore, hence we miss some read accesses.

Or am I wrong? Why is it this cannot happen?

2890

What does this condition mean?

I think we generally are misusing "BaseAddr" for scalar access, because it actually not an address.

2894

I'd expect this to be computed in collectNonTrivialOperands(), like a partition of instructions in the operand tree into

  • Movable
  • Outside/Sideeffect (operand tree leaves)

I understand your algorithm works differently, but to understand, what is the rationale to only have direct operands in DependentScalars, while we want to copy/move an entire operand tree?

2932

If a user has no more accesses, is it because it is intended to be removed?

2935

auto -> MemoryAccess ?

2940

Is this break meant to leave both for-loops?

2947

This seems to be a different kind of "resolved" than in removeRecomputableScalarReads.

2963

collectNonTrivialOperands actually skips PHIs which are also implicit accesses.

2972

hence -> i.e.

lib/CodeGen/BlockGenerators.cpp
67

When does a dependent scalar become a global?

86

Shouldn't these be in the BBMap after recomputeDependentScalars() executed?

130

Why does this become necessary? Shouldn't "trySynthesizeNewValue" determine itself whether it is synthesizable?

147

This essentially becomes

assert(TryOnly)
return nullptr;
169

What is the rationale for this condition?

171

Shouldn't this been handled by recomputeDependentScalars() already? How do we know that the operand is even a DependentScalar?

If this is because DependendScalars is unordered, wouldn't it be cleaner to ensure they are in their order of dependence?

226

I like this change, but could be committed independently. Maybe also as a member function of ScopStmt?

228

This idiom appears multiple times. Introduce a helper function?

334

Any reason to do this before generateScalarLoads()? Naively, I'd think it belongs after it because some if the dependent instructions might still use scalars (e.g. read-only ones).

test/Isl/CodeGen/srem-in-other-bb.ll
1

Nice, but unrelated

I have the impression that this very much relies on that accesses (mostly SCALAR READ, but also PHI writes in non-affine subregions) are computed per instruction, whereas in D13762 I tried to make them per-ScopStmt. The difference is that e.g. if there are two reads in the same BB, two loads would be generated. In D13762 I tried to ensure that there is only one load.

One vs. multiple loads/stores of scalars is only of little difference to us as it will only affect code generation. Everything before, especially dependence analysis, will be the same. Hence, consolidating scalar accesses only to save duplicate loads/stores in the genereted IR (that will be taken care of anyway) is no good reason to make any kind of scalar elimination harder.

This patch seems to skip PHI accesses completely. In San Joe you mentioned that only loop-carried phis would be skipped. Where is that part? Where would DeLICM tell what memory can be reused?

There are 5 patches. This is the first and it only deals with trivially recomputable scalars, thus the onces without any "interesting/complicated operands".
The following patches as well as DeLICM comes into play in the canRecomputeInStmt function which is (almost) the only place that needs to be updated to allow more functionality.

I got the impression we could generalize this with in SCEV expander. A SCEV-like object (we already support SDiv/SRem) could store an entire operand tree of movable instructions and CodeGen would synthesize/materialize it on request. This looks like more streamlined with current mechanisms instead of introducing a new one. What do you think?

If SCEV would allow floating point values maybe, till then, we would implement everything we have here somehow in our SCoPExpander which is not really helpful. Additionally, during the SCoP generation (e.g., when we check if we can synthezise scalars instead of communicating them) we cannot check for overwritten loads and free memory locations we can reuse.

include/polly/CodeGen/BlockGenerators.h
480

When we virtually move instructions we are not really interested in the loop surroinding them but their new parent block/statement, thus the change.

I tried to address (in words) all your comments. Once we agrred on how to proceed I will do code changes.

include/polly/CodeGen/BlockGenerators.h
426

I will check that.

428

I will check that.

546

I will check that.

include/polly/ScopInfo.h
940

We could do that.

1275

We could do that.

1291

Where do you expect a tree here? It actually contains transitive operands and I could add that word.

1295

We could do that.

lib/Analysis/ScopInfo.cpp
2815

We could do that.

2824

This looks weird without the follow up commits that introduce some code here.

2935

We could do that.

lib/CodeGen/BlockGenerators.cpp
226

I can commit it a priori but I don't see the need for a ScopStmt member function at the moment.

228

That's something not conflicting or required for this commit we can do.

334

First, by construction we can place it here as all recomputable scalars do not depend on anything we cannot recompute. Second, we can probably move it after the scalar loads generation but it should be all the same.

test/Isl/CodeGen/srem-in-other-bb.ll
1

I probably forogot that while I prepared this commit and all the test cases. I'm sorry.

jdoerfert added inline comments.Nov 6 2015, 8:31 PM
include/polly/CodeGen/BlockGenerators.h
422

All that are not communicated via memory but need to be recomputed.

483

I can use the original message instead. If you have a better way to describe llvm_unreachable I can put it here alternatively.

488

Yeah, somebody should add documentation to all this undocumented functions at some point.

include/polly/ScopInfo.h
516

I would like to avoid accessing members of a different class.

541

Needed somewhere, but it was quite a while ago. If ppl need to know where I can check.

932

I'm not sure about your isl assumption but in any case, I'm not assuming anything here. The different input locations are purely for user experience (as the accesses are ordered in a more natural fashion).

1330

We do not recompute here and later we do not always recompute to simplify scalars, thus I think it actually fits pretty well.

lib/Analysis/ScopInfo.cpp
667

Sure we can add an assertion.

673

Mh, there was no need but we can do that too I guess.

1420

This code (already in the current HEAD) is ugly because it tries to be smart. While it heavily depends on the order it might need a lot more iterations if it would not. I don't argue we have to keep it but only that it works in Polly HEAD for now and can be takled afterwards, if we find a nice and comparable way to do it.

2778

That was what I was telling you in SJ about. I even pushed the first patch that implements this TODO to reviews and it depends on this one.

2783

I'm not sure how to answer to that as I expected the link to the function that containts the comment for this one to be self-explanatory.

2788

I'm not sure why you ask this but as you can probably see in the code no "PHI access" is processed. Though, if something that is a PHI does not introduce a "PHI access" it might be processed here.

2800

I'm not sure what you mean. Why do we look at each access separatly or why do we look at the non-trivial-operands of each access separalty?

2817

Mh, if we have a test case I'll check that.

2876

Might be, but it won't help with removing the scalar dependences. Actually, it might make them worse.

2880

Outside operands are the read only scalars we track to easy OpenMP code generation. The scalar access we recompute is removed later.

2885

No read access is/should be removed if it is still connected to anything else. Hence, your senario should never happen. Additionally, we actually iterate over accesses here, thus the actual removal happens only at the end (after we traversed all accesses).

2890

It depends on how you see it. I think the (virtual) register is a unique address for a scalar, thus it fits well here.

2894

We copy/move the entire operand tree but for that I do not need to traverse it and put all instructions into a list. I simply store the root of the tree and traverse it during the actual copy/move.

2932

Actually, it has already been removed, hence the statement but no read access.

2940

It could be but doesn't need to.

2947

No it is the same but for the write accesses instead of the reads. Or to be more precise, it is the passive version as writes are (passively) resolved if all their reads have been (actively) resolved.

2963

It does not skip them but stops as they are non-trivial operands.

lib/CodeGen/BlockGenerators.cpp
67

If it is hoisted.

86

After recomputeDependentScalars is executed they are in BBMap. Though our synthesize code sometimes thinks it can sythesize something while it actually referes to old values we have to recompute first while we are in recomputeDependentScalars.

130

Good question. I don't remember anymore. I'll check once the general patch is agreed on.

147

Seems fine to me.

169

If we could not generate a new operand but we need to because the old one is an instruction in the region we have to do something.

171

Shouldn't this been handled by recomputeDependentScalars() already? How do we know that the operand is even a DependentScalar?

This code is only triggered while we execute recomputeDependentScalars.

If this is because DependendScalars is unordered, wouldn't it be cleaner to ensure they are in their order of dependence?

Which order of dependence? We got a DAG here and have to code generate it. One way is to find some topological order on the whole dag and then generate code, the other (which is implemented) is a recursive demand driven code generation. As we do the second everywhere else I figured it makes sense to do it here too.

Meinersbur added inline comments.Dec 1 2015, 3:37 AM
include/polly/ScopInfo.h
1291

What is a "transitive operand"?

lib/Analysis/ScopInfo.cpp
2800

This is nested inside a "for (MemoryAccess *MA : Stmt)" loop and NonTrivialOperandsMap is indexed by its AccessInst.

  • If an instruction has multiple write access (AFAIK not possible at the moment), the code below would run multiple times over the same AccessInst.
  • Multiple instructions in the same Stmt might use the same value, thus its operand tree appears multiple times in this statement. At code generation, each of them would get its own copy of the operand tree, although only one per stmt is necessary (the same tree can be reused for both users)
jdoerfert marked 75 inline comments as done.Feb 16 2016, 4:39 AM
jdoerfert updated this revision to Diff 48064.Feb 16 2016, 4:41 AM

Rebase and adjustment according to comments

jdoerfert removed a subscriber: etherzhhb.
etherzhhb added inline comments.Feb 16 2016, 7:28 AM
include/polly/ScopInfo.h
946

Does "Caused by the instruction of MA" mean the user of instruction of MA?

1190

SE is a member of Scop, no need to pass it as a parameter.

I think it is reasonable to carry it around because we store SCEVs with Scop, e.g. Parameters.

1275

It would be more idomatic if each call would check only a single Instruction instead of passing a set implementation.

+1

1277

why we need they in SCoP Users?

1291
a = b + c
d = a + e

I guess b, c and hopefully a are the transitive operands of d.

They are also the leaves of the operand DAG rooted on d.

When we recompute d, the recomputation stop at b, c, e, right? That is, all instructions inside the DAG is recomputed.

So NoTrivialInstructionSet and OutsideOperandsSet together are the boundaries/leaves of the recompuation DAG, and NoTrivialInstructionSet are the instructions inside the Scop while OutsideOperandsSet are the instructions outside the Scop, right?

1295

I read:

<NoTrivialInstructions (first of NonTrivialOperandsPairTy )>   <OutsideOperandsSetTy (second of NonTrivialOperandsPairTy)>
                                   \                                       /
                                                 Recomputable DAG
                                                        | (root)
                                                 Instruction (the key of NonTrivialOperandsMap)

Actually can we check if an operand is outside Scop on the fly, such that only 1 operand set is need?

lib/Analysis/ScopInfo.cpp
2509

SE is a member of Scop

2790

// Now AccessInst is a write scalar access (and not a PHI).

2805–2830

How about using the following approach, which is all over LLVM codebase, and we can create a function for it.

// Depth-first traverse the operand DAG of AccessInst, until we reach:
//  * An instruction that is defined outside the current Scop (OutsideOperands )
//  * An instruction with side-effect (SideEffectOperands)
std::set Visited;
SmallVector<std::pair<Instruction*, Instruction::op_iterator> > VisitStack;
VisitStack.push_back(std::make_pair(AccessInst, AccessInst->begin()));

while (VisitStack.empty()) {
  auto CurNode = VisitStack.back().first;
  auto &CurChildIt = VisitStack.back().second;

  if (CurChildIt == CurNode->end()) {
    VisitStack.pop_back();
    continue;
  }

  Instruction *ChildInst = dyn_cast<Instruction>(*CurChildIt);
  ++CurChildIt;

  // Stop at invariances
  if (ChildInst == nullptr)
    continue;
  
  if (!Visited.insert(Inst).second)
    continue;

  // Invariance again.
  if (!R.contains(ChildInst)) {
    OutsideOperands.push_back(std::make_pair(InstOpInst, Inst));
    continue;
  }

  if (Inst->mayHaveSideEffects() || Inst->mayReadFromMemory()) {
    SideEffectOperands.insert(Inst);
    continue;
  }

  if (!isa<PHINode>(Inst))
    continue;

  if (R.contains(Inst) && canSynthesize(Inst, &LI, &SE, &R))
    continue;

  SideEffectOperands.insert(Inst);
}
2884

This user is actually inside the Scop, right?

2885

So what we need is only the parent block of the user that is inside the Scop?

2890

I read:

if (Stmt.containValueReadOf(OutsideOperand))
lib/CodeGen/BlockGenerators.cpp
172

If copyInstruction returns the newly copied instruction, the code

copyInstruction(Stmt, OldOperandInst, BBMap, LTS, nullptr, Recompute);
NewOperand = BBMap[OldOperand];

Become:

NewOperand = copyInstruction(Stmt, OldOperandInst, BBMap, LTS, nullptr, Recompute);

Which looks better and easier to understand.

But this can be done in another patch.

etherzhhb edited edge metadata.Feb 17 2016, 7:55 AM

See inline comments, otherwise I do not see any problem

jdoerfert updated this revision to Diff 48642.Feb 21 2016, 3:56 PM
jdoerfert marked 6 inline comments as done.
jdoerfert edited edge metadata.

Updated according to Hongbin's comments. Unified all non-trivial operands.

I addressed all of the comments except one (the new algorithm to collect non-trivial operands).

@etherzhhb, @Meinersbur, @grosser, @sebpop
Since this patch caused so much discussion I guess it would be nice to get two ppl to tell me it's fine to commit.

include/polly/ScopInfo.h
946

I created two functions now. One to remove all accesses caused by an instruction and one to remove exactly the accesses passed as arguments.

1277

To find the access in the SCoP (if there is one). Only knowing the outside operand doesn't suffice to identify the MemoryAccess objects that we already created because of this outside operand.

1291

I guess b, c and hopefully a are the transitive operands of d.

In the code above operands would mean {b,c,e} for the instruction d.

They are also the leaves of the operand DAG rooted on d.

True, but whit PHI nodes we might not have a DAG and the PHIs are not necessarily leaves.

When we recompute d, the recomputation stop at b, c, e, right? That is, all instructions inside the DAG is recomputed.

Yes. And since we do not recompute loop carried PHIs here we always have this nice DAG.

So NoTrivialInstructionSet and OutsideOperandsSet together are the boundaries/leaves of the recompuation DAG, and NoTrivialInstructionSet are the instructions inside the Scop while OutsideOperandsSet are the instructions outside the Scop, right?

Yes, except the PHI case where we do not have a DAG and even though PHI are not leaves we put them in the NoTrivialInstructionSet.

1295

Good point, I adjusted the code to use only one set of non-trivial operands and then compute the outside operands + inside users on the fly. Thanks!

lib/Analysis/ScopInfo.cpp
2790

Added a comment.

2805–2830

I don't get the problem you try to solve here. Why do we need a different algorithm and what is it that it does better?

2884

Renamed.

2885

True. I changed this code to use only one set of operands and compute everything else on the fly.

2890

Yes. Is that a problem?

lib/CodeGen/BlockGenerators.cpp
172

We could do that later, true.

etherzhhb added inline comments.Feb 22 2016, 6:57 AM
include/polly/ScopInfo.h
1291

Just to confirm, we do not recompute loop-carried PHIs, right?

lib/Analysis/ScopInfo.cpp
2805–2830

It is almost the same algorithm, but in different 'style'. I think I should not push too hard about the "style". So I think we are done with this.

But we should put some comments to describe what the algorithm is doing like:

// Traverse the operand DAG of AccessInst, until we reach:
//  * An instruction that is defined outside the current Scop (OutsideOperands )
//  * An instruction with side-effect (SideEffectOperands)
2890

It is a little bit not straight forward, but with the comment above, I think it is ok.

lib/CodeGen/BlockGenerators.cpp
143–147

TryOnly is only used here. And this block is equivalent to:

assert(TryOnly && "Unexpected scalar dependence in region!");
return nullptr;

Instead of passing TryOnly as function parameter, can we move the assertion out of this function, like:

auto *V = getNewValue(....);
assert((V || TryOnly) && "Unexpected scalar dependence in region!");

Now we can remove this confusing TryOnly.

334

First, by construction we can place it here as all recomputable scalars do not depend on anything we cannot recompute. Second, we can probably move it after the scalar loads generation but it should be all the same.

It would be good to put your above explanations as comments to the code.

1071–1072

Interesting duplication :)
We should also add some comments to explain why we recomputeDependentScalar before generateScalarLoads here as well.

etherzhhb requested changes to this revision.Feb 22 2016, 7:03 AM
etherzhhb edited edge metadata.
This revision now requires changes to proceed.Feb 22 2016, 7:03 AM
jdoerfert marked an inline comment as done.Feb 22 2016, 11:30 AM

etherzhhb requested changes to this revision
This revision now requires changes to proceed.

Do you want other changes than the missing comments and the TryOnly thing? Both seem to be minor but the status change seems to indicate a bigger problem with this patch.

include/polly/ScopInfo.h
1291

Not in this patch (or the following three that are more or less rdy), however we are working on loop-carried PHIs too.

lib/Analysis/ScopInfo.cpp
2805–2830

I can do that.

lib/CodeGen/BlockGenerators.cpp
334

I can add a comment. sure.

1071–1072

I am confused. Are you hinting on the fact that we call "recomputeDependentScalars" here too or that we call it again before "generateScalarLoads"? I talked about the order already in the comment above. And the duplication of the "recomputeDependentScalars" call has the reason as the duplication of the "generateScalarLoads" call, this is the starting point of a "copyStmt" function. While above it is for block statements, here it is for region statements. If you want I can extract both calls into a "initializeScalars" method that is called before a statement is copied.

I change the status to make this patch not block on me. There is no major issue.

etherzhhb added inline comments.Feb 22 2016, 2:14 PM
include/polly/ScopInfo.h
1291

ok

lib/CodeGen/BlockGenerators.cpp
1071–1072

I just think the duplicate here is interesting/funny ...

My intention is to suggest that we put the same comments as 344 here to explain the order, otherwise people may confused when they see this before they see the comments in line 344.

If you want I can extract both calls into a "initializeScalars" method that is called before a statement is copied.

This is great but can be done in another patch.

etherzhhb accepted this revision.Feb 26 2016, 3:17 AM
etherzhhb edited edge metadata.

I change the status to make this patch not block on me. There is no major issue.

I think this patch is good to go once the comments are fixed.

This revision is now accepted and ready to land.Feb 26 2016, 3:17 AM

I still want to have a look at this, but did not make it. I am a little
delayed on the larger patches, but first will need to have a look at
Johannes SSA patch.

Best,
Tobias

jdoerfert abandoned this revision.Mar 28 2019, 5:02 PM