This is an archive of the discontinued LLVM Phabricator instance.

[LoopDeletion] Insert an early exit from dead path in loop
Needs ReviewPublic

Authored by jonpa on Dec 22 2020, 3:32 PM.

Details

Summary

This is applied on top of https://reviews.llvm.org/D86844 "[LoopDeletion] Allows deletion of possibly infinite side-effect free loops".

This patch handles the case where the whole loop is not dead, but only a certain constant path through it. If this path is entered, the loop is in fact dead and an early exit could be made.

Before finalizing the patch by adding tests and also handling the new PM, I would like to get some feedback as to if this is looking to be the right approach. I have seen that this handles the omnetpp function printAddressTable() the same way as GCC does, which gives a nice improvement on the benchmark (SystemZ, output disabled).

Diff Detail

Event Timeline

jonpa created this revision.Dec 22 2020, 3:32 PM
jonpa requested review of this revision.Dec 22 2020, 3:32 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2020, 3:32 PM

The idea is that we have

H:
  %c = ... ; invariant wrt H and L
  br %c, L, B
B: 
  side_effects
  br L
L:
  br %x, H, Exit

Exit:
  ...

right?

One of my problems is that this is too tied to the syntax.
Another is that we are not deleting a loop. The former can be addressed later I guess.

Isn't this more related to LoopUnswitch?
We want to "unswitch the paths that are side-effect free", or something like that.

We could also make it generic by collecting the blocks that are allowed on such a path
in order to allow different CFGs. But that can be done later as well.

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
187

Nit: Unrelated and misses documentation. A useful helper though, could be added to D86844 directly (if you don't mind).

556

that reminds me, we really need to make llvm.assume side-effect free... D89054

jonpa added a comment.Dec 22 2020, 4:18 PM

The idea is that we have

H:
  %c = ... ; invariant wrt H and L
  br %c, L, B
B: 
  side_effects
  br L
L:
  br %x, H, Exit

Exit:
  ...

right?

Yes, exactly (could be more blocks than 3, though, of course).

One of my problems is that this is too tied to the syntax.

How do you mean, exactly?

Another is that we are not deleting a loop. The former can be addressed later I guess.

Isn't this more related to LoopUnswitch?
We want to "unswitch the paths that are side-effect free", or something like that.

I thought that if we cannot delete the whole loop we can effectively delete a part of it with an early exit.

LoopUnswitching only works if the condition is loop invariant currently as far as I can see, so I am not sure how simple that would be...

We could also make it generic by collecting the blocks that are allowed on such a path
in order to allow different CFGs. But that can be done later as well.

Might be worth a try...

The idea is that we have

H:
  %c = ... ; invariant wrt H and L
  br %c, L, B
B: 
  side_effects
  br L
L:
  br %x, H, Exit

Exit:
  ...

right?

Yes, exactly (could be more blocks than 3, though, of course).

One of my problems is that this is too tied to the syntax.

How do you mean, exactly?

H:
  %c1 = ... ; invariant wrt H, LazyEval, and L
  br %c1, LazyEval, B
LazyEval:  
  %c2 = ... ; invariant wrt H, LazyEval, and L
  br %c2, L, B
B: 
  side_effects
  br L
L:
  br %x, H, Exit
Exit:
   ...

I mean the above breaks the matching even though the same logic applies. I would argue that is not what we want.
(And given that %c1 and %c2 can have loads, we might not be able to undo short circuit evaluation here to reduce
it to your initial pattern.)

Isn't this more related to LoopUnswitch?
We want to "unswitch the paths that are side-effect free", or something like that.

I thought that if we cannot delete the whole loop we can effectively delete a part of it with an early exit.

We delete parts, agreed, but not a loop. Unsure if this is the right place conceptually.

LoopUnswitching only works if the condition is loop invariant currently as far as I can see, so I am not sure how simple that would be...

With that argumentation you could say LoopDeletion doesn't do this either (right now) ;)
I mean, you can simply copy this code there and it should work, right?

We could also make it generic by collecting the blocks that are allowed on such a path
in order to allow different CFGs. But that can be done later as well.

Might be worth a try...

FWIW, I my goal is to make it something driven by a generic analysis, not another pattern we match.

reames added a subscriber: reames.Dec 22 2020, 8:32 PM

As implemented, this is a bit too specific to one particular pattern to be commitable, but I think you've got the seed of a good idea here.

I think you can generalize this as follows:

  • If all conditions contributing to control flow along the path from header to exit block are loop invariant (use SCEVs definition), then the exit is either taken or not taken equally on all iterations.
  • If all exit blocks meet the previous criteria, then the loop must either execute once or be infinite. If we can prove it's not infinite, it must execute once.
  • For a loop which executes exactly once, the backedge is dead. We can break the backedge, leave the rest of the loop unchanged, and get the result you're looking for.

As a special case of the above (which is probably all we should implement), if all loop exits dominate the backedge, SCEV will be able to compute the backedge taken count as zero. You can simply check the backedge taken count and break the backedge.

Note that my phrasing doesn't involve reasoning about side effects.

fhahn added a comment.Dec 23 2020, 7:42 AM

Thanks for sharing the patch Jonas!

Isn't this more related to LoopUnswitch?
We want to "unswitch the paths that are side-effect free", or something like that.

I thought that if we cannot delete the whole loop we can effectively delete a part of it with an early exit.

We delete parts, agreed, but not a loop. Unsure if this is the right place conceptually.

LoopUnswitching only works if the condition is loop invariant currently as far as I can see, so I am not sure how simple that would be...

As mentioned already, I think framing this as extension to unswitching could be interesting. In a way, it is some kind of 'partial' unswitching, where only the condition of the no-sideeffect version becomes known after unswitching.

I put together an early version of that idea in D93764. It focuses primarily on conditions as in omnetpp's MACRelayUnitBase::printAddressTable, where the condition is a load of a loaded address and one the path from the false successor to header/exits has no side effects. It still an early version(not ready for review yet) which needs more tests and comments, but from looking at the impact on the test-suite (increase in number of branches unswitched), it seems that this could be viable in general.

Note that my phrasing doesn't involve reasoning about side effects.

I think the side effect part is still needed (at some point) as the use case they want to tackle has loads in the condition that might be modified by the "main body" (which is potentially never executed).

jonpa updated this revision to Diff 314473.Jan 4 2021, 4:08 PM

With the motto of pushing things forward even if only by aiding the other related patches, I have continued to improve my patch to use as some kind of baseline for "early exit" insertions. Perhaps it can be used during development of the partial loop-unswitching to find cases to handle, or perhaps it could be used for some cases if it would reduce the burden on the other algorithm. It would be very nice if partial unswitching could handle all this instead, of course :-)

Instead of just handling the header->latch edge, now any edge from the "Header region" to the "Latch region" can be handled. The requirements are that in the header region, all conditions must be loop invariant, and there can only be one reachable exit block in the latch region from the branch target block back to header. If there is no exit block on the dead path but the loop has a unique exit block, that exit block is used for the early exit, given the mustprogress attribute.

Interestingly enough, there are now cases where this patch manages to "delete" (or "eliminate") a loop which has multiple exits. The loop structure is removed while the BBs remain, which should hopefullyl be removed later by CFG-opt. I think this is rare on benchmarks though... (see below)

I found out quickly that perhaps the hardest part of this was to update datastructures after changing the CFG in a loop... I have barely been able to build the benchmarks as it is, so I am in need of some good advice on how to update things after a loop change, for this patch to be usable. Currently I have changed things temporarily so that LI, DT, etc are recomputed after loop deletion (BTW, I found that only recomputing those analyses on master after LoopDeletion was not NFC, which surprised me... Is that a bug or expected with the aim to save compile time?)

Statistics on SPEC-17 on SystemZ:

master (patched to not preserve analyses after LoopDeletion for a fair comparison):

      7276                    loop-delete - Number of loops deleted
Only Header/Latch (like first simple version aimed to do):
      1414                    loop-delete - Number of early exits inserted
      7498                    loop-delete - Number of loops deleted
         8                    loop-delete - Number of loops eliminated (no remaining blocks)
       462                    loop-delete - Number of skipped loops (SCEV 0)
Top/Bot regions (current patch):
      2353                    loop-delete - Number of early exits inserted
      7397                    loop-delete - Number of loops deleted
         6                    loop-delete - Number of loops eliminated (no remaining blocks)
       462                    loop-delete - Number of skipped loops (SCEV 0)

A rise from ~1400 to ~2350 is a fairly nice improvement in number of early exits inserted compared to initial patch. There seem to be some difference also in number of loops deleted as reported by deleteLoopIfDead(), which I have not investigated. It seems that somehow must relate to a parent loop now not having a subloop..?

@reames: I am not quite sure how the SE->isZero() case directly relates to the detection of dead paths per my patch... I have here tried a statistic for that case, which showed that some loops could be handled by your patch, while there are still many more that can not... (see above).

@jdoerfert:

FWIW, I my goal is to make it something driven by a generic analysis, not another pattern we match.

I agree with that, I have now made the patch more general.

fhahn added a comment.Jan 8 2021, 6:54 AM

I found out quickly that perhaps the hardest part of this was to update datastructures after changing the CFG in a loop... I have barely been able to build the benchmarks as it is, so I am in need of some good advice on how to update things after a loop change, for this patch to be usable. Currently I have changed things temporarily so that LI, DT, etc are recomputed after loop deletion (BTW, I found that only recomputing those analyses on master after LoopDeletion was not NFC, which surprised me... Is that a bug or expected with the aim to save compile time?)

Statistics on SPEC-17 on SystemZ:

Thanks for the update Jonas! It looks like the patch includes some required changes that still landed (D86844), which might impact the number of removed loops. It might be good to re-collect the statistics. I tried to collect stats with this patch on SPEC2000/SPEC2006/MultiSource for X86 with LTO, but unfortunately there have been a few crashes.

@jdoerfert:

FWIW, I my goal is to make it something driven by a generic analysis, not another pattern we match.

I agree with that, I have now made the patch more general.

I think both this patch and D93764 require very similar analysis to find 'no-op' paths through loops (with the difference that partial unswitching can allow stores to memory that does not clobber the condition). Do you think it would be worth unifying the analysis code?

jonpa added a comment.Jan 8 2021, 12:29 PM

I found out quickly that perhaps the hardest part of this was to update datastructures after changing the CFG in a loop... I have barely been able to build the benchmarks as it is, so I am in need of some good advice on how to update things after a loop change, for this patch to be usable. Currently I have changed things temporarily so that LI, DT, etc are recomputed after loop deletion (BTW, I found that only recomputing those analyses on master after LoopDeletion was not NFC, which surprised me... Is that a bug or expected with the aim to save compile time?)

Statistics on SPEC-17 on SystemZ:

Thanks for the update Jonas! It looks like the patch includes some required changes that still landed (D86844), which might impact the number of removed loops. It might be good to re-collect the statistics. I tried to collect stats with this patch on SPEC2000/SPEC2006/MultiSource for X86 with LTO, but unfortunately there have been a few crashes.

Sorry to hear about the crashes, hopefully it can all be fixed... I rebased the patch and did a new run:

master (patched to not preserve analyses after LoopDeletion for a fair comparison):

      7557                    loop-delete - Number of loops deleted
current patch:
      2352                    loop-delete - Number of early exits inserted
      7420                    loop-delete - Number of loops deleted
         6                    loop-delete - Number of loops eliminated (no remaining blocks)
       439                    loop-delete - Number of skipped loops (SCEV 0)

@jdoerfert:

FWIW, I my goal is to make it something driven by a generic analysis, not another pattern we match.

I agree with that, I have now made the patch more general.

I think both this patch and D93764 require very similar analysis to find 'no-op' paths through loops (with the difference that partial unswitching can allow stores to memory that does not clobber the condition). Do you think it would be worth unifying the analysis code?

To me it depends on what your goal is with the partial unswitching - do you mean to extend your patch to include multiple partially invariant conditions in the header region? (Or is that already a side-effect of revisiting the new loop? If so, there is no need to compute the "Top-region" like this patch does.). I guess we could perhaps have a common function that finds a unique exit block from SuccBB back to header and collect the memory references as well on that path...

Do you think partial unswitching will be able eventually to handle all the early-exit cases? I haven't tried this patch yet on top of your patch, but that will be interesting :-)

jonpa updated this revision to Diff 315487.Jan 8 2021, 12:32 PM

patch rebased

jdoerfert added inline comments.Jan 8 2021, 2:33 PM
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
118

DriveBy:

This hurts. I know it was there before but auto &I is BasicBlock *... argh.
Could we please change that if we commit this to a auto * at least with a sensible variable name.

jonpa added inline comments.Jan 9 2021, 1:42 PM
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
118

I think it's better in that case perhaps to just commit such an NFC change separately beforehand?

What is left after we merged the loop unswitch solution?

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
118

Yeah, stuff like that can be committed directly as NFC.

What is left after we merged the loop unswitch solution?

I realized that the for the SECP2017 version, loop-unswitching does not happen by default, due to cost-modeling. (It does happen for the SPEC2006 version). So I tried to extend the logic to check if the candidate path is a no-op: D95468. That should also handle the SPEC2017 case.

jonpa updated this revision to Diff 319447.Jan 26 2021, 5:38 PM

Patch rebased.

What is left after we merged the loop unswitch solution?

I did a rerun today on top of 302432f, which should include both Florians and Philips patches:

trunk:
      8201                    loop-delete - Number of loops deleted
patch:

      2624                    loop-delete - Number of early exits inserted
      8006                    loop-delete - Number of loops deleted
         6                    loop-delete - Number of loops eliminated (no remaining blocks)
       279                    loop-delete - Number of skipped loops (SCEV 0)
patch + D95468

      2617                    loop-delete - Number of early exits inserted
      7798                    loop-delete - Number of loops deleted
         6                    loop-delete - Number of loops eliminated (no remaining blocks)
       279                    loop-delete - Number of skipped loops (SCEV 0)

I realized that the for the SECP2017 version, loop-unswitching does not happen by default, due to cost-modeling. (It does happen for the SPEC2006 version). So I tried to extend the logic to check if the candidate path is a no-op: D95468. That should also handle the SPEC2017 case.

I may be doing something wrong, but D95468 did not help very much looking at these numbers it seems...

I may be doing something wrong, but D95468 did not help very much looking at these numbers it seems...

Maybe outer loops have been skipped and therefore you avoided duplication of outer and inner loops (with D95468). The statistics we have are too coarse grained to exactly pinpoint what happened.

jonpa added a comment.Jan 28 2021, 3:56 PM

I may be doing something wrong, but D95468 did not help very much looking at these numbers it seems...

Maybe outer loops have been skipped and therefore you avoided duplication of outer and inner loops (with D95468). The statistics we have are too coarse grained to exactly pinpoint what happened.

I wonder what loops are those which where "partially unrolled", and which were not. Do the partially unrolled ones get some recognizable name maybe in the header blocks?

I may be doing something wrong, but D95468 did not help very much looking at these numbers it seems...

Maybe outer loops have been skipped and therefore you avoided duplication of outer and inner loops (with D95468). The statistics we have are too coarse grained to exactly pinpoint what happened.

I wonder what loops are those which where "partially unrolled", and which were not. Do the partially unrolled ones get some recognizable name maybe in the header blocks?

Do you mean partially unswitched or partially unrolled? I don't think so unfortunately. But I think most cases where already caught with D95468 and only in a few cases we now skip duplication. Also note that D95468 adds the shortcut outside the loop, so loop-deletion will still insert an early exit as before (long term I think we could also just turn the branch in the loop into an early exit).

jonpa updated this revision to Diff 329788.Mar 10 2021, 3:01 PM

Patch updated to also run with the new pass manager.

This now gives a ~9% improvement on Omnetpp - just like it was with the legacy pass manager (before but not after Florians patch, I believe).

@fhahn : are you going to port your improvement to loop unswitching to newpm?

Patch updated to also run with the new pass manager.

This now gives a ~9% improvement on Omnetpp - just like it was with the legacy pass manager (before but not after Florians patch, I believe).

@fhahn : are you going to port your improvement to loop unswitching to newpm?

Yes I am planning to, but we are seeing several other regressions with the new pass manager that I'll probably need to investigate first. So if anyone wants to port the loop-unswitching changes before I get a chance, I'd be more than happy.

jonpa updated this revision to Diff 375210.Sep 27 2021, 5:25 AM

I again saw a regression on omnetpp against gcc so I decided to revisit this patch as it previously handled that benchmar. In doing so I have updated/revisited with some more comments as well. It now only runs on the new pass manager.

It seems however that this time it was not the loop in PrintAddressTable that was the issue, so I still can't say what the omnetpp regression is about this time. However, I reran SPEC-17 and found that a few benchmarks seemed to improve just slighty (~1%). Statistics report that still ~1400 loops get optimized with this patch (~2400 edges redirected). So at least in theory, this patch might still be of interest, even if just to give hints on missed optimizations by other loop passes.

I checked if the extra work of finding Top/Bot regions and handling multiple exits where still worthwhile (compared to a very simple approach). I found that the Top/Bot search about doubled the effectiveness (relative just checking edges from Header to Latch), where the multiple exit loops were about 10% of the cases (I now made multiple-exit handling fall under an experimental option "-early-exit-extra".).

f510.parest_r seemed to improve ~1.5% on both z14 and z15, and I made a reduced test case for one of the files changed (sparsity_pattern.ii, picked randomly). I see three less BBs in the output and an outer loop removed.

opt -mtriple=systemz-unknown -march=z14 -O3 ./tc_earlex.ll -debug-only=loop-delete

Analyzing Loop for deletion: Loop at depth 1 containing: %bb5.preheader<header>,%bb8,%bb1.loopexit<latch><exiting>,%bb8.preheader,%bb1.loopexit.loopexit
    Loop at depth 2 containing: %bb8<header><latch><exiting>
Loop is not invariant, cannot delete.
Trying to insert early exits:
Top region: bb5.preheader, bb8.preheader, 
Bot region: bb1.loopexit, bb1.loopexit.loopexit, 
Inserting early exit in bb5.preheader:
  br i1 %i7.not3, label %bb1.loopexit, label %bb8.preheader
  =>
  br i1 %i7.not3, label %bb9.loopexit, label %bb8.preheader

Thoughts anyone? Maybe somebody would like to try the patch on some other platform?

jonpa added a comment.May 27 2022, 2:43 AM

@fhahn : I reran this today and found that there are no real benchmark performance improvements anymore. However, there are still +2000 early exits insertions reported, It seems you have managed to handle the important cases, so I guess this still isn't really interesting. It would be however interesting to hear your thoughts on this: I remember you did a handling for this and I guess you must have covered the important cases already..?

Herald added a project: Restricted Project. · View Herald TranscriptMay 27 2022, 2:43 AM
jonpa added a comment.Sep 8 2022, 10:02 AM

I have derived two reduced test cases where this patch improve the loop branching from SPEC. These were the first two files I looked at out of many, and it seems that in both cases LoopDeletion fails in deleting the loop as it is found to be not variant. In both of these cases it was a matter of breaking the outer loop if the inner loop was never visited.

On SPEC, I still see 2185 of these early exits inserted. This is kind of interesting, although it may be that this is not going to improve any benchmarks. Maybe it relates to outer loops with inner loops that always are executed. Maybe the overhead of the outer loop isn't that great in other cases...

// Derived from gcc / tree-vect-slp.c                                                                                                                                                                                                                                             

int VEC_gimple_base_iterate_vec_;

void build_vector();

unsigned VEC_gimple_base_length();

int VEC_gimple_base_iterate() {
  if (VEC_gimple_base_iterate_vec_)
    return 1;
  return 0;
}

void vect_get_constant_vectors() {
  int j = VEC_gimple_base_length();
  for (; j; j++)                            **// No Loop deletion, but early exit if inner loop never visited.  **                                                                                                                                                                                              
    for (; VEC_gimple_base_iterate();)
      build_vector();
}

clang -c -o tree-vect-slp.s -S -O3 -march=arch13  tree-vect-slp.i -w -mllvm -debug-only=loop-delete

Analyzing Loop for deletion: Loop at depth 1 containing: %for.cond1.preheader<header>,%for.body4,%for.inc<latch><exiting>,%for.body4.preheader,%for.inc.loopexit
    Loop at depth 2 containing: %for.body4<header><latch><exiting>
Loop is not invariant, cannot delete.
Trying to insert early exits:
Top region: for.cond1.preheader, for.body4.preheader, 
Bot region: for.inc, for.inc.loopexit, 
Inserting early exit in for.cond1.preheader:
  br i1 %tobool.not.i.not7, label %for.inc, label %for.body4.preheader
  =>
  br i1 %tobool.not.i.not7, label %for.end5.loopexit10, label %for.body4.preheader
// Derived from cactus / FlatBoundary.c                                                                                                                                                                                                                                           

void memcpy();
int CCTK_GroupDimI();

typedef struct {
  int *cctk_lsh
} cGH;

int Glob_A, Glob_B, Glob_C;

void BndFlatDirVIApplyBndFlat(cGH *GH) {
  int i = 0, j = 0, k = 0, ash[3] = {0, 0, 0}, lsh[3] = {0, 0, 0};
  int vtypesize = CCTK_GroupDimI();

  for (; Glob_A;) {
    for (i = 0; i < Glob_B; i++)
      ash[i] = lsh[i] = GH->cctk_lsh[i];

    for (k = 0; k < Glob_C; k++)
      for (; j < 1;)
        for (; lsh[0];)
          ;

    for (; k < 100000; k++) {               // No Loop deletion, but early exit if inner loop(s) never visited.                                                                                                                                                                               
      for (j = 0; j < lsh[1]; j++) {
        for (i = 0; i < lsh[0]; i++) {
          int _index_to = ash[0] * ash[1] * (k - 1) * vtypesize;
          memcpy(GH + _index_to);
        }
      }
    }
  }

}



Analyzing Loop for deletion: Loop at depth 2 containing: %for.cond25.preheader.us<header>,%for.cond29.preheader.us.us,%for.body32.us.us,%for.cond29.for.inc40_crit_edge.us.us,%for.cond25.for.inc43_crit_edge.us<latch><exiting>,%for.cond29.preheader.us.us.preheader,%for.cond25.for.inc43_crit_edge.us.loopexit
    Loop at depth 3 containing: %for.cond29.preheader.us.us<header>,%for.body32.us.us,%for.cond29.for.inc40_crit_edge.us.us<latch><exiting>
        Loop at depth 4 containing: %for.body32.us.us<header><latch><exiting>
Loop is not invariant, cannot delete.
Trying to insert early exits:
Top region: for.cond25.preheader.us, for.cond29.preheader.us.us.preheader, 
Bot region: for.cond25.for.inc43_crit_edge.us, for.cond25.for.inc43_crit_edge.us.loopexit, 
Inserting early exit in for.cond25.preheader.us:
  br i1 %cmp3165, label %for.cond29.preheader.us.us.preheader, label %for.cond25.for.inc43_crit_edge.us
  =>
  br i1 %cmp3165, label %for.cond29.preheader.us.us.preheader, label %for.cond.loopexit.loopexit
jonpa added a comment.Nov 22 2022, 2:47 PM

@fhahn Any comments on the C test cases I derived..?

jonpa added a comment.Jun 7 2023, 5:54 AM

ping

@fhahn : I am still curious why this is not an improvement. I realize that you have your reasoning that makes this patch less likely beneficial, and as I spent some time on it, I would be very happy to also understand why :-)