Page MenuHomePhabricator

[LoopUnswitch] Implement first version of partial unswitching.
AcceptedPublic

Authored by fhahn on Wed, Dec 23, 7:25 AM.

Details

Summary

This patch applies the idea from D93734 to LoopUnswitch.

It adds support for unswitching on conditions that are only
invariant along certain paths through a loop.

In particular, it targets conditions in the loop header that
depend on values loaded from memory. If either path from
the true or false successor through the loop does not modify
memory, perform partial loop unswitching.

That is, duplicate the instructions feeding the condition in the pre-header.
Then unswitch on the duplicated condition. The condition is now known
in the unswitched version for the 'invariant' path through the original loop.

On caveat of this approach is that one of the loops created can be partially
unswitched again. To avoid this behavior, llvm.loop.unswitch.partial.disable
metadata is added to the unswitched loops, to avoid subsequent partial
unswitching.

If that's the approach to go, I can move the code handling the metadata kind
into separate functions.

This increases the cases we unswitch quite a bit in SPEC2006/SPEC2000 &
MultiSource. It also allows us to eliminate a dead loop in SPEC2017's omnetpp

Tests: 236
Same hash: 170 (filtered out)
Remaining: 66
Metric: loop-unswitch.NumBranches

Program                                        base   patch  diff
 test-suite...000/255.vortex/255.vortex.test     2.00  23.00 1050.0%
 test-suite...T2006/401.bzip2/401.bzip2.test     7.00  55.00 685.7%
 test-suite :: External/Nurbs/nurbs.test         5.00  26.00 420.0%
 test-suite...s-C/unix-smail/unix-smail.test     1.00   3.00 200.0%
 test-suite.../Prolangs-C++/ocean/ocean.test     1.00   3.00 200.0%
 test-suite...tions/lambda-0.1.3/lambda.test     1.00   3.00 200.0%
 test-suite...yApps-C++/PENNANT/PENNANT.test     2.00   5.00 150.0%
 test-suite...marks/Ptrdist/yacr2/yacr2.test     1.00   2.00 100.0%
 test-suite...lications/viterbi/viterbi.test     1.00   2.00 100.0%
 test-suite...plications/d/make_dparser.test    12.00  24.00 100.0%
 test-suite...CFP2006/433.milc/433.milc.test    14.00  27.00 92.9%
 test-suite.../Applications/lemon/lemon.test     7.00  12.00 71.4%
 test-suite...ce/Applications/Burg/burg.test     6.00  10.00 66.7%
 test-suite...T2006/473.astar/473.astar.test    16.00  26.00 62.5%
 test-suite...marks/7zip/7zip-benchmark.test    78.00 121.00 55.1%

Diff Detail

Event Timeline

fhahn created this revision.Wed, Dec 23, 7:25 AM
fhahn requested review of this revision.Wed, Dec 23, 7:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptWed, Dec 23, 7:25 AM
fhahn updated this revision to Diff 313796.Sun, Dec 27, 11:34 AM

Add comments and tests

fhahn retitled this revision from [LoopUnswitch] Implement first version of partial unswitching. (WIP) to [LoopUnswitch] Implement first version of partial unswitching..Sun, Dec 27, 11:43 AM
fhahn edited the summary of this revision. (Show Details)
jonpa added a comment.Sun, Dec 27, 5:10 PM

This patch applies the idea from D93734 to LoopUnswitch.

One disadvantage here is that LoopUnswitch has a size-limitation on the loop it handles, which is not really true: If a no-side-effects loop is created and the intent is to later delete it, it should not be limited by the size heuristic.

On the other hand, I wonder if the idea here is to only isolate totally dead paths per this first draft or perhaps also paths where specifically the condition is invariant? This would be the advantage I can see compared to D93734: to increase unswitching where there may be side-effects that is known not to alias with the ToDuplicate instructions.

What about the original loop: If it is entered (instead of the new smaller loop your patch creates), it may still be true that the path where the condition becomes invariant is entered. So (at least theoretically) the full loop should in that case branch into the smaller loop. This would also eliminate the need to duplicate instructions in the preheader.

If we only want to achieve the "early exit" result, in other words that this patch will create a new loop with no side-effects (and then depend on later passes to remove it), it seems simpler to me to have this as a separate pass (or perhaps as a fix-up in LoopDeletion per D93734).

xbolva00 added a comment.EditedMon, Dec 28, 1:19 AM

GCC has separate pass for this - loopsplit. It can handle also general case like:

for (i = 0; i < 100; i++)
     {
       if (i < 50)
         A;
       else
         B;
     }
   into:
   for (i = 0; i < 50; i++)
     {
       A;
     }
   for (; i < 100; i++)
     {
       B;
     }

https://godbolt.org/z/zqsPY6

fhahn updated this revision to Diff 313906.Mon, Dec 28, 2:46 PM

Updated to use MemorySSA to allow non-clobbering stores in the 'invariant' part. There's a crash while updating MemorySSA, probably because of a new case of loop not accounted in the existing updating. Will fix tomorrow.

This patch applies the idea from D93734 to LoopUnswitch.

One disadvantage here is that LoopUnswitch has a size-limitation on the loop it handles, which is not really true: If a no-side-effects loop is created and the intent is to later delete it, it should not be limited by the size heuristic.

That is true, but the size of the overall loop is probably not too big in practice, if there is a no-op part. If we implement your suggestion below (to jump to the smaller one for the invariant case in the larger one), the size increase should be negligible. It would be interesting to check how many dead loops get missed by unswitching. If there are enough cases, it should be easy to share the detection logic between both passes and also use it in LoopDeletion.

On the other hand, I wonder if the idea here is to only isolate totally dead paths per this first draft or perhaps also paths where specifically the condition is invariant? This would be the advantage I can see compared to D93734: to increase unswitching where there may be side-effects that is known not to alias with the ToDuplicate instructions.

Yes, I think the main advantage of the 'partial' unswitching is that neither path needs to be dead, it just cannot clobber the condition we unswitch on. So for example, reductions would be fine or stores to locations that do not alias any loads feeding the condition. The original version of the patch did not yet implement that. But I updated the patch to use MemorySSA to also catch this case.

What about the original loop: If it is entered (instead of the new smaller loop your patch creates), it may still be true that the path where the condition becomes invariant is entered. So (at least theoretically) the full loop should in that case branch into the smaller loop. This would also eliminate the need to duplicate instructions in the preheader.

Yes, in some cases it would be possible to just jump to the smaller loop. Unfortunately this will require some extra checks and work to set up the continuation values in the smaller loop. This is certainly something interesting worth exploring, but probably not in the initial version.

If we only want to achieve the "early exit" result, in other words that this patch will create a new loop with no side-effects (and then depend on later passes to remove it), it seems simpler to me to have this as a separate pass (or perhaps as a fix-up in LoopDeletion per D93734).

The aim for this patch is to catch more cases in general, including cases that are not no-ops. The updated version supports memory ops that do not alias the loads feeding the condition.

fhahn added a comment.Mon, Dec 28, 2:50 PM

GCC has separate pass for this - loopsplit. It can handle also general case like:

for (i = 0; i < 100; i++)
     {
       if (i < 50)
         A;
       else
         B;
     }
   into:
   for (i = 0; i < 50; i++)
     {
       A;
     }
   for (; i < 100; i++)
     {
       B;
     }

https://godbolt.org/z/zqsPY6

I think this is a separate issue (the condition is dependent on the induction variable), but nevertheless interesting to pursue. IIRC there is already a pass that tries to do something similar, but I don't know which one of the top of my head. Will check.

This comment was removed by xbolva00.

I think this is a separate issue (the condition is dependent on the induction variable), but nevertheless interesting to pursue. IIRC there is already a pass that tries to do something similar, but I don't know which one of the top of my head. Will check.

This is InductiveRangeCheckElimination which implements a form of iteration set splitting. I'll warn that IRCE has been historically bug prone, and is not on by default upstream.

Overall, I think this is the right place for such logic. Some initial code comments.

TBH, I think it would be nicer to "search for the paths" instead. Though, a TODO might be sufficient for the start.
What I think would be nice if we do not only deal with the conditional in the header, still thinking of *l && *r.

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
675

.pop_back_val() ?

677

Does code generation handle the case where a loop invariant instruction is inside the loop? With the comment below, this might be changed to L->contains(I) instead.

680

"Later" we could allow anything that is speculatively executable and loads, right?

fhahn updated this revision to Diff 314219.Fri, Jan 1, 7:12 AM

Overall, I think this is the right place for such logic. Some initial code comments.

Thanks for taking a look!

TBH, I think it would be nicer to "search for the paths" instead. Though, a TODO might be sufficient for the start.

Do you mean not using MemorySSA initially? In a way, the current approach first collects all blocks on a path and then walks MemorySSA downwards to check all potential clobbers in blocks on the path. Instead of that, we could just iterate over all instructions in the blocks and bail out if any has side-effects, as in the original version.

The main advantage of the MemorySSA version is that it allows us to just check the potentially clobbering instructions, so this is mostly an optimization. If it makes things easier, that could be done as a separate patch?

What I think would be nice if we do not only deal with the conditional in the header, still thinking of *l && *r.

IIUC there are 2 things that could be improved: currently the condition must be an icmp, but this is mostly an artificial initial restriction that is easy to lift subsequently, e.g. to allow OR/AND instructions as conditions.

And we could also look for conditional branches in other blocks than the header. There very likely will be cases where this helps, but I am not sure how frequently it will trigger compared to the additional compile-time spent. Certainly something to explore further though.

fhahn updated this revision to Diff 314234.Fri, Jan 1, 2:53 PM
fhahn marked an inline comment as done.

Add a few additional test cases.

fhahn added a comment.Fri, Jan 1, 2:54 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

Loop-invariant instructions will be duplicated outside the loop at the moment (added a test). With 'handling them', do you mean updating the uses inside the loop to use the hoisted instruction? Not sure if that will help a lot in practice, as they should already have been moved out I think and/or GVN/LICM will clean them up.

680

yep I think so.

I refined one of my comments. I'm other than that OK with this. Let's wait for a few days and get a second opinion.

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

So, my problem was that I assumed something like X below might be considered "loop invariant". However, the API in llvm::Loop does simply perform a constains check anyway, which is what I suggested instead. That said, I think it is confusing to call isLoopInvariant here because what we want/need is contains. If someone later recognizes in isLoopInvariant that X is not loop variant, we would not put the add in the toDuplicate set and fail to create valid code, agreed? (Right now it would only show for GEP but it's the same story.)

int A[100];
int X, a = ..., b = ...;
for (...) {
  X = a + b;
  A[i] = X;
}
fhahn updated this revision to Diff 314300.Sun, Jan 3, 12:42 PM

rebased after precommitting tests in edb52c626b5340a5a42ed833fc776bc815507283

Also updated to use !L->contains() instead of isLoopInvariant()

fhahn added inline comments.Sun, Jan 3, 12:43 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

That said, I think it is confusing to call isLoopInvariant here because what we want/need is contains

Yeah, given that we only call it for instructions anyways, using contains seems clearer. Updated.

fhahn updated this revision to Diff 315804.Mon, Jan 11, 7:24 AM

Rebase & ping :)

Generally fine, I have one questions though

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
716

Nit: const auto &

748

Why do we need to check these uses?

fhahn updated this revision to Diff 315932.Mon, Jan 11, 2:00 PM

Thanks! Addressed the nit and also added an option to limit the number of memory accesses to explore, to avoid excessive compile-times in degenerate cases

fhahn added inline comments.Mon, Jan 11, 2:05 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

Unfortunately a MemoryDef does not necessarily have all may/must-alias defs that follow it as users. For example, I think we could have something like

%0 = MemoryDef (LiveOnEntry)
%1 = MemoryPhi(%0,...)
%2 = MemoryDef(%1,...) ; may-alias %0

depending on what MemorySSA optimizations are applied, I think there could be similar examples with just MemoryDefs.

jdoerfert added inline comments.Mon, Jan 11, 2:27 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

I'm trying to wrap my head around this and it is probably me. I haven't worked with MSSA much.

So, AccessesToCheck starts with the defining access for each read location, so far so good. (correct me if I'm wrong at some point)
If that access is outside the loop, done.
If that access is inside and aliasing a location, done.
If that access is inside and not aliasing a location, why do we look at the uses?
I would understand if we look at the "operands":

Header:
%1 = MemDef(%0,...) // clobbers %3
%2 = MemDef(%1,...) // defining access of %3
%3 = MemUse(%2,...) // in AccessedLocations

Though, I assume I simply do not understand MSSA in various ways.

fhahn added inline comments.Tue, Jan 12, 11:53 AM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

So, AccessesToCheck starts with the defining access for each read location, so far so good. (correct me if I'm wrong at some point)
If that access is outside the loop, done.
If that access is inside and aliasing a location, done.

Sounds good so far.

If that access is inside and not aliasing a location, why do we look at the uses?

Because the uses of access A are the memory accesses that may read/write the memory location written by A, after A. For accesses that may access the same memory locations in a loop/cycle, there will be uses in MemoryPhis.

A concrete example is below. To visit all potential clobbers of ; MemoryUse(1) ( %lv = load i32, i32* %ptr, align 4) , we have to visit all uses of %1 = MemoryDef(4) (call void @clobber()), their uses and so on.

The case where a clobber comes before the defining access in the header (as in your example) will be handled by this forward-scanning approach because there will be a MemoryPhi that's the defining access of %1 in your example.

Does this make sense?

While looking at this again, I realized that queuing the defining access at line 692 could be overly pessimistic, if there are scenarios where it may alias a location in AccessedLocations, but its defining access is outside the loop. It would probably better to directly queue the uses of the defining access. But I am not sure if such a scenario can happen in practice.

define i32 @test(i32* %ptr, i32 %N) {
entry:
  br label %loop.header

loop.header:                                      ; preds = %loop.latch, %entry
; 4 = MemoryPhi({entry,liveOnEntry},{loop.latch,3})
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.latch ]
; 1 = MemoryDef(4)
  call void @clobber()
; MemoryUse(1)
  %lv = load i32, i32* %ptr, align 4
  %sc = icmp eq i32 %lv, 100
  br i1 %sc, label %noclobber, label %clobber

noclobber:                                        ; preds = %loop.header
  br label %loop.latch

clobber:                                          ; preds = %loop.header
; 2 = MemoryDef(1)
  call void @clobber()
  br label %loop.latch

loop.latch:                                       ; preds = %clobber, %noclobber
; 3 = MemoryPhi({noclobber,1},{clobber,2})
  %c = icmp ult i32 %iv, %N
  %iv.next = add i32 %iv, 1
  br i1 %c, label %loop.header, label %exit

exit:                                             ; preds = %loop.latch
  ret i32 10
}
jdoerfert accepted this revision.Tue, Jan 12, 12:15 PM

I think I got it now. Due to the PHI nodes we can follow uses and eventually visit every definition in the loop (we care about). LGTM. Feel free to wait for someone else to look at it, or not.

This revision is now accepted and ready to land.Tue, Jan 12, 12:15 PM
jonpa added a comment.Wed, Jan 13, 4:36 PM

If a dead path in a loop is unswitched into an empty loop, I suppose the idea is that LoopDeletion will later then delete it?

What about the remaining original loop: will it remain the same or will it get the edge into the dead path redirected to the new smaller loop?

mdchen added a subscriber: mdchen.Wed, Jan 13, 6:53 PM
mdchen added inline comments.Thu, Jan 14, 7:20 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
671

I guess only the conditional value needs to be added here.

692

MemUse could be nullptr.

713

pop_back_val()