Page MenuHomePhabricator

[LoopUnswitch] Implement first version of partial unswitching.
ClosedPublic

Authored by fhahn on Dec 23 2020, 7:25 AM.

Details

Summary

This patch applies the idea from D93734 to LoopUnswitch.

It adds support for unswitching on conditions that are only
invariant along certain paths through a loop.

In particular, it targets conditions in the loop header that
depend on values loaded from memory. If either path from
the true or false successor through the loop does not modify
memory, perform partial loop unswitching.

That is, duplicate the instructions feeding the condition in the pre-header.
Then unswitch on the duplicated condition. The condition is now known
in the unswitched version for the 'invariant' path through the original loop.

On caveat of this approach is that one of the loops created can be partially
unswitched again. To avoid this behavior, llvm.loop.unswitch.partial.disable
metadata is added to the unswitched loops, to avoid subsequent partial
unswitching.

If that's the approach to go, I can move the code handling the metadata kind
into separate functions.

This increases the cases we unswitch quite a bit in SPEC2006/SPEC2000 &
MultiSource. It also allows us to eliminate a dead loop in SPEC2017's omnetpp

Tests: 236
Same hash: 170 (filtered out)
Remaining: 66
Metric: loop-unswitch.NumBranches

Program                                        base   patch  diff
 test-suite...000/255.vortex/255.vortex.test     2.00  23.00 1050.0%
 test-suite...T2006/401.bzip2/401.bzip2.test     7.00  55.00 685.7%
 test-suite :: External/Nurbs/nurbs.test         5.00  26.00 420.0%
 test-suite...s-C/unix-smail/unix-smail.test     1.00   3.00 200.0%
 test-suite.../Prolangs-C++/ocean/ocean.test     1.00   3.00 200.0%
 test-suite...tions/lambda-0.1.3/lambda.test     1.00   3.00 200.0%
 test-suite...yApps-C++/PENNANT/PENNANT.test     2.00   5.00 150.0%
 test-suite...marks/Ptrdist/yacr2/yacr2.test     1.00   2.00 100.0%
 test-suite...lications/viterbi/viterbi.test     1.00   2.00 100.0%
 test-suite...plications/d/make_dparser.test    12.00  24.00 100.0%
 test-suite...CFP2006/433.milc/433.milc.test    14.00  27.00 92.9%
 test-suite.../Applications/lemon/lemon.test     7.00  12.00 71.4%
 test-suite...ce/Applications/Burg/burg.test     6.00  10.00 66.7%
 test-suite...T2006/473.astar/473.astar.test    16.00  26.00 62.5%
 test-suite...marks/7zip/7zip-benchmark.test    78.00 121.00 55.1%

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
fhahn retitled this revision from [LoopUnswitch] Implement first version of partial unswitching. (WIP) to [LoopUnswitch] Implement first version of partial unswitching..Dec 27 2020, 11:43 AM
fhahn edited the summary of this revision. (Show Details)
jonpa added a comment.Dec 27 2020, 5:10 PM

This patch applies the idea from D93734 to LoopUnswitch.

One disadvantage here is that LoopUnswitch has a size-limitation on the loop it handles, which is not really true: If a no-side-effects loop is created and the intent is to later delete it, it should not be limited by the size heuristic.

On the other hand, I wonder if the idea here is to only isolate totally dead paths per this first draft or perhaps also paths where specifically the condition is invariant? This would be the advantage I can see compared to D93734: to increase unswitching where there may be side-effects that is known not to alias with the ToDuplicate instructions.

What about the original loop: If it is entered (instead of the new smaller loop your patch creates), it may still be true that the path where the condition becomes invariant is entered. So (at least theoretically) the full loop should in that case branch into the smaller loop. This would also eliminate the need to duplicate instructions in the preheader.

If we only want to achieve the "early exit" result, in other words that this patch will create a new loop with no side-effects (and then depend on later passes to remove it), it seems simpler to me to have this as a separate pass (or perhaps as a fix-up in LoopDeletion per D93734).

xbolva00 added a comment.EditedDec 28 2020, 1:19 AM

GCC has separate pass for this - loopsplit. It can handle also general case like:

for (i = 0; i < 100; i++)
     {
       if (i < 50)
         A;
       else
         B;
     }
   into:
   for (i = 0; i < 50; i++)
     {
       A;
     }
   for (; i < 100; i++)
     {
       B;
     }

https://godbolt.org/z/zqsPY6

fhahn updated this revision to Diff 313906.Dec 28 2020, 2:46 PM

Updated to use MemorySSA to allow non-clobbering stores in the 'invariant' part. There's a crash while updating MemorySSA, probably because of a new case of loop not accounted in the existing updating. Will fix tomorrow.

This patch applies the idea from D93734 to LoopUnswitch.

One disadvantage here is that LoopUnswitch has a size-limitation on the loop it handles, which is not really true: If a no-side-effects loop is created and the intent is to later delete it, it should not be limited by the size heuristic.

That is true, but the size of the overall loop is probably not too big in practice, if there is a no-op part. If we implement your suggestion below (to jump to the smaller one for the invariant case in the larger one), the size increase should be negligible. It would be interesting to check how many dead loops get missed by unswitching. If there are enough cases, it should be easy to share the detection logic between both passes and also use it in LoopDeletion.

On the other hand, I wonder if the idea here is to only isolate totally dead paths per this first draft or perhaps also paths where specifically the condition is invariant? This would be the advantage I can see compared to D93734: to increase unswitching where there may be side-effects that is known not to alias with the ToDuplicate instructions.

Yes, I think the main advantage of the 'partial' unswitching is that neither path needs to be dead, it just cannot clobber the condition we unswitch on. So for example, reductions would be fine or stores to locations that do not alias any loads feeding the condition. The original version of the patch did not yet implement that. But I updated the patch to use MemorySSA to also catch this case.

What about the original loop: If it is entered (instead of the new smaller loop your patch creates), it may still be true that the path where the condition becomes invariant is entered. So (at least theoretically) the full loop should in that case branch into the smaller loop. This would also eliminate the need to duplicate instructions in the preheader.

Yes, in some cases it would be possible to just jump to the smaller loop. Unfortunately this will require some extra checks and work to set up the continuation values in the smaller loop. This is certainly something interesting worth exploring, but probably not in the initial version.

If we only want to achieve the "early exit" result, in other words that this patch will create a new loop with no side-effects (and then depend on later passes to remove it), it seems simpler to me to have this as a separate pass (or perhaps as a fix-up in LoopDeletion per D93734).

The aim for this patch is to catch more cases in general, including cases that are not no-ops. The updated version supports memory ops that do not alias the loads feeding the condition.

fhahn added a comment.Dec 28 2020, 2:50 PM

GCC has separate pass for this - loopsplit. It can handle also general case like:

for (i = 0; i < 100; i++)
     {
       if (i < 50)
         A;
       else
         B;
     }
   into:
   for (i = 0; i < 50; i++)
     {
       A;
     }
   for (; i < 100; i++)
     {
       B;
     }

https://godbolt.org/z/zqsPY6

I think this is a separate issue (the condition is dependent on the induction variable), but nevertheless interesting to pursue. IIRC there is already a pass that tries to do something similar, but I don't know which one of the top of my head. Will check.

This comment was removed by xbolva00.

I think this is a separate issue (the condition is dependent on the induction variable), but nevertheless interesting to pursue. IIRC there is already a pass that tries to do something similar, but I don't know which one of the top of my head. Will check.

This is InductiveRangeCheckElimination which implements a form of iteration set splitting. I'll warn that IRCE has been historically bug prone, and is not on by default upstream.

Overall, I think this is the right place for such logic. Some initial code comments.

TBH, I think it would be nicer to "search for the paths" instead. Though, a TODO might be sufficient for the start.
What I think would be nice if we do not only deal with the conditional in the header, still thinking of *l && *r.

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
675

.pop_back_val() ?

677

Does code generation handle the case where a loop invariant instruction is inside the loop? With the comment below, this might be changed to L->contains(I) instead.

680

"Later" we could allow anything that is speculatively executable and loads, right?

fhahn updated this revision to Diff 314219.Jan 1 2021, 7:12 AM

Overall, I think this is the right place for such logic. Some initial code comments.

Thanks for taking a look!

TBH, I think it would be nicer to "search for the paths" instead. Though, a TODO might be sufficient for the start.

Do you mean not using MemorySSA initially? In a way, the current approach first collects all blocks on a path and then walks MemorySSA downwards to check all potential clobbers in blocks on the path. Instead of that, we could just iterate over all instructions in the blocks and bail out if any has side-effects, as in the original version.

The main advantage of the MemorySSA version is that it allows us to just check the potentially clobbering instructions, so this is mostly an optimization. If it makes things easier, that could be done as a separate patch?

What I think would be nice if we do not only deal with the conditional in the header, still thinking of *l && *r.

IIUC there are 2 things that could be improved: currently the condition must be an icmp, but this is mostly an artificial initial restriction that is easy to lift subsequently, e.g. to allow OR/AND instructions as conditions.

And we could also look for conditional branches in other blocks than the header. There very likely will be cases where this helps, but I am not sure how frequently it will trigger compared to the additional compile-time spent. Certainly something to explore further though.

fhahn updated this revision to Diff 314234.Jan 1 2021, 2:53 PM
fhahn marked an inline comment as done.

Add a few additional test cases.

fhahn added a comment.Jan 1 2021, 2:54 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

Loop-invariant instructions will be duplicated outside the loop at the moment (added a test). With 'handling them', do you mean updating the uses inside the loop to use the hoisted instruction? Not sure if that will help a lot in practice, as they should already have been moved out I think and/or GVN/LICM will clean them up.

680

yep I think so.

I refined one of my comments. I'm other than that OK with this. Let's wait for a few days and get a second opinion.

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

So, my problem was that I assumed something like X below might be considered "loop invariant". However, the API in llvm::Loop does simply perform a constains check anyway, which is what I suggested instead. That said, I think it is confusing to call isLoopInvariant here because what we want/need is contains. If someone later recognizes in isLoopInvariant that X is not loop variant, we would not put the add in the toDuplicate set and fail to create valid code, agreed? (Right now it would only show for GEP but it's the same story.)

int A[100];
int X, a = ..., b = ...;
for (...) {
  X = a + b;
  A[i] = X;
}
fhahn updated this revision to Diff 314300.Jan 3 2021, 12:42 PM

rebased after precommitting tests in edb52c626b5340a5a42ed833fc776bc815507283

Also updated to use !L->contains() instead of isLoopInvariant()

fhahn added inline comments.Jan 3 2021, 12:43 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
677

That said, I think it is confusing to call isLoopInvariant here because what we want/need is contains

Yeah, given that we only call it for instructions anyways, using contains seems clearer. Updated.

fhahn updated this revision to Diff 315804.Jan 11 2021, 7:24 AM

Rebase & ping :)

Generally fine, I have one questions though

llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
716

Nit: const auto &

748

Why do we need to check these uses?

fhahn updated this revision to Diff 315932.Jan 11 2021, 2:00 PM

Thanks! Addressed the nit and also added an option to limit the number of memory accesses to explore, to avoid excessive compile-times in degenerate cases

fhahn added inline comments.Jan 11 2021, 2:05 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

Unfortunately a MemoryDef does not necessarily have all may/must-alias defs that follow it as users. For example, I think we could have something like

%0 = MemoryDef (LiveOnEntry)
%1 = MemoryPhi(%0,...)
%2 = MemoryDef(%1,...) ; may-alias %0

depending on what MemorySSA optimizations are applied, I think there could be similar examples with just MemoryDefs.

jdoerfert added inline comments.Jan 11 2021, 2:27 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

I'm trying to wrap my head around this and it is probably me. I haven't worked with MSSA much.

So, AccessesToCheck starts with the defining access for each read location, so far so good. (correct me if I'm wrong at some point)
If that access is outside the loop, done.
If that access is inside and aliasing a location, done.
If that access is inside and not aliasing a location, why do we look at the uses?
I would understand if we look at the "operands":

Header:
%1 = MemDef(%0,...) // clobbers %3
%2 = MemDef(%1,...) // defining access of %3
%3 = MemUse(%2,...) // in AccessedLocations

Though, I assume I simply do not understand MSSA in various ways.

fhahn added inline comments.Jan 12 2021, 11:53 AM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
748

So, AccessesToCheck starts with the defining access for each read location, so far so good. (correct me if I'm wrong at some point)
If that access is outside the loop, done.
If that access is inside and aliasing a location, done.

Sounds good so far.

If that access is inside and not aliasing a location, why do we look at the uses?

Because the uses of access A are the memory accesses that may read/write the memory location written by A, after A. For accesses that may access the same memory locations in a loop/cycle, there will be uses in MemoryPhis.

A concrete example is below. To visit all potential clobbers of ; MemoryUse(1) ( %lv = load i32, i32* %ptr, align 4) , we have to visit all uses of %1 = MemoryDef(4) (call void @clobber()), their uses and so on.

The case where a clobber comes before the defining access in the header (as in your example) will be handled by this forward-scanning approach because there will be a MemoryPhi that's the defining access of %1 in your example.

Does this make sense?

While looking at this again, I realized that queuing the defining access at line 692 could be overly pessimistic, if there are scenarios where it may alias a location in AccessedLocations, but its defining access is outside the loop. It would probably better to directly queue the uses of the defining access. But I am not sure if such a scenario can happen in practice.

define i32 @test(i32* %ptr, i32 %N) {
entry:
  br label %loop.header

loop.header:                                      ; preds = %loop.latch, %entry
; 4 = MemoryPhi({entry,liveOnEntry},{loop.latch,3})
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.latch ]
; 1 = MemoryDef(4)
  call void @clobber()
; MemoryUse(1)
  %lv = load i32, i32* %ptr, align 4
  %sc = icmp eq i32 %lv, 100
  br i1 %sc, label %noclobber, label %clobber

noclobber:                                        ; preds = %loop.header
  br label %loop.latch

clobber:                                          ; preds = %loop.header
; 2 = MemoryDef(1)
  call void @clobber()
  br label %loop.latch

loop.latch:                                       ; preds = %clobber, %noclobber
; 3 = MemoryPhi({noclobber,1},{clobber,2})
  %c = icmp ult i32 %iv, %N
  %iv.next = add i32 %iv, 1
  br i1 %c, label %loop.header, label %exit

exit:                                             ; preds = %loop.latch
  ret i32 10
}
jdoerfert accepted this revision.Jan 12 2021, 12:15 PM

I think I got it now. Due to the PHI nodes we can follow uses and eventually visit every definition in the loop (we care about). LGTM. Feel free to wait for someone else to look at it, or not.

This revision is now accepted and ready to land.Jan 12 2021, 12:15 PM
jonpa added a comment.Jan 13 2021, 4:36 PM

If a dead path in a loop is unswitched into an empty loop, I suppose the idea is that LoopDeletion will later then delete it?

What about the remaining original loop: will it remain the same or will it get the edge into the dead path redirected to the new smaller loop?

mdchen added a subscriber: mdchen.Jan 13 2021, 6:53 PM
mdchen added inline comments.Jan 14 2021, 7:20 PM
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
671

I guess only the conditional value needs to be added here.

692

MemUse could be nullptr.

713

pop_back_val()

fhahn updated this revision to Diff 317231.Jan 17 2021, 10:34 AM

If a dead path in a loop is unswitched into an empty loop, I suppose the idea is that LoopDeletion will later then delete it?

Yes exactly, this work is delegated to LoopDeletion; LoopUnswitch does not really care if the loop is dead at the moment or not, just if a condition in the body can be simplified by moving out a memory dependent check.

What about the remaining original loop: will it remain the same or will it get the edge into the dead path redirected to the new smaller loop?

At the moment, it remains the same. But it would be a great extension to update the body to jump to the unswitched version, if possible. I think this would best be done in a separate patch.

fhahn marked an inline comment as done.Jan 17 2021, 10:36 AM
fhahn added inline comments.
llvm/lib/Transforms/Scalar/LoopUnswitch.cpp
671

Yes that would be sufficient, if the loop below would support compares, which it does not in the current version. It would be good to extend the supported instructions below to any arithmetic/compare instruction in the future though.

692

I think we should not reach the code, if nullptr is assigned to MemUse in the if condition.

713

Thanks, I updated the code!

fhahn marked an inline comment as done.Jan 20 2021, 10:20 AM

Thanks everyone for the feedback. I am planning on landing the change tomorrow.

dmajor added a subscriber: dmajor.Jan 21 2021, 4:31 PM

To give you an early heads up, we're seeing crashes in Firefox after this commit. Unfortunately it's only happening on Mac, which I am not equipped to debug. I may need to try enlisting a colleague to investigate and/or throw more test suites at it and hope we can capture something on Linux or Windows.

Since there's a release branch point coming soon, if we don't get anywhere by the end of the week, what do you think about reverting until after 12.0.0-rc1 is tagged?

I'm also seeing lots of hangs/infinite loops in code built after this commit. I don't have anything nicely reduced to dig in on (yet), but it broke a significant amount of my testing setup.

fhahn added a comment.Jan 22 2021, 3:08 AM

To give you an early heads up, we're seeing crashes in Firefox after this commit. Unfortunately it's only happening on Mac, which I am not equipped to debug. I may need to try enlisting a colleague to investigate and/or throw more test suites at it and hope we can capture something on Linux or Windows.

Since there's a release branch point coming soon, if we don't get anywhere by the end of the week, what do you think about reverting until after 12.0.0-rc1 is tagged?

I'm also seeing lots of hangs/infinite loops in code built after this commit. I don't have anything nicely reduced to dig in on (yet), but it broke a significant amount of my testing setup.

Thanks for the heads up! It would be great to get a reproducer, but I'll revert by the end of day if the issue is not resolved by then.

I've got one problem narrowed down to one source file in a project; https://martin.st/temp/dav1d-thread_task-preproc.c, compiled with clang -target x86_64-w64-mingw32 -c -O3 does trigger the issue - I haven't dug in closer to zoom in on exactly what it is though - hopefully you can spot the differences.

fhahn added a comment.Jan 22 2021, 7:12 AM

I've got one problem narrowed down to one source file in a project; https://martin.st/temp/dav1d-thread_task-preproc.c, compiled with clang -target x86_64-w64-mingw32 -c -O3 does trigger the issue - I haven't dug in closer to zoom in on exactly what it is though - hopefully you can spot the differences.

One issue with the file was that the patch did not account for atomic loads and would add additional atomic loads. I fixed that in 86991d323133. Please let me know if that fixes your issue. Otherwise I'll revert the patch to investigate further.

One issue with the file was that the patch did not account for atomic loads and would add additional atomic loads. I fixed that in 86991d323133. Please let me know if that fixes your issue. Otherwise I'll revert the patch to investigate further.

Ah, that sounds promising! On our side we haven't got down to specific details yet but it _may_ be related to atomics. I've kicked off new tests on 86991d323133, should have results in 2-3 hours.

I've got one problem narrowed down to one source file in a project; https://martin.st/temp/dav1d-thread_task-preproc.c, compiled with clang -target x86_64-w64-mingw32 -c -O3 does trigger the issue - I haven't dug in closer to zoom in on exactly what it is though - hopefully you can spot the differences.

One issue with the file was that the patch did not account for atomic loads and would add additional atomic loads. I fixed that in 86991d323133. Please let me know if that fixes your issue. Otherwise I'll revert the patch to investigate further.

Thanks! I think it looks mostly good now - I'm still seeing some breakage compared to thursday, but I'll have to bisect those errors separately to see where they broke.

86991d323133 looks good from our side, thank you!

Thanks! I think it looks mostly good now - I'm still seeing some breakage compared to thursday, but I'll have to bisect those errors separately to see where they broke.

The other issues turned out to be false alarms, so I think everything should be in order now again. Thanks!

fhahn added a comment.Jan 25 2021, 3:48 AM

Thank you very much for the quick feedback! Glad that it looks like the issues have been resolved.

uabelho added a subscriber: uabelho.

Hi!

I started seeing crashes with this patch:

opt -o /dev/null bbi-52312.ll -loop-unswitch

Result:

opt: ../lib/Analysis/MemorySSAUpdater.cpp:1154: void llvm::MemorySSAUpdater::applyInsertUpdates(ArrayRef<llvm::CFGUpdate>, llvm::DominatorTree &, const GraphDiff<llvm::BasicBlock *> *): Assertion `IDom && "Block must have a valid IDom."' failed.

For another program I see

opt: ../lib/Analysis/MemorySSA.cpp:2032: void llvm::MemorySSA::verifyOrderingDominationAndDefUses(llvm::Function &) const: Assertion `dominates(MD, U) && "Memory Def does not dominate it's uses"' failed.

when I run

opt -O1 -inline -loop-unswitch -verify-memoryssa

with this patch but unfortunately I haven't been able to reduce that down to a simpler command line (it doesn't reproduce if I run the passes I get from -debug-pass=Arguments) and the input uses some features we only have downstream so I don't have a reproducer I can share for this :(

fhahn added a comment.Jan 29 2021, 2:11 PM

Hi!

I started seeing crashes with this patch:

opt -o /dev/null bbi-52312.ll -loop-unswitch

Result:

opt: ../lib/Analysis/MemorySSAUpdater.cpp:1154: void llvm::MemorySSAUpdater::applyInsertUpdates(ArrayRef<llvm::CFGUpdate>, llvm::DominatorTree &, const GraphDiff<llvm::BasicBlock *> *): Assertion `IDom && "Block must have a valid IDom."' failed.

Thanks ,let me take a look!

fhahn added a comment.Jan 30 2021, 7:46 AM

Hi!

I started seeing crashes with this patch:

opt -o /dev/null bbi-52312.ll -loop-unswitch

Result:

opt: ../lib/Analysis/MemorySSAUpdater.cpp:1154: void llvm::MemorySSAUpdater::applyInsertUpdates(ArrayRef<llvm::CFGUpdate>, llvm::DominatorTree &, const GraphDiff<llvm::BasicBlock *> *): Assertion `IDom && "Block must have a valid IDom."' failed.

Thanks ,let me take a look!

Should be fixed by 10c57268c074. Please let me know if you are seeing any other issues.

Hi!

I started seeing crashes with this patch:

opt -o /dev/null bbi-52312.ll -loop-unswitch

Result:

opt: ../lib/Analysis/MemorySSAUpdater.cpp:1154: void llvm::MemorySSAUpdater::applyInsertUpdates(ArrayRef<llvm::CFGUpdate>, llvm::DominatorTree &, const GraphDiff<llvm::BasicBlock *> *): Assertion `IDom && "Block must have a valid IDom."' failed.

Thanks ,let me take a look!

Should be fixed by 10c57268c074. Please let me know if you are seeing any other issues.

I guess that one needs to be backported to 12.x, after cooking in main for a while.

fhahn added a comment.Jan 30 2021, 8:30 AM

Should be fixed by 10c57268c074. Please let me know if you are seeing any other issues.

I guess that one needs to be backported to 12.x, after cooking in main for a while.

I picked the fix on 12.x already for https://bugs.llvm.org/show_bug.cgi?id=48865. I'll keep an eye out for any issues.

Should be fixed by 10c57268c074. Please let me know if you are seeing any other issues.

Thanks! Both problems I saw went away with the fix!

Maybe I'm missing something, but this change doesn't seem to be effective anymore after the new pass manager switcheroo. Did this pass not get ported, in favour of SimpleLoopUnswitch? This now shows up as a regression relative to (what will be) LLVM 12.

fhahn added a comment.Feb 8 2021, 10:01 AM

Maybe I'm missing something, but this change doesn't seem to be effective anymore after the new pass manager switcheroo. Did this pass not get ported, in favour of SimpleLoopUnswitch? This now shows up as a regression relative to (what will be) LLVM 12.

LoopUnswitch is legacy PM only unfortunately. This needs to be ported to SimpleLoopUNswitch.