This is an archive of the discontinued LLVM Phabricator instance.

[SCEV] Check IDom with single predecessor on getPredecessorWithUniqueSuccessorForBB
AbandonedPublic

Authored by jaykang10 on Apr 27 2021, 3:56 PM.

Details

Summary

getPredecessorWithUniqueSuccessorForBB finds a BB which has a single predecessor and dominates target BB.

At this moment, the function is checking below two things.

  1. Target BB has single predecessor. If so, it returns predecessor and target BB.
  2. Target BB is part of loop. If so, it returns the loop's pre-header and header.

We could check one more thing as below.

If Target BB has IDom and the IDom has single predecessor, it also meet the condition of getPredecessorWithUniqueSuccessorForBB and we can return the IDom's predecessor and the IDom.

Diff Detail

Event Timeline

jaykang10 created this revision.Apr 27 2021, 3:56 PM
jaykang10 requested review of this revision.Apr 27 2021, 3:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2021, 3:56 PM
nikic added inline comments.Apr 28 2021, 12:42 PM
llvm/lib/Analysis/ScalarEvolution.cpp
9313

I was going to suggest that we can generalize this by walking up the idom chain (this also removes the need to special case loops):

while (true) {
  if (const BasicBlock *Pred = BB->getSinglePredecessor())
    return {Pred, BB};

  DomTreeNode *IDom = DT.getNode(BB)->getIDom();
  if (!IDom)
    break;

  BB = IDom->getBlock();
}

return {nullptr, nullptr};

Unfortunately, this has a bad compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=41849a91956755e15591240816d1d0c5ec402895&to=41241255419965bbfbc6d361058ea02449be9c50&stat=instructions ClamAV is up nearly 3%.

So I also tested your original patch: https://llvm-compile-time-tracker.com/compare.php?from=41849a91956755e15591240816d1d0c5ec402895&to=a012ed61e795b191cd7d114bfe972ce05c9f3d6b&stat=instructions This is better, but still bad, especially with little codegen impact.

isImpliedCond() is notoriously expensive, so compile-time is sensitive to the amount of guards we look at. I think if we want to do any extensions here, we need to also aggressively limit the number of edges for which we take conditions into account.

jaykang10 added inline comments.Apr 28 2021, 1:55 PM
llvm/lib/Analysis/ScalarEvolution.cpp
9313

I appreciate your comment.

I was going to suggest that we can generalize this by walking up the idom chain (this also removes the need to special case loops):

while (true) {
  if (const BasicBlock *Pred = BB->getSinglePredecessor())
    return {Pred, BB};

  DomTreeNode *IDom = DT.getNode(BB)->getIDom();
  if (!IDom)
    break;

  BB = IDom->getBlock();
}

return {nullptr, nullptr};

Yep, if we walk up dominator tree, it is more general solution and we can remove the code to check the loop.

Unfortunately, this has a bad compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=41849a91956755e15591240816d1d0c5ec402895&to=41241255419965bbfbc6d361058ea02449be9c50&stat=instructions ClamAV is up nearly 3%.

So I also tested your original patch: https://llvm-compile-time-tracker.com/compare.php?from=41849a91956755e15591240816d1d0c5ec402895&to=a012ed61e795b191cd7d114bfe972ce05c9f3d6b&stat=instructions This is better, but still bad, especially with little codegen impact.

Thanks for checking compile time and codegen impact. At this point, I need to explain why I am trying to add changes to SCEV and IRCE pass.

I am trying to vectorize loops. Let's see a simple loop.

while() {
...
if ()
...
}

As you can see, there is if statement inside loop. As you know, LoopVectorizer tries to predicate the block of if statement with its condition and it asks target machines the cost of the instructions. If there are load or store instructions in the block to be predicated and target machine does not support masked load and store, LoopVectorizer assigns big number to the load and store instructions and it means we can not vectorize the loop. In this case, it is important to remove the if statement inside loop before LoopVectorizer.

We need to consider two cases to remove the if statement inside loop as below.

  1. if statement's condition with loop invariant variables
  2. if statement's condition with induction variables

For the first one, UnSwitch/SimpleUnSwitch pass handles the case. The passes hoist the loop invariant condition outside loop and unswitch loop.
For the second one, there could be several transformations but I think the IRCE pass is general solution. The pass splits the iteration space following the condition with induction variable.

I am trying the above passes handle more cases in order to make more loops vectorizable. Especially for the IRCE pass, the pass is using isLoopEntryGuardedByCond for bound check of new loop and it sometimes blocks the pass processes loop even though the loop is fine. This is a reason why I am trying to extend the SCEV's functions.

At this moment, the only SimpleUnSwitch pass is in pipeline of new pass manager but IRCE is not. Therefore, it is not easy to see the codegen impact from this SCEV change because they aim to help IRCE pass.

If possible, I would like to enable IRCE in the pipeline of new pass manager with these SCEV changes because it would make more loops vectorizable.

isImpliedCond() is notoriously expensive, so compile-time is sensitive to the amount of guards we look at. I think if we want to do any extensions here, we need to also aggressively limit the number of edges for which we take conditions into account.

Let me think about it.

reames added inline comments.
llvm/lib/Analysis/ScalarEvolution.cpp
9313

@mkazantsev and I have spent a lot of time working on vectorization of similar loops. If you want to talk through this, I'd be happy to jump on a call. You may also find my talk at last years llvm conference on multiple exit loops useful background.

A couple of general comments.

I really don't think that extending IRCE is the right path forward here. IRCE has some serious design defects, and I'm honestly quite nervous about it's correctness. I think that iteration set splitting (the basic transform IRCE uses) is absolutely something we should implement for the main pipeline, but I'd approach it as building new infrastructure to replace IRCE, not as getting IRCE on by default. In particular, I suspect the value comes primarily from a cost model driven approach to splitting, not IRCE's unconditional one.

Second, I advise being very cautious about going directly for the general case here. The general case for this is *really really hard*. If it wasn't, we'd already have robust solutions. If you can describe your motivating examples in a bit more depth (maybe offline), we can see if we can find a specific sub-case which is both tractable and profitable.

If we continue this discussion, I'd encourage a move to llvm-dev. This is going to be a longer discussion then is easy to follow in phab, and probably worth exposing to a broader audience anyways.

jaykang10 added inline comments.Apr 29 2021, 7:13 AM
llvm/lib/Analysis/ScalarEvolution.cpp
9313

@mkazantsev and I have spent a lot of time working on vectorization of similar loops. If you want to talk through this, I'd be happy to jump on a call. You may also find my talk at last years llvm conference on multiple exit loops useful background.

Ah, Thanks for sharing it. I am also happy to have a call. I will have a look at your talks and IRCE pass more first and then let's have a call together.

A couple of general comments.

I really don't think that extending IRCE is the right path forward here. IRCE has some serious design defects, and I'm honestly quite nervous about it's correctness. I think that iteration set splitting (the basic transform IRCE uses) is absolutely something we should implement for the main pipeline, but I'd approach it as building new infrastructure to replace IRCE, not as getting IRCE on by default. In particular, I suspect the value comes primarily from a cost model driven approach to splitting, not IRCE's unconditional one.

um... let me have a look at the IRCE pass more for correctness you mentioned. Thanks for letting me know.

Second, I advise being very cautious about going directly for the general case here. The general case for this is *really really hard*. If it wasn't, we'd already have robust solutions. If you can describe your motivating examples in a bit more depth (maybe offline), we can see if we can find a specific sub-case which is both tractable and profitable.

I agree it would be very hard to generalize all cases.

I have reduced the test case as below.

target triple = "aarch64"

define void @foo(i64 %a, i64* noalias %src, i64* noalias %dst, i64 %n) {
entry:
  br label %bb 

bb:
  %if.cond = icmp slt i64 1, %n
  br i1 %if.cond, label %if.then, label %exit

if.then:
  %if.cond.2 = icmp slt i64 10, %n
  br i1 %if.cond.2, label %loop.ph, label %if.then.else

if.then.else:
  br label %loop.ph

loop.ph:
  br label %loop

loop:
  %iv = phi i64 [ %inc, %for.inc ], [ 1, %loop.ph ]
  %cmp = icmp slt i64 %iv, %a
  br i1 %cmp, label %if.then.2, label %for.inc

if.then.2:
  %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv 
  %val = load i64, i64* %src.arrayidx
  %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv 
  store i64 %val, i64* %dst.arrayidx
  br label %for.inc

for.inc:
  %inc = add nuw nsw i64 %iv, 1
  %cond = icmp eq i64 %inc, %n
  br i1 %cond, label %exit, label %loop

exit:
  ret void
}

If we try to vectorize above code, we can see below debug output.

opt  -loop-vectorize -S ./test.ll  -debug-only=loop-vectorize

LV: Scalar loop costs: 5.
...
LV: Found an estimated cost of 3000000 for VF 2 For instruction:   %val = load i64, i64* %src.arrayidx, align 4
LV: Vector loop of width 2 costs: 1500004.
...
LV: Vectorization is possible but not beneficial

If we run IRCE pass ahead of loop vectorizer with SCEV changes https://reviews.llvm.org/D101409 and https://reviews.llvm.org/D100566, we can see the loop is vectorized with below debug output.

opt -irce -irce-skip-profitability-checks -simplifycfg -instcombine -loop-vectorize -S ./test.ll -debug-only=loop-vectorize

LV: Scalar loop costs: 6.
LV: Vector loop of width 2 costs: 2.

I think it is not complex case. If there are other ways to remove conditional branch inside loop or vectorize above example, I will be happy with it.

If we continue this discussion, I'd encourage a move to llvm-dev. This is going to be a longer discussion then is easy to follow in phab, and probably worth exposing to a broader audience anyways.

Yep, I will move this discussion to llvm-dev.

jaykang10 planned changes to this revision.Jun 2 2021, 9:25 AM
mdchen added a subscriber: mdchen.Sep 2 2021, 4:23 AM
lebedev.ri resigned from this revision.Jan 12 2023, 4:50 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 4:50 PM
Herald added a subscriber: StephenFan. · View Herald Transcript
jaykang10 abandoned this revision.Jan 13 2023, 12:11 AM