This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Introduce and use BranchOnCount VPInstruction.
ClosedPublic

Authored by fhahn on Jan 1 2022, 12:41 PM.

Details

Summary

This patch adds a new BranchOnCount VPInstruction opcode with 2
operands. It first compares its 2 operands (increment of canonical
induction and vector trip count), followed by a branch to either the
exit block or back to the vector header.

It must be the last recipe in the exit block of the topmost vector loop
region.

This extracts parts from D113224 and was discussed in D113223.

Diff Detail

Event Timeline

fhahn created this revision.Jan 1 2022, 12:41 PM
fhahn requested review of this revision.Jan 1 2022, 12:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 1 2022, 12:41 PM
Herald added a subscriber: vkmr. · View Herald Transcript
fhahn updated this revision to Diff 396874.Jan 1 2022, 12:44 PM

Update addCanonicalIVRecipes comment.

Ayal added a comment.Jan 2 2022, 10:00 AM

Combine ExitCheckAndBranch with CanonicalIVIncrement{NUW} into a single BranchOnCount recipe or two BranchOnCount{NUW} VPInstructions?
It is conceptually a single inc-cmp-br idiom, akin to PowerPC's "bdnz" instruction - except instead of decrementing an implicit count register until zero we increment an IV until VectorTripCount.

fhahn updated this revision to Diff 396938.Jan 2 2022, 11:08 AM

Move VectorTripCount printing from D113223 to this patch.

Combine ExitCheckAndBranch with CanonicalIVIncrement{NUW} into a single BranchOnCount recipe or two BranchOnCount{NUW} VPInstructions?
It is conceptually a single inc-cmp-br idiom, akin to PowerPC's "bdnz" instruction - except instead of decrementing an implicit count register until zero we increment an IV until VectorTripCount.

I'm not sure about the benefits of having the incremenet included in the CheckAndBranch opcode. It would reduce the number of opcodes, but may have the following potential drawbacks:

  1. Once we can model live-ins, we may be able to get rid of the specialized increment opcodes and model it as add with the increment as operand,
  2. increments for other phi recipes also need to be modeled explicitly, so having an explicit increment for the canonical IV seems more consistent,
  3. we may want to remove the 'increment' instruction in a VPlan transform in the future, e.g. because we have a canonical vector induction which could be used instead of the scalar one.
Ayal added inline comments.Jan 2 2022, 1:06 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8962

setCondBit under IsVPlanNative?

9424

Setting insert point once before the loop is fine; in addition, selects are now introduced at the beginning of the latch rather than at the end? Make sure live-out instruction Red appears before the latch?
Perhaps insert at/before Last(NonTerminal?) of latch , akin to FirstNonPhi?

llvm/lib/Transforms/Vectorize/VPlan.cpp
734

early-break if Part > 0?

746

Once the exiting block is modelled in VPlan, it should help retrieve its IRBB directly (TODO).

945–946

So if now LastBB is expected to terminate with unreachable rather than branch in native as well, remove the EnableVPlanNativePath from the assert?

947

Should the unreachable terminator be removed from LastBB when the branch is introduced, rather than here? As in VPBranchOnMaskRecipe::execute()'s final ReplaceInstWithInst().

Could the compare stay in LastBB, soon to join its branch in VectorLatchBB when the two BB's merge?

It may be possible to associate LatchVPBB with VectorLatchBB upon construction and avoid createEmptyBasicBlock for it during VPBB::execute(), thereby having all latch recipes code-gen'd directly into VectorLatchBB, rather than having to merge separate LastBB and VectorLatchBB here.

Ayal added a comment.Jan 2 2022, 2:04 PM

Move VectorTripCount printing from D113223 to this patch.

Combine ExitCheckAndBranch with CanonicalIVIncrement{NUW} into a single BranchOnCount recipe or two BranchOnCount{NUW} VPInstructions?
It is conceptually a single inc-cmp-br idiom, akin to PowerPC's "bdnz" instruction - except instead of decrementing an implicit count register until zero we increment an IV until VectorTripCount.

I'm not sure about the benefits of having the incremenet included in the CheckAndBranch opcode. It would reduce the number of opcodes, but may have the following potential drawbacks:

  1. Once we can model live-ins, we may be able to get rid of the specialized increment opcodes and model it as add with the increment as operand,
  2. increments for other phi recipes also need to be modeled explicitly, so having an explicit increment for the canonical IV seems more consistent,
  3. we may want to remove the 'increment' instruction in a VPlan transform in the future, e.g. because we have a canonical vector induction which could be used instead of the scalar one.

Main motivation behind a BranchOnCount recipe is to capture higher semantics and a meaningful idiom in VPlan - clearly modelling the scalar IV which controls a countable loop, rather than minimizing number of opcodes or lowering to ordinary operations. How other IV's should be modelled and optimized is indeed worth addressing, aiming for clarity as well while avoiding premature optimization best left to other passes, e.g., LSR. It should also be possible to optimize BranchOnCount away into other recipes as a VPlan to VPlan transformation, if and when desired.

fhahn updated this revision to Diff 397094.Jan 3 2022, 10:01 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

Move VectorTripCount printing from D113223 to this patch.

Main motivation behind a BranchOnCount recipe is to capture higher semantics and a meaningful idiom in VPlan - clearly modelling the scalar IV which controls a countable loop, rather than minimizing number of opcodes or lowering to ordinary operations. How other IV's should be modelled and optimized is indeed worth addressing, aiming for clarity as well while avoiding premature optimization best left to other passes, e.g., LSR. It should also be possible to optimize BranchOnCount away into other recipes as a VPlan to VPlan transformation, if and when desired.

Sounds good. Should I rename CheckExitAndBranch -> BranchOnCount? The scalar IV should already be modeled clearly via the dedicated recipe, irrespective of where the bump happens. I left it as is for now, but folding the count in should be a trivial change.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8962

Updated, thanks!

llvm/lib/Transforms/Vectorize/VPlan.cpp
734

Adjusted, thanks!

746

I added the todo, thanks!

945–946

Removed VPlanNativePath from assert, thanks!

947

Should the unreachable terminator be removed from LastBB when the branch is introduced, rather than here? As in VPBranchOnMaskRecipe::execute()'s final ReplaceInstWithInst().

Updated, thanks!

Could the compare stay in LastBB, soon to join its branch in VectorLatchBB when the two BB's merge?

It could, but it would require updating ~50 tests. The reason is that widenIntOrFpInduction will still generate the vector induction increments in VectorLatchBB, which is outside VPlan.

It may be possible to associate LatchVPBB with VectorLatchBB upon construction and avoid createEmptyBasicBlock for it during VPBB::execute(), thereby having all latch recipes code-gen'd directly into VectorLatchBB, rather than having to merge separate LastBB and VectorLatchBB here.

It seems like this would require threading through VectorLatchBB to VPBasicBlock::execute. Not sure if that extra work is justified for the temporary workaround that won't be needed once the latch block is modeled in VPlan directly.

fhahn added inline comments.Jan 4 2022, 5:12 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
938

I looked into removing the separate latch block, but unfortunately widenIntOrFpInduction still relies on the latch block being created up-front:

// Move the last step to the end of the latch block. This ensures consistent
// placement of all induction updates.
auto *LoopVectorLatch = LI->getLoopFor(LoopVectorBody)->getLoopLatch();
auto *Br = cast<BranchInst>(LoopVectorLatch->getTerminator());
LastInduction->moveBefore(Br);
LastInduction->setName("vec.ind.next");

Reusing the created vector latch for the exit block of the region is also not straight-forward, because we may need to look through multiple single successors until we reach a the exit VPBB. This is complicated by blocks inside replicate regions, where we would need to check the successor of the region, but *only* for the last iteration in the loop for the region.

So I think it is simpler to keep things as is, until the VPWidenIntOrFpInduction increments are modeled separately.

Ayal added inline comments.Jan 5 2022, 1:41 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8962

Fold it under the above "if (IsVPlanNative)"?

9424

Make sure live-out instruction Red appears before the latch?

llvm/lib/Transforms/Vectorize/VPlan.cpp
743

IsVPlanNative? Perhaps it should use a subclass of VPRegionBlock which produces getHeader()/getExit() by bumping from Entry/Exit, in one place.

936

drop "Remove the Unreachable terminator"

938

Fine. If this too-early sinking of LastInduction is the only reason for creating the latch block up-front, then let's remember (TODO?) to remove the latter once the former is modelled with a recipe properly placed in the Exit VPBB, next to ExitCheckAndBranch.

944

moveBefore(LastBranch) ?

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
180

Exit->empty()? Check above along with !Exit, to avoid taking prev of Exit->end()?

fhahn updated this revision to Diff 397693.Jan 5 2022, 1:18 PM
fhahn marked 7 inline comments as done.

Address latest comments, thanks! Also renamed ExitCheckAndBranch to BranchOnCount.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9424

Added an assert. Was that what you had in mind?

llvm/lib/Transforms/Vectorize/VPlan.cpp
743

Added an assert. Maybe it would be better to get rid of the difference in the native path & model the pre-header block outside the vector loop region? (as follow-up down the line)

936

Thanks, removed.

938

Added a todo.

944

That's better, thanks!

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
180

I added a lambda for the VPInstruction check and updated to use empty(). With the check above the error message may be a bit misleading or would need duplicating.

fhahn retitled this revision from [VPlan] Introduce and use ExitCheckAndBranch VPInstruction. to [VPlan] Introduce and use BranchOnCount VPInstruction..Jan 5 2022, 1:19 PM
fhahn edited the summary of this revision. (Show Details)
fhahn updated this revision to Diff 398428.Jan 9 2022, 5:19 AM

Rebase and ping :)

Ayal accepted this revision.Jan 11 2022, 4:19 AM

This looks good to me, adding minor optional suggestions.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9424

Yeah, or set the insert point to the end of LatchVPBB (prior to its BranchOnCount terminator) rather than the beginning of LatchVPBB, retaining original behavior; conceptually, the reduction recipe should be allowed to reside (earlier) in the latch.
Above comment "at the end of the latch" deserves updating.

llvm/lib/Transforms/Vectorize/VPlan.cpp
739

Add a comment that after creating the compare we now create the branch?

743

Added an assert. Maybe it would be better to get rid of the difference in the native path & model the pre-header block outside the vector loop region? (as follow-up down the line)

+1

746

Worth setting BasicBlock *ExitBlock = ... next to setting Header above, and place comment there?

754

This is fine, can alternatively use ReplaceInstWithInst() as does VPBranchOnMask::execute.

936

typo: branchand

998–1003

Now we can assert number of users is positive?

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
180

Spit out "exit block must not be empty" if Exit.empty() and check the inlined lambda afterwards?

This revision is now accepted and ready to land.Jan 11 2022, 4:19 AM
fhahn updated this revision to Diff 399290.Jan 12 2022, 4:54 AM
fhahn marked 7 inline comments as done.

Address latest comments, thanks! Also rebased. I am planning to commit this once the pre-commit tests come back OK.

fhahn added inline comments.Jan 12 2022, 4:55 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9424

Updated the comment, thanks!

llvm/lib/Transforms/Vectorize/VPlan.cpp
754

I left it as is for now.

936

fixed, thanks

998–1003

There's still an edge case where this assert may not hold unfortunately: VPlan-stress testing, which exits before the recipes are introduced. Potentially buuildHCFG could already add the branch & mask recipe, but for now I left it as is.

This revision was landed with ongoing or failed builds.Jan 12 2022, 5:44 AM
This revision was automatically updated to reflect the committed changes.

This commit somehow causes the "unroll.disable" metadata to be dropped from the vector epilogue loops. For example:

> cat tmp.ll
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

; Function Attrs: nounwind
define dso_local void @f1(float* noalias %aa, float* noalias %bb, float* noalias %cc, i32 signext %N) #0 {
entry:
  %cmp1 = icmp sgt i32 %N, 0
  br i1 %cmp1, label %for.body.preheader, label %for.end

for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext i32 %N to i64
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds float, float* %bb, i64 %indvars.iv
  %0 = load float, float* %arrayidx, align 4
  %arrayidx2 = getelementptr inbounds float, float* %cc, i64 %indvars.iv
  %1 = load float, float* %arrayidx2, align 4
  %add = fadd fast float %0, %1
  %arrayidx4 = getelementptr inbounds float, float* %aa, i64 %indvars.iv
  store float %add, float* %arrayidx4, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.body, label %for.end.loopexit

for.end.loopexit:                                 ; preds = %for.body
  br label %for.end

for.end:                                          ; preds = %for.end.loopexit, %entry
  ret void
}
opt -passes=loop-vectorize ./tmp.ll -S -o out.ll -epilogue-vectorization-force-VF=2
> grep "!" out.ll
  br i1 %132, label %middle.block, label %vector.body, !llvm.loop !0
  br i1 %144, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !2
  br i1 %exitcond, label %for.body, label %for.end.loopexit.loopexit, !llvm.loop !3
!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.isvectorized", i32 1}
!2 = distinct !{!2, !1}
!3 = distinct !{!3, !4, !1}
!4 = !{!"llvm.loop.unroll.runtime.disable"}

Previously we had:

...
  br i1 %144, label %vec.epilog.middle.block, label %vec.epilog.vector.body, !llvm.loop !2
...
!2 = distinct !{!2, !3, !1}
!3 = !{!"llvm.loop.unroll.runtime.disable"}
fhahn added a comment.Jan 16 2022, 5:19 AM

This commit somehow causes the "unroll.disable" metadata to be dropped from the vector epilogue loops. For example:

Thanks for the heads-up. Looks like unfortunately the relevant check lines to guard against the regression have been removed earlier. It should be fixed by 070d1034da87. I also tried to restore the original check lines removed by 278aa65cc495

This commit somehow causes the "unroll.disable" metadata to be dropped from the vector epilogue loops. For example:

Thanks for the heads-up. Looks like unfortunately the relevant check lines to guard against the regression have been removed earlier. It should be fixed by 070d1034da87. I also tried to restore the original check lines removed by 278aa65cc495

Thanks for the fix and the reviving of the MD checks!