This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Optimise away the icmp when tail-folding for some low trip counts
AbandonedPublic

Authored by david-arm on Mar 17 2022, 3:38 AM.

Details

Summary

For low trip counts the vectoriser will attempt to create a single
predicated loop that folds the scalar tail into the vector body. For
some combinations of the trip count and the VF it is possible to
determine at compile time if there will only be a single vector
iteration. If so, we can avoid creating the comparison at the end of
the loop and just always branch to the loop exit. This improves the
code quality for smaller loops with low trip counts because the
compare + branch add a relatively high cost to the loop.

This optimisation may also apply for unpredicated vector loops with
low trip counts too, hence the change in test X86/pr42674.ll.

Diff Detail

Event Timeline

david-arm created this revision.Mar 17 2022, 3:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2022, 3:38 AM
david-arm requested review of this revision.Mar 17 2022, 3:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2022, 3:38 AM

Thanks for this patch Dave, I've left a few comments.

Perhaps it's worth highlighting in the commit message that this is more of a problem for scalable vectors, where the branch/compare isn't constant folded or instcombined away that easily as it is for fixed-width vectors.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8809

Initially I wanted to suggest creating a new (unconditional) branch instruction here, but at this point you don't know which VF will be chosen for the given VPlan, so we have to defer this decision until codegen.

8812

There's nothing wrong with it, but perhaps it's a bit unfortunate this has to be passed into the VPInstruction as an operand. It seems more like a bit of useful knowledge of the loop that the VPInstruction could use to optimise the branch, rather than an actual operand for the conceptual BranchOnCount intrinsic. If the trip-count of the loop is constant, maybe we can store that information as a piece of state somewhere.

@fhahn what would be the desired place to add such information about the loop? My understanding was that the uses of the InnerLoopVectorizer in VPlan were being phased out, is that right?

llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

nit: How about

Value *ConstCmp = nullptr;
if (auto *C = dyn_cast<ConstantInt>(TC))
  if (C->getZExtValue() <= State.UF * State.VF.getKnownMinValue())
    ConstCmp = Builder.getInt1(true);
Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);
978

I noticed some of this code here has been removed and I'm not sure if this change is then still required. In any case this patch needs a rebase.

david-arm added inline comments.Apr 27 2022, 8:49 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8812

Yeah I'm not terribly happy about adding this as an operand, but I wasn't sure how else to pass this. I could possibly add a constant trip count to the VPTransformState, which is available when generating the instruction?

llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

It would have to be something like this I think:

Value *ConstCmp = nullptr;
if (auto *C = dyn_cast<ConstantInt>(TC)) {
  uint64_t TCVal = C->getZExtValue();
  if (TCVal && TCVal  <= State.UF * State.VF.getKnownMinValue())
    ConstCmp = Builder.getInt1(true);
}
Value *Cond = ConstCmp ? ConstCmp : Builder.CreateICmpEQ(IV, VTC);

because I'm pretty sure I found a test where the trip count was actually defined as a zero constant. I'm happy to change it to the code above, but not sure it looks much better to be honest. :)

978

Yeah I noticed that too. Will fix in a new patch!

sdesmalen added inline comments.Apr 27 2022, 8:59 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

If the trip-count is zero, wouldn't true be the correct value for the condition?

fhahn added inline comments.Apr 27 2022, 2:05 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

Hmm, I am not sure if doing this late in codegen is the right place. If we simplify the condition and effectively remove the loop, it would be good to remove the branch-on-count recipe directly in VPlan. I *think* we should have almost everything needed to do that already in place. Let me check.

david-arm added inline comments.May 10 2022, 5:30 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

Hi @fhahn, have you had a chance to look into this at all?

fhahn added inline comments.May 16 2022, 12:43 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

Unfortunately not yet, I've still got a backlog of other things to work through :(. Hopefully I'll be able to check this week, otherwise I think we should proceed with this patch next week.

david-arm added inline comments.May 20 2022, 7:10 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

So the problem with treating zero-trip counts the same as non-zero trip counts is that the loop vectoriser considers loops like this as having a zero trip count:

entry:
  br label %loop.body

loop.body:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop.body ]
  %a = extractvalue { i64, i64 } %sv, 0
  %b = extractvalue { i64, i64 } %sv, 1
  %addr = getelementptr i64, i64* %dst, i32 %iv
  %add = add i64 %a, %b
  store i64 %add, i64* %addr
  %iv.next = add nsw i32 %iv, 1
  %cond = icmp ne i32 %iv.next, 0
  br i1 %cond, label %loop.body, label %exit

exit:
  ret void

even though this loop is actually going to execute UINT32_MAX times before finally the IV overflows back to 0! This came from a real test:

Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll

If I fold away the comparison and always branch out of the loop then I've changed the behaviour of the original scalar loop. So I think we either have to:

  1. Fix the vectoriser to report the trip count as UINT32_MAX for loops like this, or
  2. Only apply my optimisation for low trip counts > 0
david-arm updated this revision to Diff 431302.May 23 2022, 1:35 AM
  • I've removed the need for a separate TripCount VPValue in the VPlan class because we're always going to need the original scalar trip count, and the value is the same for each Part anyway. Now we just store a copy in the VPTransformState so that the execute() functions can access it.
david-arm marked 7 inline comments as done.May 23 2022, 1:36 AM
fhahn added inline comments.May 30 2022, 2:43 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
750–760

Took a bit longer, but D126680 is what simplification directly on VPlan would look like. It depends on a few upcoming improvements though.

david-arm abandoned this revision.Jun 7 2022, 5:48 AM

Already fixed by an alternative implementation - D126680