This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve codegen for vectorised loops with two active lane masks
Changes Planned · Public

Authored by david-arm on Mar 3 2023, 5:34 AM.

Details

Summary

When vectorising loops using both tail-folding and interleaving, we end
up with two back-to-back active.lane.mask intrinsic calls. Unfortunately,
this leads to poor codegen such as:

.LBB0_1:
  ...
  whilelo p1.b, x11, x1
  cset, mi
  whilelo p0.b, x12, x1
  tbnz ..., #0, .LBB0_1

This is because AArch64InstrInfo::optimizeCondBranch bails out if it
finds a flag-setting operation between a CSINC (the cset above is an
alias of CSINC) and a TBNZW machine instruction. However, in these cases
nothing depends on the flags set by the second whilelo, so it is safe to
move it above the first whilelo.

I've changed AArch64InstrInfo::optimizeCondBranch to support a single
flag-setting operation between the CSINC and the TBNZW, provided we can
prove it's safe to move it above the first flag-setting op.
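
To make the "safe to move" condition concrete, here is a minimal standalone sketch of the kind of check involved. It is not the code from this patch and does not use LLVM's MachineInstr API: ToyInst, canHoistAbove, the FlagsLiveOut parameter and the "nzcv" register name are all hypothetical and exist only for illustration. The idea is that the second flag-setting op may be hoisted above the first if its flags are dead and it has no register dependency on anything it would jump over.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of an instruction: only the registers it reads and
// writes. "nzcv" stands for the condition flags.
struct ToyInst {
  std::string Name;
  std::set<std::string> Defs;
  std::set<std::string> Uses;
};

static bool overlaps(const std::set<std::string> &A,
                     const std::set<std::string> &B) {
  for (const std::string &R : A)
    if (B.count(R))
      return true;
  return false;
}

// Returns true if Block[SecondIdx] (the second flag-setting op) could be moved
// to just before Block[FirstIdx] (the first flag-setting op).
// FlagsLiveOut is an assumption of this sketch: whether nzcv is read after
// the end of the block.
bool canHoistAbove(const std::vector<ToyInst> &Block, std::size_t FirstIdx,
                   std::size_t SecondIdx, bool FlagsLiveOut) {
  assert(FirstIdx < SecondIdx && SecondIdx < Block.size());
  const ToyInst &First = Block[FirstIdx];
  const ToyInst &Second = Block[SecondIdx];

  // Both ops must write the flags, otherwise there is nothing to reorder.
  if (!First.Defs.count("nzcv") || !Second.Defs.count("nzcv"))
    return false;

  // 1. The flags written by Second must be dead: no reader before the next
  //    writer, and not live out of the block.
  bool Redefined = false;
  for (std::size_t I = SecondIdx + 1; I < Block.size() && !Redefined; ++I) {
    if (Block[I].Uses.count("nzcv"))
      return false;
    Redefined = Block[I].Defs.count("nzcv") != 0;
  }
  if (!Redefined && FlagsLiveOut)
    return false;

  // 2. Second must not depend on, or clobber, anything it is hoisted over.
  //    Its dead nzcv def is ignored: after the move First overwrites it, so
  //    readers between First and Second still see First's flags.
  std::set<std::string> SecondDefs = Second.Defs;
  SecondDefs.erase("nzcv");
  for (std::size_t I = FirstIdx; I < SecondIdx; ++I) {
    const ToyInst &Mid = Block[I];
    if (overlaps(Second.Uses, Mid.Defs) || // read-after-write
        overlaps(SecondDefs, Mid.Uses) ||  // write-after-read
        overlaps(SecondDefs, Mid.Defs))    // write-after-write
      return false;
  }
  return true;
}
```

Applied to a model of the loop above (whilelo p1, cset, whilelo p0, tbnz), the check should succeed as long as nothing after the second whilelo reads the flags. The real transformation would also have to account for anything this toy model omits, such as implicit operands and liveness across the loop back-edge.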

Event Timeline

david-arm created this revision. · Mar 3 2023, 5:34 AM
Herald added a project: Restricted Project. · Mar 3 2023, 5:34 AM
david-arm requested review of this revision. · Mar 3 2023, 5:34 AM
Herald added a project: Restricted Project. · Mar 3 2023, 5:34 AM
david-arm edited the summary of this revision. · Mar 3 2023, 5:35 AM
david-arm edited the summary of this revision.
david-arm planned changes to this revision. · Jun 14 2023, 7:26 AM

I'm no longer sure if this is the best approach, so I'm putting this patch to one side for now.