This is an archive of the discontinued LLVM Phabricator instance.

[PoC][WIP] Add an AArch64 specific pass for loop idiom recognition
Needs ReviewPublic

Authored by david-arm on Aug 18 2023, 9:31 AM.

Details

Reviewers
kmclaughlin
Summary

This pass looks for loops such as the following:

while (i != max_len)
    if (a[i] != b[i])
        break;

Although similar to a memcmp, this is slightly difference because instead of returning
the difference between the values of the first non-matching pair of bytes, it returns
the index of the first mismatch. As such, we are not able to lower this to a memcmp call.
Replacing this pattern with a specialised predicated SVE loop gives a significant
performance improvement for AArch64.

This patch introduces a new pass which identifies this pattern and replaces it with the
SVE loop. It is intended as a short-term solution until this is handled in the vectoriser.

A new intrinsic is created in this patch for counting the trailing zero elements in a
vector which has generic lowering in SelectionDAGBuilder. For AArch64 where SVE is
enabled, this is replaced with brkb & cntp instructions.

Patch co-authored by Kerry McLaughlin (@kmclaughlin) and David Sherwood (@david-arm)

Note: This is a work in progress, see discussion on Discourse:
https://discourse.llvm.org/t/aarch64-target-specific-loop-idiom-recognition/72383

Diff Detail

Event Timeline

kmclaughlin created this revision.Aug 18 2023, 9:31 AM
kmclaughlin requested review of this revision.Aug 18 2023, 9:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 18 2023, 9:31 AM
Matt added a subscriber: Matt.Aug 18 2023, 9:35 AM
kmclaughlin edited the summary of this revision. (Show Details)Aug 18 2023, 9:36 AM
kmclaughlin added a subscriber: david-arm.
craig.topper added inline comments.
llvm/include/llvm/IR/Intrinsics.td
2180 ↗(On Diff #551549)

I wonder if something like "find first nonzero element" would be better?

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7467 ↗(On Diff #551549)

changeVectorElementType doesn't work if the source type is an MVT and the resulting type is not an MVT. Probably better to use getVectorVT. There have been two recent bug fixes for something like this https://reviews.llvm.org/D157392 and 512a6c50e87c1956c028daf3317b07b3aa0e309f

efriedma added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7481 ↗(On Diff #551549)

Is the upper bound here guaranteed to fit into an i64?

7487 ↗(On Diff #551549)

This multiply can overflow?

7488 ↗(On Diff #551549)

We never want to increase EltWidth beyond the width of the result, I think? (If width of the vector doesn't fit into the return value of cttz.elts, is the result poison, or something else?)

7504 ↗(On Diff #551549)

Is there some reason to use SMAX instead of UMAX? It seems to complicate reasoning about the sign bit.

llvm/lib/Passes/PassBuilderPipelines.cpp
622 ↗(On Diff #551549)

Did you really mean to remove LoopIdiomRecognize and replace it with a second run of IndVarSimplify?

I'm not sure why this patch requires messing with the default pass pipeline.

kmclaughlin marked an inline comment as done.Aug 31 2023, 7:49 AM
kmclaughlin added a subscriber: eli.friedman.
kmclaughlin added inline comments.
llvm/include/llvm/IR/Intrinsics.td
2180 ↗(On Diff #551549)

This was something we considered, but we wanted to add an intrinsic which mirrors the behaviour of the existing cttz intrinsic. In D159283 I've added a second operand to indicate whether the result is poison if the first argument is all zero, similar to cttz.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7467 ↗(On Diff #551549)

Thank you @craig.topper, this has been addressed in D159283.

7481 ↗(On Diff #551549)

Hi @eli.friedman, I believe it is safe to assume that the upper bound will fit into an i64. Calls to the vscale intrinsic are often 64 bits when generated by the vectoriser, and the getVScaleRangeMin/getVScaleRangeMax functions themselves are returning unsigned types.

7487 ↗(On Diff #551549)

I've tried to address this on D159283 by calculating the smallest possible type using the umul_sat operation of ConstantRange.

7504 ↗(On Diff #551549)

There was no reason for choosing SMAX, I have updated this to use UMAX instead.

llvm/lib/Passes/PassBuilderPipelines.cpp
622 ↗(On Diff #551549)

Removing the LoopIdiomRecognize is a mistake, I only intended to move the IndVarSimplify pass after invokeLateLoopOptimizationsEPCallbacks so that the new pass runs as close to LoopIdiomRecognize as possible.

craig.topper added inline comments.Sep 1 2023, 8:35 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
282 ↗(On Diff #551549)

I think you want m_Specific(Index) instead of m_Instruction. m_Instruction will match any instruction and overwrite Index

craig.topper added inline comments.Sep 1 2023, 9:10 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
159 ↗(On Diff #551549)

Do we need to check skipLoop for opt-bisect-limit?

206 ↗(On Diff #551549)

Doesn't being a "preheader" guarantee it's not conditional?

306 ↗(On Diff #551549)

This doesn't guarantee the loads are loading i8. The load have their own type and don't have to match the GEP result type.

321 ↗(On Diff #551549)

m_Instruction -> m_Specific

326 ↗(On Diff #551549)

Isn't IdxA, zext(Index)? So Index must dominate IdxA.

david-arm commandeered this revision.Sep 7 2023, 5:20 AM
david-arm added a reviewer: kmclaughlin.

As @kmclaughlin mentioned on D159283 she will be away for a few weeks. However, in the meantime I would like to address some of the comments on this patch related specifically to bug fixes and also update the patch to use the latest version of the intrinsic in D159283. Unfortunately, the only way I can do this is to commandeer the patch temporarily!

david-arm edited the summary of this revision. (Show Details)Sep 7 2023, 5:21 AM
david-arm updated this revision to Diff 556134.Sep 7 2023, 5:31 AM
david-arm marked an inline comment as done.
  • Fixed some bugs found by @craig.topper when recognising the byte mismatch idiom. This also required updating one of the tests in Transforms/LoopIdiom/AArch64/byte-compare-index.ll that was using the wrong index for comparison.
  • Reinstated the generic loop idiom recognise pass.
  • Add new patterns to ensure we use incp instead of cntp+add.
david-arm marked 3 inline comments as done.Sep 7 2023, 5:32 AM
craig.topper added inline comments.Sep 7 2023, 5:22 PM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
289 ↗(On Diff #556134)

Do we know for sure that WhileBB is the block in the loop? Could EndBB above be the backedge?

295 ↗(On Diff #556134)

Do we need to check that TrueBB is the header?

328 ↗(On Diff #556134)

The IdxA != IdxB check is identical to the previous if

333 ↗(On Diff #556134)

IdxA is is a zero extend of Index according to the previous if, so doesn't Index always dominate IdxA?

274 ↗(On Diff #551549)

why is this needed?

Hi @craig.topper, thanks again for the review comments! I'll take a look at your comments regarding blocks being in the loop and see if there is a problem or not. It's possible that the canonical form of a loop allows us to make certain assumptions, but I'll double check.

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
274 ↗(On Diff #551549)

There is no fundamental reason why the checks are needed, but it made the vector implementation of the mismatch algorithm simpler since we didn't have to worry about poison during signed or unsigned overflow. For the cases we were interested in (unsigned 32-bit addition in C) there were no nsw or nuw flags so we thought for now we'd restrict it to just these cases.

It probably makes sense to relax this restriction in future, but it will require carefully rewriting the vectorised implementation to be safe with regards poison/overflow, and ensuring there are no performance regressions for the loops we care about.

david-arm updated this revision to Diff 556252.Sep 8 2023, 6:39 AM
  • Added more checks that the icmp predicates are correct (EQ) and ensure that the true/false block ordering for the branches are what we expect.
  • Added more negative test cases for bad icmps, bad branches and bad load types.
david-arm marked 2 inline comments as done.Sep 8 2023, 6:42 AM

Again, I've only addressed bug fixes in this new update - I'll let @kmclaughlin deal with any other comments once she is back!

llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
295 ↗(On Diff #556134)

Both this and the above comment about WhileBB are excellent spots. I've fixed these now - thanks @craig.topper. :)

craig.topper added inline comments.Sep 8 2023, 10:28 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
274 ↗(On Diff #551549)

Isn’t it always safe to drop the flags if needed?

craig.topper added inline comments.Sep 18 2023, 11:35 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
355 ↗(On Diff #556252)

mention of "call" and "callsite" here. there was no call involved in the original code.

377 ↗(On Diff #556252)

Why do we only update DT for this block and not the others?

567 ↗(On Diff #556252)

Why do we need a phi if the incoming values are the same?

craig.topper added inline comments.Sep 19 2023, 10:57 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
515 ↗(On Diff #556252)

Do we need to check for inBounds on the original GEPs before we can set it here?

520 ↗(On Diff #556252)

Do we need to check for inBounds on the original GEPs before we can set it here?

craig.topper added inline comments.Sep 19 2023, 11:02 AM
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
159 ↗(On Diff #551549)

Maybe skipLoop is handled directly by the new pass manager? I'm too used to old pass manager.

david-arm updated this revision to Diff 557230.Sep 22 2023, 1:26 AM
david-arm marked an inline comment as done.
  • Only mark the new GEPs as 'inbounds' if the original GEPs were too.
  • Update the dominator tree for all newly inserted blocks.
  • Remove pointless PHI in scalar loop preheader block.
david-arm marked 4 inline comments as done.Sep 22 2023, 1:30 AM

Thanks @craig.topper for spotting the bugs with the dominator tree and setting GEPs inbound. I've fixed those, plus removed the redundant PHI from the scalar loop preheader. I still see the same performance improvements for the loops we care about. I realise I haven't addressed all of your comments - we will try to address them later!

Does the new pass need to check that SVE is enabled before doing the transform?

david-arm updated this revision to Diff 558082.Nov 13 2023, 8:42 AM
  • Rebased the patch to reduce the diff.
  • Added checks so that we only attempt the transformation if the target supports scalable vectors and we know the minimum page size.
  • Renamed the class to AArch64LoopIdiomTransform to better reflect what the pass is doing, i.e. transforming an idiom from one form to another.
  • Removed some of the pipeline changes that are no longer necessary.
  • Added a new RUN line to byte-compare-index.ll to show that in the absence of SVE we don't do the transform.

Does the new pass need to check that SVE is enabled before doing the transform?

Hi @craig.topper, good point! I've added a check that the target supports scalable vectors and that we know the minimum page size. Although the pass currently lives in lib/Target/AArch64 it is generic enough that it could be moved into a common directory and used by other targets.

It might make sense to move this patch into github soon for a full review, even though I prefer Phabricator. :)

david-arm updated this revision to Diff 558096.Nov 14 2023, 5:59 AM
  • Address more review comments
david-arm marked 7 inline comments as done.Nov 14 2023, 6:01 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64LoopIdiomRecognize.cpp
159 ↗(On Diff #551549)

For the legacy pass manager we do!

206 ↗(On Diff #551549)

You're right. I've simplified the logic here to assume a canonical form, particularly since we rejected loops without preheaders in AArch64LoopIdiomRecognize::run