This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Coalesce ptrue intrinsic calls where possible
ClosedPublic

Authored by joechrisellis on Jan 7 2021, 4:49 AM.

Details

Summary

It is possible to eliminate redundant calls to the SVE ptrue intrinsic.
For example, suppose that we have two SVE ptrue intrinsic calls P1 and
P2. If P1 is at least as wide as P2, then P2 can be rewritten as a
reinterpret of P1 using the SVE reinterpret intrinsics.
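
A minimal sketch of the rewrite, assuming the SV_ALL pattern (i32 31); the value names are illustrative and not taken from the patch or its tests:

Before:

%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)

After (%p1's active lanes are a superset of %p2's, so %p2 is derived from %p1 instead):

%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%t = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %p1)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %t)

Both forms leave %p2 fully active, but only one ptrue intrinsic call remains in the second.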

Coalescing ptrue intrinsics can result in fewer ptrue instructions in
the codegen, and is conducive to better analysis further down the line.

This commit extends the aarch64-sve-intrinsic-opts pass to support
coalescing ptrue intrinsic calls.

Diff Detail

Event Timeline

joechrisellis created this revision.Jan 7 2021, 4:49 AM
joechrisellis requested review of this revision.Jan 7 2021, 4:49 AM
Herald added a project: Restricted Project.Jan 7 2021, 4:49 AM

Fix broken test.

bsmith requested changes to this revision.Jan 7 2021, 5:22 AM

I'm not sure this patch is correct, as it's not taking into account how the predicates are used. For example, in the following case your patch replaces the ptrue_b32() predicate of the <vscale x 8 x i16> load at %5 with a ptrue_b16(), which changes the behaviour.

declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 immarg)
declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)

declare <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1>, i16*)

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1>)

define <vscale x 8 x i16> @coalesce_test_basic(i32* %addr1, i16* %addr2) {
  %1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
  %2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
  %3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %2)

  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1> %1, i32* %addr1)
  %5 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %3, i16* %addr2)

  %6 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %7 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %6, i16* %addr2)

  ret <vscale x 8 x i16> %7
}
This revision now requires changes to proceed.Jan 7 2021, 5:22 AM
fhahn added a subscriber: fhahn.Jan 7 2021, 5:24 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

_Actually_ fix broken test.

bsmith added a comment.Jan 7 2021, 7:01 AM

After much discussion, I'm actually incorrect in this assertion, as I mistakenly thought that the ptrues were ending up being passed straight into the load rather than through the existing svbool conversions. That said, this case (with %4, %5 and %7 made non-redundant) does now produce worse codegen with this pass:

Currently:

ptrue   p0.s
ptrue   p1.h
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p0/z, [x1]
ld1h    { z8.h }, p1/z, [x1]
...

With patch:

ptrue   p0.h
ptrue   p1.s
ptrue   p2.b
and     p1.b, p2/z, p0.b, p1.b
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p1/z, [x1]
ld1h    { z8.h }, p0/z, [x1]
...

I do wonder whether this should be an MIR pass rather than an IR one?

I'm not sure this patch is correct, as it's not taking into account how the predicates are used. For example, in the following case your patch replaces the ptrue_b32() predicate of the <vscale x 8 x i16> load at %5 with a ptrue_b16(), which changes the behaviour.

declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 immarg)
declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)

declare <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1>, i16*)

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1>)

define <vscale x 8 x i16> @coalesce_test_basic(i32* %addr1, i16* %addr2) {
  %1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
  %2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
  %3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %2)

  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1> %1, i32* %addr1)
  %5 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %3, i16* %addr2)

  %6 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %7 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %6, i16* %addr2)

  ret <vscale x 8 x i16> %7
}

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on an instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. On the other hand, the optimisation implemented by this pass requires you to look at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

Fix poor codegen found in @bsmith's example.

Hi @bsmith,

The poor codegen in that example is happening because we're factoring out a ptrue which is immediately converted to a 'sparse' predicate via a sequence of SVE reinterpret intrinsics. E.g.:

%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
// <1, 1, 1, 1>
%2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
// <1, 0, 0, 0, 1, 0 ,0, 0, 1, 0, 0, 0, 1, 0, 0, 0>
%3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %2)
// <1, 0, 1, 0, 1, 0, 1, 0> ('sparse' predicate)

In these specific circumstances it doesn't make sense to eliminate the ptrue, because we're just going to create an even longer chain which cannot be reduced by the SVEIntrinsicOpts.cpp pass (also see D94074, which extends this pass to reduce long conversion chains). I've modified this pass to account for these situations, and the codegen that we get now is identical to before. I've added this case to the tests, too.

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
87 ↗(On Diff #315737)

Line 87: Nit: extra blank line.
Line 86: Currently the name suggests a property of the ptrue itself, and the word "simple" could conceivably have multiple meanings. The condition depends on how the ptrue is used, not on the ptrue itself. Suggest isPTrueNarrowed or similar instead.

104 ↗(On Diff #315737)

Clarify what is being zeroed.

113 ↗(On Diff #315737)

Another possible comment clarification.

joechrisellis marked 3 inline comments as done.

Address @peterwaller-arm's comments.

  • s/isSimplePTrue/isPTruePromoted -- intent of the function is now clearer.
  • Remove extra blank.
  • Clarify comments.
llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
144 ↗(On Diff #315748)

still using the word "simple".

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
78 ↗(On Diff #315748)

still using the word "simple".

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Replace 'simple' with 'promoted'.

Remove uses of terminology such as 'wide'/'wider'/'widest', opting instead for 'encompassing'.

I'd like to see one-liner comments on each test explaining the spirit of what is being tested.

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
14 ↗(On Diff #315772)

Non-grammatical text "If P1 is at encompasses P2". Also, the definition of "encompasses" is missing here; I think this text should stand alone. I'd talk about supersets of active lane bits or similar instead.

214 ↗(On Diff #315772)

If doing this, the comment should make clear: Why store upfront? Why not runOnFunction(f) for each function f?

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
5 ↗(On Diff #315772)

Just checking: is this comment applicable here? I'm thinking that since this fix doesn't relate to fixing any TypeSize warnings, it doesn't apply.

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Rename pass to aarch64-sve-coalesce-ptrue-intrinsics to make it clear that this deals specifically with ptrue intrinsics.
  • Added more detail into header comment.
  • Clarifications in both the pass and tests.
  • Removed unnecessary dominator tree code.

Coalesce ptrue intrinsics that use the SV_POW2 pattern as well.

fhahn added a comment.Jan 13 2021, 1:31 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on an instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. On the other hand, the optimisation implemented by this pass requires you to look at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

But SVEIntrinsicOpts.cpp could just apply the same optimization as this pass, after the per-instruction optimizations? The main thing I am worried about is not this pass in particular, but that we will end up with a lot of separate passes, each just optimizing a single SVE intrinsic.

Each pass has a cost, e.g. each pass adds another option to enable/disable, needs to be scheduled, needs to iterate over all declarations to find the right intrinsic ID, and needs the whole pass setup boilerplate. Add to that the confusion that there is a pass that sounds general (SVEIntrinsicOpts) but doesn't optimize SVE intrinsics in general. While it may be cleaner/easier to implement optimizations for intrinsics in isolation, I am not sure that justifies the additional cost.

I have not looked at the specific intrinsic optimizations in detail, but from a high-level view it seems like they may benefit from working together, e.g. SVECoalescePTrueIntrinsicsPass might remove some instructions, which then enables further optimizations by SVEIntrinsicOpts, which in turn enables further optimizations by SVECoalescePTrueIntrinsicsPass. Separate passes make this much harder or impossible to handle.

llvm/lib/Target/AArch64/SVECoalescePTrueIntrinsics.cpp
187 ↗(On Diff #316339)

do we need to capture everything here?

264 ↗(On Diff #316339)

Is it possible to use something like for (User *U : F.users()), or is the manual iterator management needed?

265 ↗(On Diff #316339)

only use dyn_cast if you actually check if the result is nullptr. Otherwise, just use cast, which asserts that the result is != nullptr.

Also, I think the user could also be a constant expression, so the cast would fail?

joechrisellis marked 3 inline comments as done.

Address @fhahn's comments.

  • Get rid of new pass, reimplement within aarch64-sve-intrinsic-opts.

@fhahn: thanks for your comments -- you have persuaded me. 🙂

I've folded it all into SVEIntrinsicOpts.cpp, and implemented the rest of your suggestions.

bsmith resigned from this revision.Jan 19 2021, 6:10 AM

Resigning as reviewer to remove needs changes flag

bsmith added a subscriber: bsmith.Jan 19 2021, 6:10 AM
fhahn added a comment.Jan 19 2021, 6:27 AM

Thanks for the update; the new structure looks good to me. I left a few more stylistic comments, but it would be good if someone more familiar with the details of the transform could take a look as well.

llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
12

nit: The wording seems a bit odd. Are there 'non-main' goals for the pass? Maybe just say something like "This pass performs the following optimizations: ...".

72

can you add comments explaining the scope at which optimizeIntrinsicCalls and optimizeFunctions operate?

130

nit: this could be something like
if (match(User, m_Intrinsic<Intrinsic::aarch64_sve_convert_to_svbool>(,

might save a line or two.

267

you just need to iterate over all instructions in a function, right? There's instructions(F), which does so directly.

277

nit: it would be simpler to read if the ! were moved inside?

joechrisellis marked 4 inline comments as done.

Address @fhahn's comments.

  • Style changes.
  • More comments.

Thanks for the comments @fhahn! 😄

llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
267

These are looking at basic-block-local ptrue intrinsic calls at the moment -- we're passing the basic block to coalescePTrueIntrinsicCalls below too. 🙂

CarolineConcatto added inline comments.
llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
472

nit

Could you update the commit message, as I think it's now out of date? For example, it still says "This commit introduces a new pass ...".

llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
151

Again, this check is only valid if both types are fixed width or both types are scalable.

174

Do we ever expect fixed width types here? If not, I wonder if you should cast to ScalableVectorType instead. The reason I mention this is that your comparison below only works when all are fixed width or all are scalable.

202

What if there are multiple ptrues with the same type as the most encompassing one? We'll be needlessly creating convert_from_svbool intrinsics here, I think. In this case we can just call replaceAllUsesWith(MostEncompassingPTrueVTy).

joechrisellis marked 3 inline comments as done.

Address @david-arm's comments.

  • Use ScalableVectorType instead of VectorType; we don't expect anything other than scalable vectors in this context.

Hi @david-arm, thanks for the review!

I have updated my local commit message, but unfortunately arc doesn't seem to propagate this change to Phabricator. I will do it manually. Good spot though!

llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp
202

Good question -- this case is actually captured in the test below, see coalesce_test_same_size. 😄

; Two calls to the SVE ptrue intrinsic which are both of the same size. In this case, one should be identified
; as redundant and rewritten as an SVE reinterpret of the other via the convert.{to,from}.svbool intrinsics.
; This introduces a redundant conversion which will then be eliminated.
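
For context, a minimal sketch of the shape such a test takes; the function below is illustrative rather than copied from sve-coalesce-ptrues.ll:

define <vscale x 8 x i16> @same_size_sketch(i16* %addr) {
  %1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %2 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %3 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %1, i16* %addr)
  %4 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %2, i16* %addr)
  %5 = add <vscale x 8 x i16> %3, %4
  ret <vscale x 8 x i16> %5
}

After coalescing, %2 becomes a same-type round trip of %1 through convert.to.svbool.nxv8i1 and convert.from.svbool.nxv8i1; that round trip is the redundant conversion mentioned in the comment above and is subsequently eliminated, leaving a single ptrue.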
joechrisellis edited the summary of this revision.Feb 3 2021, 3:19 AM
Matt added a subscriber: Matt.Feb 3 2021, 5:54 AM
david-arm accepted this revision.Feb 3 2021, 8:07 AM

LGTM!

This revision is now accepted and ready to land.Feb 3 2021, 8:07 AM
asl requested changes to this revision.Feb 4 2021, 12:32 PM
asl added a subscriber: asl.

It seems you pushed to master, not main.

This revision now requires changes to proceed.Feb 4 2021, 12:32 PM

@asl -- yes -- doh! Muscle memory. Sorry about that.

I can't delete the new master branch on GitHub, though -- presume I lack permissions or something.

Appreciate the heads up. :)

joechrisellis retitled this revision from [AArch64][SVE] Add SVE IR pass to coalesce ptrue intrinsic calls to [AArch64][SVE] Coalesce ptrue intrinsic calls where possible.Feb 5 2021, 2:42 AM
This revision was not accepted when it landed; it landed in state Needs Revision.Feb 5 2021, 2:45 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.