
[AArch64][SVE] Add SVE IR pass to coalesce ptrue intrinsic calls
Needs Review · Public

Authored by joechrisellis on Thu, Jan 7, 4:49 AM.

Details

Summary

It is possible to eliminate redundant calls to the SVE ptrue intrinsic.
For example, suppose that we have two SVE ptrue intrinsic calls, P1 and
P2. If P1 is at least as wide as P2, then P2 can be rewritten as a
reinterpret of P1 using the SVE reinterpret intrinsics.
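
As a rough illustration, here is a hand-written sketch of the rewrite (not
taken from the patch's tests), going through the convert.to/from.svbool
reinterpret intrinsics. Reinterpreting an all-true <vscale x 8 x i1> down to
<vscale x 4 x i1> still yields an all-true predicate, so only one ptrue needs
to be materialised:

; Before: two independent ptrue calls.
%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)

; After: the narrower %p2 is expressed as a reinterpret of the wider %p1.
%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %p1)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %sv)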

Coalescing ptrues can result in fewer ptrue instructions in the codegen,
and is conducive to better analysis further down the line.

This commit introduces a new pass, aarch64-sve-coalesce-ptrues, which
removes redundant ptrue intrinsic calls, replacing them with
reinterprets of existing 'wider' ptrue intrinsic calls where possible.

Diff Detail

Event Timeline

joechrisellis created this revision. · Thu, Jan 7, 4:49 AM
joechrisellis requested review of this revision. · Thu, Jan 7, 4:49 AM
Herald added a project: Restricted Project. · Thu, Jan 7, 4:49 AM

Fix broken test.

bsmith requested changes to this revision. · Thu, Jan 7, 5:22 AM

I'm not sure this patch is correct, as it's not taking into account how the predicates are used. For example, in the following case your patch replaces the ptrue_b32() predicate of the %5 8 x i16 load with a ptrue_b16(), which changes the behaviour.

declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 immarg)
declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)

declare <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1>, i16*)

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1>)

define <vscale x 8 x i16> @coalesce_test_basic(i32* %addr1, i16* %addr2) {
  %1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
  %2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
  %3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %2)

  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1> %1, i32* %addr1)
  %5 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %3, i16* %addr2)

  %6 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %7 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %6, i16* %addr2)

  ret <vscale x 8 x i16> %7
}
This revision now requires changes to proceed. · Thu, Jan 7, 5:22 AM
fhahn added a subscriber: fhahn. · Thu, Jan 7, 5:24 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

_Actually_ fix broken test.

bsmith added a comment. · Thu, Jan 7, 7:01 AM

After much discussion, I'm actually incorrect in this assertion, as I mistakenly thought that the ptrues were ending up being passed straight into the load rather than through the existing svbool conversions. That said, this case (with %4, %5 and %7 not being made redundant) does now produce worse codegen with this pass:

Currently:

ptrue   p0.s
ptrue   p1.h
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p0/z, [x1]
ld1h    { z8.h }, p1/z, [x1]
...

With patch:

ptrue   p0.h
ptrue   p1.s
ptrue   p2.b
and     p1.b, p2/z, p0.b, p1.b
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p1/z, [x1]
ld1h    { z8.h }, p0/z, [x1]
...

I do wonder whether this should be an MIR pass rather than an IR one?

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on a per-instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. The optimisation implemented by this pass, on the other hand, requires looking at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

Fix poor codegen found in @bsmith's example.

Hi @bsmith,

The poor codegen in that example is happening because we're factoring out a ptrue which is immediately converted to a 'sparse' predicate via a sequence of SVE reinterpret intrinsics. E.g.:

%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
; <1, 1, 1, 1>
%2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
; <1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0>
%3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %2)
; <1, 0, 1, 0, 1, 0, 1, 0> ('sparse' predicate)

In these specific circumstances it doesn't make sense to eliminate the ptrue because we're just going to create an even longer chain which cannot be reduced by the SVEIntrinsicOpts.cpp pass (also see D94074, which extends this pass to reduce long conversion chains). I've modified this pass to account for these situations, and the codegen that we get now is identical to before. I've added this case into the tests, too.
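
To make this concrete, here is a hand-written sketch (not taken from the patch's test file) of what rewriting %1 in terms of %6 in the example above would have produced; the value names are fresh, with %wide standing in for %6. This is the longer chain mentioned above, and it is what leads to the extra ptrue and the and in the earlier codegen:

; %6 from the example, now the only 'real' ptrue:
%wide = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
; %1 becomes a reinterpret of the wider predicate (still all-true):
%wide.sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %wide)
%narrow = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %wide.sv)
; The reinterprets that fed %3 now sit on top of that chain:
%narrow.sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %narrow)
%sparse = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %narrow.sv)
; %sparse is the same 'sparse' <1, 0, 1, 0, ...> predicate that %3 was before,
; but it now sits at the end of a longer chain that cannot be reduced.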

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
87 ↗(On Diff #315737)

Line 87: Nit: Extra blank.
Line 86: Currently the name suggests a property of the ptrue. The word "simple" could also conceivably have multiple meanings. The condition depends on how the ptrue is used, not "how the ptrue itself is". Suggest isPTrueNarrowed or similar instead.

104 ↗(On Diff #315737)

Clarify what is being zeroed.

113 ↗(On Diff #315737)

Another possible comment clarification.

joechrisellis marked 3 inline comments as done.

Address @peterwaller-arm's comments.

  • s/isSimplePTrue/isPTruePromoted -- intent of the function is now clearer.
  • Remove extra blank.
  • Clarify comments.
llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
144 ↗(On Diff #315748)

still using the word "simple".

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
78 ↗(On Diff #315748)

still using the word "simple".

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Replace 'simple' with 'promoted'.

Remove uses of terminology such as 'wide'/'wider'/'widest', opting instead for 'encompassing'.

I'd like to see one-liner comments on each test explaining the spirit of what is being tested.

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
14 ↗(On Diff #315772)

Non-grammatical text "If P1 is at encompasses P2". Also, definition of "encompasses" is missing here, I think this text should stand alone. I'd talk about supersets of active lane bits or similar instead.

214 ↗(On Diff #315772)

If doing this, the comment should make clear: Why store upfront? Why not runOnFunction(f) for each function f?

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
5 ↗(On Diff #315772)

Just checking: is this comment applicable here? I'm thinking that since this fix doesn't relate to fixing any TypeSize warnings, it doesn't apply.

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Rename pass to aarch64-sve-coalesce-ptrue-intrinsics to make it clear that this deals specifically with ptrue intrinsics.
  • Added more detail into header comment.
  • Clarifications in both the pass and tests.
  • Removed unnecessary dominator tree code.

Coalesce ptrue intrinsics that use the SV_POW2 pattern as well.

fhahn added a comment. · Wed, Jan 13, 1:31 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on a per-instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. The optimisation implemented by this pass, on the other hand, requires looking at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

But SVEIntrinsicOpts.cpp could just apply the same optimization as this pass, after the per-instruction optimizations? The main thing I am worried about is not this pass in particular, but that we will end up with a lot of separate passes, each just optimizing a single SVE intrinsic.

Each pass has a cost, e.g. each pass adds another option to enable/disable, needs to be scheduled, needs to iterate over all declarations to find the right intrinsic ID, and needs the whole pass setup boilerplate. Add to that the confusion that there is a pass that sounds general (SVEIntrinsicOpt) but doesn't optimize SVE intrinsics in general. While it may be cleaner/easier to implement optimizations for intrinsics in isolation, I am not sure that justifies the additional cost.

I have not looked at the specific intrinsic optimizations in detail, but from a high-level view it seems like they may benefit from working together, e.g. SVECoalescePTrueIntrinsicsPass might remove some instructions, which then enables further optimizations by SVEIntrinsicOpt, which in turn enables further optimizations by SVECoalescePTrueIntrinsicsPass. Separate passes make this much harder or impossible to handle.

llvm/lib/Target/AArch64/SVECoalescePTrueIntrinsics.cpp
187 ↗(On Diff #316339)

do we need to capture everything here?

264 ↗(On Diff #316339)

Is it possible to use something like for (User *U : F.users()), or is the manual iterator management needed?

265 ↗(On Diff #316339)

only use dyn_cast if you actually check if the result is nullptr. Otherwise, just use cast, which asserts that the result is != nullptr.

Also, I think the user could also be a constant expression, so the cast would fail?

joechrisellis marked 3 inline comments as done.

Address @fhahn's comments.

  • Get rid of new pass, reimplement within aarch64-sve-intrinsic-opts.

@fhahn: thanks for your comments -- you have persuaded me. 🙂

I've folded it all into SVEIntrinsicOpts.cpp, and implemented the rest of your suggestions.