
[AArch64][SVE] Add SVE IR pass to coalesce ptrue intrinsic calls
Needs Review · Public

Authored by joechrisellis on Thu, Jan 7, 4:49 AM.

Details

Summary

It is possible to eliminate redundant calls to the SVE ptrue intrinsic.
For example, suppose that we have two SVE ptrue intrinsic calls, P1 and
P2. If P1 is at least as wide as P2, then P2 can be rewritten as a
reinterpret of P1 using the SVE reinterpret intrinsics.
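
As a rough illustration, here is a hand-written sketch of the rewrite (not
taken from the patch's tests), going through the convert.to/from.svbool
reinterpret intrinsics. Reinterpreting an all-true <vscale x 8 x i1> down to
<vscale x 4 x i1> still yields an all-true predicate, so only one ptrue needs
to be materialised:

; Before: two independent ptrue calls.
%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)

; After: the narrower %p2 is expressed as a reinterpret of the wider %p1.
%p1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
%sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %p1)
%p2 = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %sv)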

Coalescing ptrues can result in fewer ptrue instructions in the codegen,
and is conducive to better analysis further down the line.

This commit introduces a new pass, aarch64-sve-coalesce-ptrues, which
removes redundant ptrue intrinsic calls, replacing them with
reinterprets of existing 'wider' ptrue intrinsic calls where possible.

Diff Detail

Event Timeline

joechrisellis created this revision. · Thu, Jan 7, 4:49 AM
joechrisellis requested review of this revision. · Thu, Jan 7, 4:49 AM
Herald added a project: Restricted Project. · Thu, Jan 7, 4:49 AM

Fix broken test.

bsmith requested changes to this revision. · Thu, Jan 7, 5:22 AM

I'm not sure this patch is correct, as it's not taking into account how the predicates are used. For example, in the following case your patch replaces the ptrue_b32() predicate of the %5 8 x i16 load with a ptrue_b16(), which changes the behaviour.

declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 immarg)
declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)

declare <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1>, i16*)

declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
declare <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1>)

define <vscale x 8 x i16> @coalesce_test_basic(i32* %addr1, i16* %addr2) {
  %1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
  %2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
  %3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %2)

  %4 = call <vscale x 4 x i32> @llvm.aarch64.sve.ld1.nxv4i32(<vscale x 4 x i1> %1, i32* %addr1)
  %5 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %3, i16* %addr2)

  %6 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
  %7 = call <vscale x 8 x i16> @llvm.aarch64.sve.ld1.nxv8i16(<vscale x 8 x i1> %6, i16* %addr2)

  ret <vscale x 8 x i16> %7
}
This revision now requires changes to proceed. · Thu, Jan 7, 5:22 AM
fhahn added a subscriber: fhahn. · Thu, Jan 7, 5:24 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

_Actually_ fix broken test.

bsmith added a comment. · Thu, Jan 7, 7:01 AM

After much discussion, I'm actually incorrect in this assertion, as I mistakenly thought that the ptrues were ending up being passed straight into the load rather than through the existing svbool conversions. That said, this case (with %4, %5 and %7 not being made redundant) does now produce worse codegen with this pass:

Currently:

ptrue   p0.s
ptrue   p1.h
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p0/z, [x1]
ld1h    { z8.h }, p1/z, [x1]
...

With patch:

ptrue   p0.h
ptrue   p1.s
ptrue   p2.b
and     p1.b, p2/z, p0.b, p1.b
ld1w    { z0.s }, p0/z, [x0]
ld1h    { z1.h }, p1/z, [x1]
ld1h    { z8.h }, p0/z, [x1]
...

I do wonder whether this should be an MIR pass rather than an IR one?

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on a per-instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. The optimisation implemented by this pass, on the other hand, requires looking at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

Fix poor codegen found in @bsmith's example.

Hi @bsmith,

The poor codegen in that example is happening because we're factoring out a ptrue which is immediately converted to a 'sparse' predicate via a sequence of SVE reinterpret intrinsics. E.g.:

%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
; <1, 1, 1, 1>
%2 = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %1)
; <1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0>
%3 = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %2)
; <1, 0, 1, 0, 1, 0, 1, 0> ('sparse' predicate)

In these specific circumstances it doesn't make sense to eliminate the ptrue because we're just going to create an even longer chain which cannot be reduced by the SVEIntrinsicOpts.cpp pass (also see D94074, which extends this pass to reduce long conversion chains). I've modified this pass to account for these situations, and the codegen that we get now is identical to before. I've added this case into the tests, too.
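
To make this concrete, here is a hand-written sketch (not taken from the patch's test file) of what rewriting %1 in terms of %6 in the example above would have produced; the value names are fresh, with %wide standing in for %6. This is the longer chain mentioned above, and it is what leads to the extra ptrue and the and in the earlier codegen:

; %6 from the example, now the only 'real' ptrue:
%wide = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
; %1 becomes a reinterpret of the wider predicate (still all-true):
%wide.sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv8i1(<vscale x 8 x i1> %wide)
%narrow = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %wide.sv)
; The reinterprets that fed %3 now sit on top of that chain:
%narrow.sv = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %narrow)
%sparse = call <vscale x 8 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv8i1(<vscale x 16 x i1> %narrow.sv)
; %sparse is the same 'sparse' <1, 0, 1, 0, ...> predicate that %3 was before,
; but it now sits at the end of a longer chain that cannot be reduced.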

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
87 ↗(On Diff #315737)

Line 87: Nit: Extra blank.
Line 86: Currently the name suggests a property of the ptrue. The word "simple" could also conceivably have multiple meanings. The condition depends on how the ptrue is used, not "how the ptrue itself is". Suggest isPTrueNarrowed or similar instead.

104 ↗(On Diff #315737)

Clarify what is being zeroed.

113 ↗(On Diff #315737)

Another possible comment clarification.

joechrisellis marked 3 inline comments as done.

Address @peterwaller-arm's comments.

  • s/isSimplePTrue/isPTruePromoted -- intent of the function is now clearer.
  • Remove extra blank.
  • Clarify comments.
llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
144 ↗(On Diff #315748)

still using the word "simple".

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
78 ↗(On Diff #315748)

still using the word "simple".

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Replace 'simple' with 'promoted'.

Remove uses of terminology such as 'wide'/'wider'/'widest', opting instead for 'encompassing'.

I'd like to see one-liner comments on each test explaining the spirit of what is being tested.

llvm/lib/Target/AArch64/SVECoalescePTrues.cpp
14 ↗(On Diff #315772)

Non-grammatical text "If P1 is at encompasses P2". Also, definition of "encompasses" is missing here, I think this text should stand alone. I'd talk about supersets of active lane bits or similar instead.

214 ↗(On Diff #315772)

If doing this, the comment should make clear: Why store upfront? Why not runOnFunction(f) for each function f?

llvm/test/CodeGen/AArch64/sve-coalesce-ptrues.ll
5 ↗(On Diff #315772)

Just checking: is this comment applicable here? I'm thinking that since this fix doesn't relate to fixing any TypeSize warnings, it doesn't apply.

joechrisellis marked 2 inline comments as done.

Address @peterwaller-arm's comments.

  • Rename pass to aarch64-sve-coalesce-ptrue-intrinsics to make it clear that this deals specifically with ptrue intrinsics.
  • Added more detail into header comment.
  • Clarifications in both the pass and tests.
  • Removed unnecessary dominator tree code.

Coalesce ptrue intrinsics that use the SV_POW2 pattern as well.

fhahn added a comment. · Wed, Jan 13, 1:31 AM

It seems there is already a pass to optimize SVE intrinsics. Is there a reason to add another separate pass, rather than extending the existing one?

Hi @fhahn -- the reason I created a new pass here is that SVEIntrinsicOpts.cpp works on a per-instruction basis -- i.e. it looks at each SVE intrinsic call in isolation and sees what can be done locally from there. The optimisation implemented by this pass, on the other hand, requires looking at the set of ptrue intrinsic calls in each basic block, which is not so easy to do within the existing structure of the sve-intrinsic-opts pass. I think it's probably cleaner to add a new pass to handle this, but I am open to suggestions. 😄

But SVEIntrinsicOpts.cpp could just apply the same optimization as this pass, after the per-instruction optimizations? The main thing I am worried about is not this pass in particular, but that we will end up with a lot of separate passes, each just optimizing a single SVE intrinsic.

Each pass has a cost, e.g. each pass adds another option to enable/disable, needs to be scheduled, needs to iterate over all declarations to find the right intrinsic ID, and needs the whole pass setup boilerplate. Add to that the confusion that there is a pass that sounds general (SVEIntrinsicOpt) but doesn't optimize SVE intrinsics in general. While it may be cleaner/easier to implement optimizations for intrinsics in isolation, I am not sure that justifies the additional cost.

I have not looked at the specific intrinsic optimizations in detail, but from a high-level view it seems like they may benefit from working together, e.g. SVECoalescePTrueIntrinsicsPass might remove some instructions, which then enables further optimizations by SVEIntrinsicOpt, which in turn enables further optimizations by SVECoalescePTrueIntrinsicsPass. Separate passes make this much harder or impossible to handle.

llvm/lib/Target/AArch64/SVECoalescePTrueIntrinsics.cpp
187 ↗(On Diff #316339)

do we need to capture everything here?

264 ↗(On Diff #316339)

Is it possible to use something like for (User *U : F.users()), or is the manual iterator management needed?

265 ↗(On Diff #316339)

only use dyn_cast if you actually check if the result is nullptr. Otherwise, just use cast, which asserts that the result is != nullptr.

Also, I think the user could also be a constant expression, so the cast would fail?

joechrisellis marked 3 inline comments as done.

Address @fhahn's comments.

  • Get rid of new pass, reimplement within aarch64-sve-intrinsic-opts.

@fhahn: thanks for your comments -- you have persuaded me. 🙂

I've folded it all into SVEIntrinsicOpts.cpp, and implemented the rest of your suggestions.