This is an archive of the discontinued LLVM Phabricator instance.

Differential D114542

[CodeGen][SVE] Use whilelo instruction when lowering @llvm.get.active.lane.mask
ClosedPublic

Authored by david-arm on Nov 24 2021, 8:42 AM.

Details

Reviewers

sdesmalen
SjoerdMeijer
c-rhodes
kmclaughlin
efriedma
paulwalker-arm

Commits

rGa31f4bdfe821: [CodeGen][SVE] Use whilelo instruction when lowering @llvm.get.active.lane.mask

Summary

In most common cases the @llvm.get.active.lane.mask intrinsic maps directly
to the SVE whilelo instruction, which already takes overflow into account.
However, currently in SelectionDAGBuilder::visitIntrinsicCall we always lower
this immediately to a generic sequence of instructions that explicitly
take overflow into account. This makes it very difficult to then later
transform back into a single whilelo instruction. Therefore, this patch
introduces a new TLI function called shouldExpandGetActiveLaneMask that asks if
we should lower/expand this to a sequence of generic ISD nodes, or instead
just leave it as an intrinsic for the target to lower.
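
To make the overflow concern concrete, here is a minimal scalar sketch. It assumes the intrinsic's meaning is "lane i is active iff index + i < TC, with the addition not wrapping". The function names are hypothetical, and `whileloModel` is only a simplified model of the instruction (a counter compared lane by lane, where a lane stays inactive once the comparison has failed), not a statement of the architectural pseudocode.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference meaning: the addition is done in 64 bits so it cannot wrap.
std::vector<bool> referenceLaneMask(uint32_t index, uint32_t tripCount,
                                    unsigned lanes) {
  assert(lanes > 0 && "expected at least one lane");
  std::vector<bool> mask(lanes);
  for (unsigned i = 0; i < lanes; ++i)
    mask[i] = static_cast<uint64_t>(index) + i < tripCount;
  return mask;
}

// Naive lowering: the 32-bit addition wraps, so lanes past the wrap point
// compare as "less than tripCount" and are wrongly reported as active.
std::vector<bool> naiveLaneMask(uint32_t index, uint32_t tripCount,
                                unsigned lanes) {
  assert(lanes > 0 && "expected at least one lane");
  std::vector<bool> mask(lanes);
  for (unsigned i = 0; i < lanes; ++i)
    mask[i] = static_cast<uint32_t>(index + i) < tripCount;
  return mask;
}

// Simplified whilelo-style model (an assumption for illustration): a counter
// runs from start, and a lane is active only while every comparison so far
// has succeeded.
std::vector<bool> whileloModel(uint32_t start, uint32_t end, unsigned lanes) {
  assert(lanes > 0 && "expected at least one lane");
  std::vector<bool> mask(lanes);
  bool stillActive = true;
  uint32_t counter = start;
  for (unsigned i = 0; i < lanes; ++i) {
    stillActive = stillActive && counter < end;
    mask[i] = stillActive;
    ++counter; // may wrap, but stillActive is already false by then
  }
  return mask;
}
```

With index = 0xFFFFFFFE and TC = 0xFFFFFFFF on four lanes, the naive version marks lanes 2 and 3 active after the wrap, while the reference and the whilelo-style model both produce a single active lane.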

You can see the significant improvement in code quality for some of the
tests in this file:

CodeGen/AArch64/active_lane_mask.ll

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Nov 24 2021, 8:42 AM

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others. Nov 24 2021, 8:42 AM

david-arm requested review of this revision. Nov 24 2021, 8:42 AM

Herald added a project: Restricted Project. Nov 24 2021, 8:42 AM

Herald added a subscriber: llvm-commits.

david-arm added a parent revision: D114541: [CodeGen] Add scalable vector support for lowering of llvm.get.active.lane.mask. Nov 24 2021, 8:43 AM

Harbormaster completed remote builds in B135854: Diff 389512. Nov 24 2021, 9:18 AM

Just wanted to check how this works for SVE.
For MVE, we have to work quite hard to lower this to MVE specific intrinsic. In MVETailPredication.cpp we have a function IsSafeActiveMask that performs the following checks:

// The active lane intrinsic has this form:
//
//    @llvm.get.active.lane.mask(IV, TC)
//
// Here we perform checks that this intrinsic behaves as expected,
// which means:
//
// 1) Check that the TripCount (TC) belongs to this loop (originally).
// 2) The element count (TC) needs to be sufficiently large that the decrement
//    of element counter doesn't overflow, which means that we need to prove:
//        ceil(ElementCount / VectorWidth) >= TripCount
//    by rounding up ElementCount up:
//        ((ElementCount + (VectorWidth - 1)) / VectorWidth
//    and evaluate if expression isKnownNonNegative:
//        (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
// 3) The IV must be an induction phi with an increment equal to the
//    vector width.

All these checks are performed because we transform the loop into a different form (a tail-predicated one) and new expressions are introduced that should not overflow. The rationale was that it wouldn't be difficult to imagine that a user-facing intrinsic will be introduced one day, in which case we need these checks to see if things behave as expected (@efriedma can correct me if I'm wrong here). Please note that currently this intrinsic is emitted only by the loop vectoriser, and these issues should not occur.
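
As a rough illustration of check 2 in the quoted comment, the arithmetic can be sketched on plain integers. This is only a sketch: the quoted comment talks about proving an expression non-negative, which would be a symbolic check rather than a computation on known values, and `elementCountCoversTripCount` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when
//   ceil(elementCount / vectorWidth) - tripCount >= 0
// i.e. the rounded-up number of vector iterations covers the trip count.
bool elementCountCoversTripCount(uint32_t elementCount, uint32_t vectorWidth,
                                 uint32_t tripCount) {
  assert(vectorWidth > 0 && "vector width must be non-zero");
  // Widen to 64 bits so that elementCount + (vectorWidth - 1) cannot wrap.
  const int64_t roundedUp =
      (static_cast<int64_t>(elementCount) + (vectorWidth - 1)) / vectorWidth;
  return roundedUp - static_cast<int64_t>(tripCount) >= 0;
}
```

For example, 10 elements at a vector width of 4 round up to 3 iterations, so a trip count of 3 passes and a trip count of 4 does not. The widening matters for element counts near the top of the 32-bit range, where the rounding addition would otherwise wrap.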

Now the question is, we don't perform any of these checks, but do we need them?
The sanity check that the IV belongs to a loop would be nice, but that seems problematic here because in SelectionDAG we don't have the Loop view.
The overflow check that we do for MVE and the element count expression and decrement that we introduce does not seem to be applicable for SVE, is that right? That would be good news because that was the hardest part for MVE.

llvm/include/llvm/CodeGen/TargetLowering.h
430	... or should remain as an intrinsic. Perhaps clarify that intrinsic is a target specific intrinsic in that case.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1509	I was expecting `return false` here, now I am a bit confused...

Hi @SjoerdMeijer, so after looking into the intrinsic a bit more and talking with @paulwalker-arm, we realised that LoopVectorize.cpp isn't doing any specific overflow checks, which means that when lowering the intrinsic we have to start worrying about overflow. Fortunately for SVE, the whilelo instruction actually takes care of overflow already, i.e. there is a 1:1 mapping between the intrinsic and the instruction for many cases. This means that we don't really need to do any of the checks that MVE does for overflow because there is an already efficient way of doing this for SVE. Also, according to the documentation the whilelo instruction takes a start and end value and increments a counter from start to end based on the number of elements requested, i.e. <vscale x 16> elements for a <vscale x 16 x i1> type. So we know exactly which types are safe to map to the instruction and which are not. For some types like <vscale x 32 x i1> (which are illegal and have no mapping) we may also benefit from some of the same checks we do for MVE if we want to improve code quality.
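
To illustrate which combinations map to a single instruction, here is a small sketch that turns a predicate lane count and scalar operand width into the instruction text that appears in the updated CHECK lines of active_lane_mask.ll. `whileloFor` is a hypothetical helper, the w0/w1 and x0/x1 registers simply assume the two operands arrive as the first two arguments as in those tests, and `std::optional` needs C++17.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper: minLanes is N in <vscale x N x i1>, operandBits is the
// width of the scalar index/trip-count operands. Returns std::nullopt when
// there is no direct mapping, e.g. <vscale x 32 x i1> or i16 operands.
std::optional<std::string> whileloFor(unsigned minLanes, unsigned operandBits) {
  const char *suffix = nullptr;
  switch (minLanes) {
  case 16: suffix = "b"; break;
  case 8:  suffix = "h"; break;
  case 4:  suffix = "s"; break;
  case 2:  suffix = "d"; break;
  default: return std::nullopt;
  }
  assert(suffix != nullptr);
  if (operandBits != 32 && operandBits != 64)
    return std::nullopt;
  const std::string reg = operandBits == 32 ? "w" : "x";
  return "whilelo p0." + std::string(suffix) + ", " + reg + "0, " + reg + "1";
}
```

Anything that returns `std::nullopt` here corresponds to a case that would still go through the generic expansion.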

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1509	Perhaps the name of the TLI function could be better. What I mean by `lowerGetActiveLaneMask` is lower to generic ISD nodes (UADDO, etc.), and do the overflow checks explicitly. So returning true here means we let SelectionDAGBuilder lower the intrinsic at that point. If we return false here we're basically asking to leave it as an intrinsic.
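
Since the polarity of the return value caused some confusion, here is a stripped-down sketch of the control flow. `TargetHooks`, `SveLikeTarget` and `visitGetActiveLaneMask` are hypothetical stand-ins rather than the real LLVM classes, and the hook is spelled with the name the revision eventually settled on.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the target lowering interface.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // true  -> the builder expands to generic nodes with explicit overflow checks
  // false -> the intrinsic is left alone for the target to lower
  virtual bool shouldExpandGetActiveLaneMask(unsigned /*minLanes*/,
                                             unsigned /*operandBits*/) const {
    return true;
  }
};

// Hypothetical target that can handle the legal predicate/operand types itself.
struct SveLikeTarget : TargetHooks {
  bool shouldExpandGetActiveLaneMask(unsigned minLanes,
                                     unsigned operandBits) const override {
    const bool legalResult =
        minLanes == 2 || minLanes == 4 || minLanes == 8 || minLanes == 16;
    const bool legalOperand = operandBits == 32 || operandBits == 64;
    return !(legalResult && legalOperand);
  }
};

// Mirrors the decision made when the builder visits the intrinsic.
std::string visitGetActiveLaneMask(const TargetHooks &tli, unsigned minLanes,
                                   unsigned operandBits) {
  assert(minLanes > 0 && "expected at least one lane");
  if (!tli.shouldExpandGetActiveLaneMask(minLanes, operandBits))
    return "target intrinsic";
  return "generic expansion";
}
```

The default keeps every other target on the generic path, so only a target that opts out by returning false sees the intrinsic.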

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

In D114542#3153093, @david-arm wrote:

Hi @SjoerdMeijer, so after looking into the intrinsic a bit more and talking with @paulwalker-arm, we realised that LoopVectorize.cpp isn't doing any specific overflow checks, which means that when lowering the intrinsic we have to start worrying about overflow. Fortunately for SVE, the whilelo instruction actually takes care of overflow already, i.e. there is a 1:1 mapping between the intrinsic and the instruction for many cases. This means that we don't really need to do any of the checks that MVE does for overflow because there is an already efficient way of doing this for SVE. Also, according to the documentation the whilelo instruction takes a start and end value and increments a counter from start to end based on the number of elements requested, i.e. <vscale x 16> elements for a <vscale x 16 x i1> type. So we know exactly which types are safe to map to the instruction and which are not. For some types like <vscale x 32 x i1> (which are illegal and have no mapping) we may also benefit from some of the same checks we do for MVE if we want to improve code quality.

Okay, perhaps I should have been clearer with "overflow checks" and the LoopVectorizer. Basically what I wanted to say is that a vectorized loop/body provides us with a few guarantees, i.e. we know the loop executes at least once, and the loop body processes at least VectorWidth elements. If we could assume this in the MVE lowering pass, then we wouldn't have to do all these overflow and sanity checks. We can't make these assumptions because the intrinsic could be emitted in the LV, but also somewhere else where we don't have these guarantees.

In D114542#3153102, @david-arm wrote:

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

In D114542#3153129, @SjoerdMeijer wrote:

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

Could you add Metadata to the intrinsic to state the assumptions?

Matt added a subscriber: Matt. Nov 25 2021, 8:52 AM

paulwalker-arm added inline comments. Nov 25 2021, 10:46 AM

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7112	Probably bike shedding but I feel if there was an ISD node for this intrinsic then the code this function prevents would likely sit in a function called `expandGetActiveLaneMask` so perhaps a better name is `shouldExpandGetActiveLaneMask`? I also think that given such a name the documentation can be very lightweight given expansion is a well established concept within the code generator.

In D114542#3153131, @tschuett wrote:

Could you add Metadata to the intrinsic to state the assumptions?

Something along those lines would probably be possible, but defeats the purpose of the intrinsic, which is about communicating information from the middle-end to the back-end. If we need an intrinsic and also meta-data to achieve this, then we should probably revise the definition of the intrinsic. But it looks like we don't need to do that at the moment, and we are also good for SVE.

In D114542#3155048, @SjoerdMeijer wrote:

Something along those lines would probably be possible, but defeats the purpose of the intrinsic, which is about communicating information from the middle-end to the back-end. If we need an intrinsic and also meta-data to achieve this, then we should probably revise the definition of the intrinsic. But it looks like we don't need to do that at the moment, and we are also good for SVE.

Totally up to you. Either state in the LangRef that it does not overflow, or have a different intrinsic, e.g. `llvm.get.active.lane.mask_without_overflow`.

Changed TLI interface to shouldExpandGetActiveLaneMask and updated the function comment.

paulwalker-arm accepted this revision. Nov 26 2021, 4:03 AM

This revision is now accepted and ready to land. Nov 26 2021, 4:03 AM

Harbormaster completed remote builds in B136191: Diff 389987. Nov 26 2021, 4:35 AM

Closed by commit rGa31f4bdfe821: [CodeGen][SVE] Use whilelo instruction when lowering @llvm.get.active.lane.mask (authored by david-arm). Nov 29 2021, 12:30 AM

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGa31f4bdfe821: [CodeGen][SVE] Use whilelo instruction when lowering @llvm.get.active.lane.mask.

Revision Contents

Path                                                     Size
llvm/include/llvm/CodeGen/TargetLowering.h               6 lines
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp    10 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.h            2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp          24 lines
llvm/test/CodeGen/AArch64/active_lane_mask.ll            214 lines

Diff 390265

llvm/include/llvm/CodeGen/TargetLowering.h

(419 lines not shown; context: MachineMemOperand::Flags getStoreMemOperandFlags(const StoreInst &SI,)

                                                    const DataLayout &DL) const;
   MachineMemOperand::Flags getAtomicMemOperandFlags(const Instruction &AI,
                                                     const DataLayout &DL) const;

   virtual bool isSelectSupported(SelectSupportKind /*kind*/) const {
     return true;
   }

+  /// Return true if the @llvm.get.active.lane.mask intrinsic should be expanded
+  /// using generic code in SelectionDAGBuilder.
+  virtual bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const {
+    return true;
+  }
+
   /// Return true if it is profitable to convert a select of FP constants into
   /// a constant pool load whose address depends on the select condition. The
   /// parameter may be used to differentiate a select with FP compare from
   /// integer compare.
   virtual bool reduceSelectOfFPConstantLoads(EVT CmpOpVT) const {
     return true;
   }

(4,277 lines not shown)

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

(7,099 lines not shown; context: case Intrinsic::ptrmask: {)

     SDValue Const = getValue(I.getOperand(1));

     EVT PtrVT = Ptr.getValueType();
     setValue(&I, DAG.getNode(ISD::AND, sdl, PtrVT, Ptr,
                              DAG.getZExtOrTrunc(Const, sdl, PtrVT)));
     return;
   }
   case Intrinsic::get_active_lane_mask: {
+    EVT CCVT = TLI.getValueType(DAG.getDataLayout(), I.getType());
     SDValue Index = getValue(I.getOperand(0));
-    SDValue TripCount = getValue(I.getOperand(1));
     EVT ElementVT = Index.getValueType();
-    EVT CCVT = TLI.getValueType(DAG.getDataLayout(), I.getType());
+    if (!TLI.shouldExpandGetActiveLaneMask(CCVT, ElementVT)) {
+      visitTargetIntrinsic(I, Intrinsic);
+      return;
+    }
+
+    SDValue TripCount = getValue(I.getOperand(1));
     auto VecTy = CCVT.changeVectorElementType(ElementVT);

     SDValue VectorIndex, VectorTripCount;
     if (VecTy.isScalableVector()) {
       VectorIndex = DAG.getSplatVector(VecTy, sdl, Index);
       VectorTripCount = DAG.getSplatVector(VecTy, sdl, TripCount);
     } else {
       VectorIndex = DAG.getSplatBuildVector(VecTy, sdl, Index);

(4,135 lines not shown)

llvm/lib/Target/AArch64/AArch64ISelLowering.h

(838 lines not shown; context: public:)

   }

   bool isAllActivePredicate(SDValue N) const;
   EVT getPromotedVTForPredicate(EVT VT) const;

   EVT getAsmOperandValueType(const DataLayout &DL, Type *Ty,
                              bool AllowUnknown = false) const override;

+  bool shouldExpandGetActiveLaneMask(EVT VT, EVT OpVT) const override;
+
 private:
   /// Keep a pointer to the AArch64Subtarget around so that we can
   /// make the right decision when generating code for different targets.
   const AArch64Subtarget *Subtarget;

   bool isExtFreeImpl(const Instruction *Ext) const override;

   void addTypeForNEON(MVT VT);

(291 lines not shown)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

(1,497 lines not shown; context: if (Subtarget->isLittleEndian()) {)

     for (unsigned im = (unsigned)ISD::PRE_INC;
          im != (unsigned)ISD::LAST_INDEXED_MODE; ++im) {
       setIndexedLoadAction(im, VT, Legal);
       setIndexedStoreAction(im, VT, Legal);
     }
   }
 }

+bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
+                                                          EVT OpVT) const {
+  // Only SVE has a 1:1 mapping from intrinsic -> instruction (whilelo).
+  if (!Subtarget->hasSVE())
+    return true;
+
+  // We can only support legal predicate result types.
+  if (ResVT != MVT::nxv2i1 && ResVT != MVT::nxv4i1 && ResVT != MVT::nxv8i1 &&
+      ResVT != MVT::nxv16i1)
+    return true;
+
+  // The whilelo instruction only works with i32 or i64 scalar inputs.
+  if (OpVT != MVT::i32 && OpVT != MVT::i64)
+    return true;
+
+  return false;
+}
+
 void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

   // By default everything must be expanded.
   for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)
     setOperationAction(Op, VT, Expand);

   // We use EXTRACT_SUBVECTOR to "cast" a scalable vector to a fixed length one.

(2,771 lines not shown; context: SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,)

   case Intrinsic::aarch64_sve_udot: {
     unsigned Opcode = (IntNo == Intrinsic::aarch64_neon_udot ||
                        IntNo == Intrinsic::aarch64_sve_udot)
                           ? AArch64ISD::UDOT
                           : AArch64ISD::SDOT;
     return DAG.getNode(Opcode, dl, Op.getValueType(), Op.getOperand(1),
                        Op.getOperand(2), Op.getOperand(3));
   }
+  case Intrinsic::get_active_lane_mask: {
+    SDValue ID =
+        DAG.getTargetConstant(Intrinsic::aarch64_sve_whilelo, dl, MVT::i64);
+    return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, Op.getValueType(), ID,
+                       Op.getOperand(1), Op.getOperand(2));
+  }
   }
 }

 bool AArch64TargetLowering::shouldExtendGSIndex(EVT VT, EVT &EltTy) const {
   if (VT.getVectorElementType() == MVT::i8 ||
       VT.getVectorElementType() == MVT::i16) {
     EltTy = MVT::i32;
     return true;

(15,165 lines not shown)

llvm/test/CodeGen/AArch64/active_lane_mask.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

 define <vscale x 16 x i1> @lane_mask_nxv16i1_i32(i32 %index, i32 %TC) {
 ; CHECK-LABEL: lane_mask_nxv16i1_i32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-NEXT: addvl sp, sp, #-1
-; CHECK-NEXT: str p4, [sp, #7, mul vl] // 2-byte Folded Spill
-; CHECK-NEXT: .cfi_escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0x2e, 0x00, 0x1e, 0x22 // sp + 16 + 8 * VG
-; CHECK-NEXT: .cfi_offset w29, -16
-; CHECK-NEXT: index z0.s, #0, #1
-; CHECK-NEXT: mov z2.s, w0
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: ptrue p0.s
-; CHECK-NEXT: incw z1.s
-; CHECK-NEXT: add z3.s, z2.s, z0.s
-; CHECK-NEXT: incw z0.s, all, mul #2
-; CHECK-NEXT: add z4.s, z2.s, z1.s
-; CHECK-NEXT: incw z1.s, all, mul #2
-; CHECK-NEXT: cmphi p1.s, p0/z, z2.s, z3.s
-; CHECK-NEXT: add z0.s, z2.s, z0.s
-; CHECK-NEXT: cmphi p2.s, p0/z, z2.s, z4.s
-; CHECK-NEXT: add z1.s, z2.s, z1.s
-; CHECK-NEXT: uzp1 p1.h, p1.h, p2.h
-; CHECK-NEXT: cmphi p2.s, p0/z, z2.s, z0.s
-; CHECK-NEXT: cmphi p3.s, p0/z, z2.s, z1.s
-; CHECK-NEXT: mov z2.s, w1
-; CHECK-NEXT: uzp1 p2.h, p2.h, p3.h
-; CHECK-NEXT: cmphi p3.s, p0/z, z2.s, z4.s
-; CHECK-NEXT: cmphi p4.s, p0/z, z2.s, z3.s
-; CHECK-NEXT: uzp1 p1.b, p1.b, p2.b
-; CHECK-NEXT: uzp1 p2.h, p4.h, p3.h
-; CHECK-NEXT: cmphi p3.s, p0/z, z2.s, z0.s
-; CHECK-NEXT: cmphi p0.s, p0/z, z2.s, z1.s
-; CHECK-NEXT: ptrue p4.b
-; CHECK-NEXT: uzp1 p0.h, p3.h, p0.h
-; CHECK-NEXT: not p1.b, p4/z, p1.b
-; CHECK-NEXT: uzp1 p0.b, p2.b, p0.b
-; CHECK-NEXT: and p0.b, p4/z, p1.b, p0.b
-; CHECK-NEXT: ldr p4, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: addvl sp, sp, #1
-; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: whilelo p0.b, w0, w1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i32(i32 %index, i32 %TC)
   ret <vscale x 16 x i1> %active.lane.mask
 }

 define <vscale x 8 x i1> @lane_mask_nxv8i1_i32(i32 %index, i32 %TC) {
 ; CHECK-LABEL: lane_mask_nxv8i1_i32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: index z0.s, #0, #1
-; CHECK-NEXT: mov z2.s, w0
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: add z0.s, z2.s, z0.s
-; CHECK-NEXT: incw z1.s
-; CHECK-NEXT: ptrue p0.s
-; CHECK-NEXT: add z1.s, z2.s, z1.s
-; CHECK-NEXT: cmphi p2.s, p0/z, z2.s, z0.s
-; CHECK-NEXT: cmphi p3.s, p0/z, z2.s, z1.s
-; CHECK-NEXT: mov z2.s, w1
-; CHECK-NEXT: ptrue p1.h
-; CHECK-NEXT: uzp1 p2.h, p2.h, p3.h
-; CHECK-NEXT: cmphi p3.s, p0/z, z2.s, z1.s
-; CHECK-NEXT: cmphi p0.s, p0/z, z2.s, z0.s
-; CHECK-NEXT: not p2.b, p1/z, p2.b
-; CHECK-NEXT: uzp1 p0.h, p0.h, p3.h
-; CHECK-NEXT: and p0.b, p1/z, p2.b, p0.b
+; CHECK-NEXT: whilelo p0.h, w0, w1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i32(i32 %index, i32 %TC)
   ret <vscale x 8 x i1> %active.lane.mask
 }

 define <vscale x 4 x i1> @lane_mask_nxv4i1_i32(i32 %index, i32 %TC) {
 ; CHECK-LABEL: lane_mask_nxv4i1_i32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.s
-; CHECK-NEXT: index z0.s, w0, #1
-; CHECK-NEXT: mov z1.s, w0
-; CHECK-NEXT: mov z2.s, w1
-; CHECK-NEXT: cmphi p1.s, p0/z, z1.s, z0.s
-; CHECK-NEXT: cmphi p2.s, p0/z, z2.s, z0.s
-; CHECK-NEXT: not p1.b, p0/z, p1.b
-; CHECK-NEXT: and p0.b, p0/z, p1.b, p2.b
+; CHECK-NEXT: whilelo p0.s, w0, w1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i32(i32 %index, i32 %TC)
   ret <vscale x 4 x i1> %active.lane.mask
 }

 define <vscale x 2 x i1> @lane_mask_nxv2i1_i32(i32 %index, i32 %TC) {
 ; CHECK-LABEL: lane_mask_nxv2i1_i32:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: // kill: def $w0 killed $w0 def $x0
-; CHECK-NEXT: mov z1.d, x0
-; CHECK-NEXT: index z0.d, #0, #1
-; CHECK-NEXT: and z1.d, z1.d, #0xffffffff
-; CHECK-NEXT: // kill: def $w1 killed $w1 def $x1
-; CHECK-NEXT: mov z2.d, x1
-; CHECK-NEXT: adr z0.d, [z1.d, z0.d, uxtw]
-; CHECK-NEXT: ptrue p0.d
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: and z2.d, z2.d, #0xffffffff
-; CHECK-NEXT: and z1.d, z1.d, #0xffffffff
-; CHECK-NEXT: cmpne p1.d, p0/z, z1.d, z0.d
-; CHECK-NEXT: cmphi p2.d, p0/z, z2.d, z1.d
-; CHECK-NEXT: not p1.b, p0/z, p1.b
-; CHECK-NEXT: and p0.b, p0/z, p1.b, p2.b
+; CHECK-NEXT: whilelo p0.d, w0, w1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i32(i32 %index, i32 %TC)
   ret <vscale x 2 x i1> %active.lane.mask
 }

 define <vscale x 16 x i1> @lane_mask_nxv16i1_i64(i64 %index, i64 %TC) {
 ; CHECK-LABEL: lane_mask_nxv16i1_i64:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-NEXT: addvl sp, sp, #-1
-; CHECK-NEXT: str p6, [sp, #5, mul vl] // 2-byte Folded Spill
-; CHECK-NEXT: str p5, [sp, #6, mul vl] // 2-byte Folded Spill
-; CHECK-NEXT: str p4, [sp, #7, mul vl] // 2-byte Folded Spill
-; CHECK-NEXT: .cfi_escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0x2e, 0x00, 0x1e, 0x22 // sp + 16 + 8 * VG
-; CHECK-NEXT: .cfi_offset w29, -16
-; CHECK-NEXT: index z0.d, #0, #1
-; CHECK-NEXT: mov z3.d, x0
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: mov z2.d, z0.d
-; CHECK-NEXT: incd z1.d
-; CHECK-NEXT: incd z2.d, all, mul #2
-; CHECK-NEXT: mov z5.d, z1.d
-; CHECK-NEXT: ptrue p0.d
-; CHECK-NEXT: incd z5.d, all, mul #2
-; CHECK-NEXT: add z4.d, z3.d, z0.d
-; CHECK-NEXT: add z6.d, z3.d, z1.d
-; CHECK-NEXT: add z7.d, z3.d, z2.d
-; CHECK-NEXT: add z24.d, z3.d, z5.d
-; CHECK-NEXT: incd z0.d, all, mul #4
-; CHECK-NEXT: cmphi p1.d, p0/z, z3.d, z4.d
-; CHECK-NEXT: incd z1.d, all, mul #4
-; CHECK-NEXT: cmphi p2.d, p0/z, z3.d, z6.d
-; CHECK-NEXT: cmphi p3.d, p0/z, z3.d, z7.d
-; CHECK-NEXT: cmphi p4.d, p0/z, z3.d, z24.d
-; CHECK-NEXT: incd z2.d, all, mul #4
-; CHECK-NEXT: incd z5.d, all, mul #4
-; CHECK-NEXT: add z0.d, z3.d, z0.d
-; CHECK-NEXT: uzp1 p1.s, p1.s, p2.s
-; CHECK-NEXT: uzp1 p2.s, p3.s, p4.s
-; CHECK-NEXT: add z1.d, z3.d, z1.d
-; CHECK-NEXT: add z2.d, z3.d, z2.d
-; CHECK-NEXT: add z5.d, z3.d, z5.d
-; CHECK-NEXT: uzp1 p1.h, p1.h, p2.h
-; CHECK-NEXT: cmphi p2.d, p0/z, z3.d, z0.d
-; CHECK-NEXT: cmphi p3.d, p0/z, z3.d, z1.d
-; CHECK-NEXT: cmphi p4.d, p0/z, z3.d, z2.d
-; CHECK-NEXT: cmphi p5.d, p0/z, z3.d, z5.d
-; CHECK-NEXT: uzp1 p2.s, p2.s, p3.s
-; CHECK-NEXT: uzp1 p3.s, p4.s, p5.s
-; CHECK-NEXT: mov z3.d, x1
-; CHECK-NEXT: uzp1 p2.h, p2.h, p3.h
-; CHECK-NEXT: cmphi p3.d, p0/z, z3.d, z6.d
-; CHECK-NEXT: cmphi p4.d, p0/z, z3.d, z4.d
-; CHECK-NEXT: uzp1 p1.b, p1.b, p2.b
-; CHECK-NEXT: uzp1 p2.s, p4.s, p3.s
-; CHECK-NEXT: cmphi p3.d, p0/z, z3.d, z7.d
-; CHECK-NEXT: cmphi p4.d, p0/z, z3.d, z24.d
-; CHECK-NEXT: cmphi p5.d, p0/z, z3.d, z0.d
-; CHECK-NEXT: cmphi p6.d, p0/z, z3.d, z1.d
-; CHECK-NEXT: uzp1 p3.s, p3.s, p4.s
-; CHECK-NEXT: uzp1 p4.s, p5.s, p6.s
-; CHECK-NEXT: cmphi p5.d, p0/z, z3.d, z2.d
-; CHECK-NEXT: cmphi p0.d, p0/z, z3.d, z5.d
-; CHECK-NEXT: uzp1 p0.s, p5.s, p0.s
-; CHECK-NEXT: ptrue p5.b
-; CHECK-NEXT: uzp1 p2.h, p2.h, p3.h
-; CHECK-NEXT: uzp1 p0.h, p4.h, p0.h
-; CHECK-NEXT: not p1.b, p5/z, p1.b
-; CHECK-NEXT: uzp1 p0.b, p2.b, p0.b
-; CHECK-NEXT: and p0.b, p5/z, p1.b, p0.b
-; CHECK-NEXT: ldr p6, [sp, #5, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p5, [sp, #6, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: ldr p4, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: addvl sp, sp, #1
-; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: whilelo p0.b, x0, x1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 %index, i64 %TC)
   ret <vscale x 16 x i1> %active.lane.mask
 }

 define <vscale x 8 x i1> @lane_mask_nxv8i1_i64(i64 %index, i64 %TC) {
 ; CHECK-LABEL: lane_mask_nxv8i1_i64:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-NEXT: addvl sp, sp, #-1
-; CHECK-NEXT: str p4, [sp, #7, mul vl] // 2-byte Folded Spill
-; CHECK-NEXT: .cfi_escape 0x0f, 0x0c, 0x8f, 0x00, 0x11, 0x10, 0x22, 0x11, 0x08, 0x92, 0x2e, 0x00, 0x1e, 0x22 // sp + 16 + 8 * VG
-; CHECK-NEXT: .cfi_offset w29, -16
-; CHECK-NEXT: index z0.d, #0, #1
-; CHECK-NEXT: mov z2.d, x0
-; CHECK-NEXT: mov z1.d, z0.d
-; CHECK-NEXT: ptrue p0.d
-; CHECK-NEXT: incd z1.d
-; CHECK-NEXT: add z3.d, z2.d, z0.d
-; CHECK-NEXT: incd z0.d, all, mul #2
-; CHECK-NEXT: add z4.d, z2.d, z1.d
-; CHECK-NEXT: incd z1.d, all, mul #2
-; CHECK-NEXT: cmphi p1.d, p0/z, z2.d, z3.d
-; CHECK-NEXT: add z0.d, z2.d, z0.d
-; CHECK-NEXT: cmphi p2.d, p0/z, z2.d, z4.d
-; CHECK-NEXT: add z1.d, z2.d, z1.d
-; CHECK-NEXT: uzp1 p1.s, p1.s, p2.s
-; CHECK-NEXT: cmphi p2.d, p0/z, z2.d, z0.d
-; CHECK-NEXT: cmphi p3.d, p0/z, z2.d, z1.d
-; CHECK-NEXT: mov z2.d, x1
-; CHECK-NEXT: uzp1 p2.s, p2.s, p3.s
-; CHECK-NEXT: cmphi p3.d, p0/z, z2.d, z4.d
-; CHECK-NEXT: cmphi p4.d, p0/z, z2.d, z3.d
-; CHECK-NEXT: uzp1 p1.h, p1.h, p2.h
-; CHECK-NEXT: uzp1 p2.s, p4.s, p3.s
-; CHECK-NEXT: cmphi p3.d, p0/z, z2.d, z0.d
-; CHECK-NEXT: cmphi p0.d, p0/z, z2.d, z1.d
-; CHECK-NEXT: ptrue p4.h
-; CHECK-NEXT: uzp1 p0.s, p3.s, p0.s
-; CHECK-NEXT: not p1.b, p4/z, p1.b
-; CHECK-NEXT: uzp1 p0.h, p2.h, p0.h
-; CHECK-NEXT: and p0.b, p4/z, p1.b, p0.b
-; CHECK-NEXT: ldr p4, [sp, #7, mul vl] // 2-byte Folded Reload
-; CHECK-NEXT: addvl sp, sp, #1
-; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: whilelo p0.h, x0, x1
 ; CHECK-NEXT: ret
   %active.lane.mask = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 %index, i64 %TC)
   ret <vscale x 8 x i1> %active.lane.mask
 }

	define <vscale x 4 x i1> @lane_mask_nxv4i1_i64(i64 %index, i64 %TC) {			define <vscale x 4 x i1> @lane_mask_nxv4i1_i64(i64 %index, i64 %TC) {
	; CHECK-LABEL: lane_mask_nxv4i1_i64:			; CHECK-LABEL: lane_mask_nxv4i1_i64:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: index z0.d, #0, #1			; CHECK-NEXT: whilelo p0.s, x0, x1
	; CHECK-NEXT: mov z2.d, x0
	; CHECK-NEXT: mov z1.d, z0.d
	; CHECK-NEXT: add z0.d, z2.d, z0.d
	; CHECK-NEXT: incd z1.d
	; CHECK-NEXT: ptrue p0.d
	; CHECK-NEXT: add z1.d, z2.d, z1.d
	; CHECK-NEXT: cmphi p2.d, p0/z, z2.d, z0.d
	; CHECK-NEXT: cmphi p3.d, p0/z, z2.d, z1.d
	; CHECK-NEXT: mov z2.d, x1
	; CHECK-NEXT: ptrue p1.s
	; CHECK-NEXT: uzp1 p2.s, p2.s, p3.s
	; CHECK-NEXT: cmphi p3.d, p0/z, z2.d, z1.d
	; CHECK-NEXT: cmphi p0.d, p0/z, z2.d, z0.d
	; CHECK-NEXT: not p2.b, p1/z, p2.b
	; CHECK-NEXT: uzp1 p0.s, p0.s, p3.s
	; CHECK-NEXT: and p0.b, p1/z, p2.b, p0.b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%active.lane.mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %TC)			%active.lane.mask = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %TC)
	ret <vscale x 4 x i1> %active.lane.mask			ret <vscale x 4 x i1> %active.lane.mask
	}			}

	define <vscale x 2 x i1> @lane_mask_nxv2i1_i64(i64 %index, i64 %TC) {			define <vscale x 2 x i1> @lane_mask_nxv2i1_i64(i64 %index, i64 %TC) {
	; CHECK-LABEL: lane_mask_nxv2i1_i64:			; CHECK-LABEL: lane_mask_nxv2i1_i64:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.d			; CHECK-NEXT: whilelo p0.d, x0, x1
	; CHECK-NEXT: index z0.d, x0, #1
	; CHECK-NEXT: mov z1.d, x0
	; CHECK-NEXT: mov z2.d, x1
	; CHECK-NEXT: cmphi p1.d, p0/z, z1.d, z0.d
	; CHECK-NEXT: cmphi p2.d, p0/z, z2.d, z0.d
	; CHECK-NEXT: not p1.b, p0/z, p1.b
	; CHECK-NEXT: and p0.b, p0/z, p1.b, p2.b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%active.lane.mask = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 %index, i64 %TC)			%active.lane.mask = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 %index, i64 %TC)
	ret <vscale x 2 x i1> %active.lane.mask			ret <vscale x 2 x i1> %active.lane.mask
	}			}

	define <vscale x 16 x i1> @lane_mask_nxv16i1_i8(i8 %index, i8 %TC) {			define <vscale x 16 x i1> @lane_mask_nxv16i1_i8(i8 %index, i8 %TC) {
	; CHECK-LABEL: lane_mask_nxv16i1_i8:			; CHECK-LABEL: lane_mask_nxv16i1_i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
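As a reading aid for the CHECK lines above, here is a minimal Python sketch of what the old (left-hand) expansion computes for each lane, next to the reference meaning of the intrinsic. It assumes the left-hand sequence does a wrapping add of `index` and the lane number, one unsigned compare to detect wraparound, one unsigned compare against `TC`, and a final `not`/`and`. The function names and the `bits` parameter are hypothetical and not part of the patch.

```python
def active_lane_mask(index, tc, lanes):
    """Reference meaning: lane i is active iff index + i < tc,
    with the addition performed without wraparound."""
    return [index + i < tc for i in range(lanes)]


def generic_expansion(index, tc, lanes, bits=64):
    """Model of the old generic sequence, one lane at a time."""
    wrap = (1 << bits) - 1
    mask = []
    for i in range(lanes):
        lane_index = (index + i) & wrap   # wrapping add (index/add)
        overflowed = index > lane_index   # unsigned compare (cmphi): the add wrapped
        in_range = tc > lane_index        # unsigned compare (cmphi): below trip count
        mask.append((not overflowed) and in_range)  # not + and
    return mask


if __name__ == "__main__":
    # Near the top of an 8-bit range the wraparound check matters:
    # without it, lanes 2 and 3 would wrongly compare as "in range".
    print(generic_expansion(254, 255, 4, bits=8))
    print(active_lane_mask(254, 255, 4))
```

Under these assumptions both functions agree for every in-range input, which is the property that lets a single `whilelo` stand in for the whole sequence on the right-hand side.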


Diff 390265
