This is an archive of the discontinued LLVM Phabricator instance.

Differential D120152

[AArch64][SVE] Match VLS all-1's masks to PTRUE
Abandoned, Public

Authored by cameron.mcinally on Feb 18 2022, 11:42 AM.

Details

Reviewers

paulwalker-arm
bsmith
david-arm
efriedma

Summary

Here's a patch to match VLS all-1s masks to PTRUE. There were a few places, before and after legalization, that this could be done, but I think SETCC_MERGE_ZERO combines are the best fit.

Diff Detail

Unit Tests: Failed

Time	Test
60,050 ms	x64 debian > BOLT.runtime/X86::exceptions-pic.test
60,110 ms	x64 debian > LLVM.CodeGen/NVPTX::wmma.py
60,070 ms	x64 debian > ThreadSanitizer-x86_64.ThreadSanitizer-x86_64::restore_stack.cpp
60,110 ms	x64 debian > libFuzzer.libFuzzer::large.test

Event Timeline

cameron.mcinally created this revision. Feb 18 2022, 11:42 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett. Feb 18 2022, 11:42 AM

cameron.mcinally requested review of this revision. Feb 18 2022, 11:42 AM

Herald added a project: Restricted Project. Feb 18 2022, 11:42 AM

Herald added a subscriber: llvm-commits.

Updated Diff.

Why didn't or cannot InstCombine catch this?

Harbormaster completed remote builds in B150467: Diff 409989. Feb 18 2022, 1:20 PM

In D120152#3332664, @tschuett wrote:

Why didn't or cannot InstCombine catch this?

I may be misunderstanding, but this pattern is just a legalized truncate, e.g. ({1,1,1,1} & splat(1)) != splat(0). The VLS->VLA transition is creating a bunch of extra nodes that need to be matched.
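To make the pattern concrete, here is a small standalone model in plain C++ (not LLVM code) of what a vector truncate to i1 looks like once legalized: each lane is ANDed with 1 and compared against 0. The names `legalizedTruncToI1` and `isAllActive`, and the use of `std::vector<uint8_t>` for the lanes, are assumptions made only for this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models (X & splat(1)) != splat(0), evaluated lane by lane.
std::vector<bool> legalizedTruncToI1(const std::vector<uint8_t> &X) {
  std::vector<bool> Mask;
  Mask.reserve(X.size());
  for (uint8_t Lane : X)
    Mask.push_back((Lane & 1u) != 0u);
  return Mask;
}

// True when every lane of the mask is set, i.e. the mask matches what an
// all-active predicate would hold for those lanes.
bool isAllActive(const std::vector<bool> &Mask) {
  for (bool B : Mask)
    if (!B)
      return false;
  return true;
}
```

With the input `{1,1,1,1}` from the comment above, every lane compares not-equal to zero, so the whole expression is an all-ones mask. That is the case the patch wants to recognise as PTRUE.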

Fix formatting for the Lint bots.

Notice the "sign_extend" -> "sext" change to fit in 80 columns. This no longer matches the ISD node naming scheme, so it's a little weird. I didn't see a better fix for it though.

Harbormaster completed remote builds in B150486: Diff 410015. Feb 18 2022, 3:37 PM

Hi @cameron.mcinally, I really like what you're trying to do in this patch and the codegen indeed looks a lot better! I just had some suggestions about a possibly simpler and more comprehensive approach that might give us more overall benefit.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
17054	I think you also need to check for `&& !Negated` here. Alternatively, I think you could just do:

	APInt SplatVal;
	if (isAllActivePredicate(DAG, Pred) && LHS.getOpcode() == ISD::AND &&
	    ISD::isConstantSplatVector(LHS.getOperand(1), SplatVal) && SplatVal == 1) {

	The only additional value that `isPow2Splat` adds here is that it also checks for AArch64ISD::DUP nodes, but I imagine at the point we're doing the DAG combines here we haven't generated an AArch64 ISD node yet?
17060	It feels like we should have a more basic DAG combine here, i.e. something in either `DAGCombiner::visitINSERT_SUBVECTOR` or in `performInsertSubvectorCombine`, that basically combines

	<vscale x M x iXY> insert_subvector <vscale x M x iXY> undef, <N x iXY> <splat of iXY A>

	into

	<vscale x M x iXY> <splat of iXY A>

	when we know that vscale x M == N. If you implement such a DAG combine it might benefit other parts of the code too? It would then mean here you should only have to check if `Trunc` is a splat of 1, which may also catch more cases.
17062	Again, here I think you need to check `&& !TruncNegated`

Hi @cameron.mcinally, sorry I mentioned @craig.topper in my previous comment, but I meant you. It's because I've also just reviewed one of Craig's patches so I got mixed up. :)

Does D120328 achieve the effect you're after @cameron.mcinally? I need to pull out and extend the LowerSPLAT_VECTOR related change but figured I'd push up my current work in case it helps.

Updated patch based on @david-arm's review.

@paulwalker-arm, D120328 looks good too. I'm happy to go with that one. But I notice that it will define bits that may be undef otherwise. E.g. we're inserting a 1/4 width vector into a full width vector, or Idx != 0.
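The concern about defining otherwise-undef bits can be illustrated with a small standalone lane model in plain C++ (not LLVM code). An empty `std::optional` stands for an undef lane. The names `Lane`, `insertSplatIntoUndef` and `countUndefLanes` are hypothetical and exist only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using Lane = std::optional<bool>; // an empty optional models an undef lane

// Models insert_subvector(undef, splat(Value) with SubLanes lanes, Idx)
// into a vector that has TotalLanes lanes at run time.
std::vector<Lane> insertSplatIntoUndef(std::size_t TotalLanes,
                                       std::size_t SubLanes, std::size_t Idx,
                                       bool Value) {
  std::vector<Lane> Result(TotalLanes); // every lane starts out undef
  for (std::size_t I = 0; I < SubLanes && Idx + I < TotalLanes; ++I)
    Result[Idx + I] = Value;
  return Result;
}

// Counts the lanes that are still undef after the insert. Folding the
// insert to a full-width splat would give each of these a defined value.
std::size_t countUndefLanes(const std::vector<Lane> &V) {
  std::size_t N = 0;
  for (const Lane &L : V)
    if (!L.has_value())
      ++N;
  return N;
}
```

In this model, inserting 8 lanes at index 0 into an 8-lane vector leaves no undef lanes, so a full splat is equivalent. Inserting 4 lanes into a 16-lane vector (the quarter-width case), or inserting at a non-zero index, leaves 12 lanes undef that a full splat would define.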

cameron.mcinally added inline comments. Feb 22 2022, 12:04 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14458	I'm not sure if `getVScaleForTuning` is the right way to go here, but it seemed like the cleanest solution. I also wonder if all `insert_subvector(undef, splat(X), 0)->splat(X)` should be canonicalized here. I don't have a strong opinion on it though.

Harbormaster completed remote builds in B150910: Diff 410604. Feb 22 2022, 1:02 PM

I also wondered about defining bits that may be undef otherwise, but am unsure how much it really matters. I'll see if I can limit when the combine applies (or perhaps just handle the constant case) to reduce any potential downsides and report back. I'll also note that I believe this patch suffers from the same problem, because getVScaleForTuning() is only a hint. You can use getMinSVEVectorSizeInBits and getMaxSVEVectorSizeInBits to see if the true size is known, but the downside is that you'll only be optimising the cases when the fixed length vector is the same size as the scalable equivalent, which means not all vectorised loops will see the benefit.
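The difference between a tuning hint and a known size can be sketched as follows. This is a standalone model in plain C++, not the patch's code: `SVEBounds`, `exactVScale` and `insertCoversWholeVector` are hypothetical names, and the two fields merely stand in for the values that getMinSVEVectorSizeInBits and getMaxSVEVectorSizeInBits would report. Treating a zero bound as "unknown" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the two subtarget queries named above.
// A value of 0 is treated as "bound not known" (assumption of this sketch).
struct SVEBounds {
  unsigned MinBits;
  unsigned MaxBits;
};

// Returns vscale only when the minimum and maximum agree, i.e. the vector
// register size is known exactly rather than hinted.
std::optional<unsigned> exactVScale(const SVEBounds &B) {
  if (B.MinBits == 0 || B.MinBits != B.MaxBits || B.MinBits % 128 != 0)
    return std::nullopt;
  return B.MinBits / 128; // SVE vscale counts 128-bit granules
}

// The fold to a full splat is only safe in this model when the fixed-length
// subvector covers every lane of the scalable vector: MinElts * vscale == N.
bool insertCoversWholeVector(const SVEBounds &B, uint64_t MinElts,
                             uint64_t FixedElts) {
  std::optional<unsigned> VScale = exactVScale(B);
  return VScale && MinElts * *VScale == FixedElts;
}
```

With a 512-bit size known exactly, a `<vscale x 2 x i1>` vector has 8 lanes, so an `<8 x i1>` insert covers it. If only a range such as 128 to 512 bits is known, the check fails even though a tuning hint might have suggested vscale is 4.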

Good point. Replacing the lowered truncates with ptrue sounds like a win in the general case. Abandoning this Diff.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	22 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-ptrue.ll	51 lines

Diff 410604

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[14,424 lines not shown] static SDValue performConcatVectorsCombine(SDNode *N,
   return DAG.getNode(ISD::BITCAST, dl, VT,
                      DAG.getNode(ISD::CONCAT_VECTORS, dl, ConcatTy,
                                  DAG.getNode(ISD::BITCAST, dl, RHSTy, N0),
                                  RHS));
 }

 static SDValue
 performInsertSubvectorCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,
-                              SelectionDAG &DAG) {
+                              SelectionDAG &DAG,
+                              const AArch64Subtarget *Subtarget) {
   SDLoc DL(N);
   SDValue Vec = N->getOperand(0);
   SDValue SubVec = N->getOperand(1);
   uint64_t IdxVal = N->getConstantOperandVal(2);
   EVT VecVT = Vec.getValueType();
   EVT SubVT = SubVec.getValueType();

+  // Check for fixed vector mask splats inserted into scalable vectors.
+  if (VecVT.isScalableVector() &&
+      DAG.getTargetLoweringInfo().isTypeLegal(VecVT) &&
+      SubVT.isFixedLengthVector() && SubVT.getVectorElementType() == MVT::i1) {
+    uint64_t VecEC = VecVT.getVectorElementCount().getKnownMinValue();
+    uint64_t SubVecEC = SubVT.getVectorElementCount().getKnownMinValue();
+
+    bool Negated;
+    uint64_t SplatVal;
+    if (Vec.isUndef() && IdxVal == 0 &&
+        isPow2Splat(SubVec, SplatVal, Negated) &&
+        SplatVal == 1 && !Negated &&
+        VecEC * Subtarget->getVScaleForTuning() == SubVecEC)
+      return DAG.getNode(ISD::SPLAT_VECTOR, DL, VecVT,
+                         DAG.getConstant(1, DL, MVT::i32));
+  }
+
[Inline comment] cameron.mcinally (Author): I'm not sure if `getVScaleForTuning` is the right way to go here, but it seemed like the cleanest solution. I also wonder if all `insert_subvector(undef, splat(X), 0)->splat(X)` should be canonicalized here. I don't have a strong opinion on it though.
   // Only do this for legal fixed vector types.
   if (!VecVT.isFixedLengthVector() ||
       !DAG.getTargetLoweringInfo().isTypeLegal(VecVT) ||
       !DAG.getTargetLoweringInfo().isTypeLegal(SubVT))
     return SDValue();

   // Ignore widening patterns.
   if (IdxVal == 0 && Vec.isUndef())
[36 lines not shown] static SDValue tryCombineFixedPointConvert(SDNode *N,
   // The second form interacts better with instruction selection and the
   // register allocator to avoid cross-class register copies that aren't
   // coalescable due to a lane reference.

   // Check the operand and see if it originates from a lane extract.
   SDValue Op1 = N->getOperand(1);
   if (Op1.getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
     // Yep, no additional predication needed. Perform the transform.
     SDValue IID = N->getOperand(0);
[Lint: Pre-merge checks] clang-format: please reformat the code
     SDValue Shift = N->getOperand(2);
     SDValue Vec = Op1.getOperand(0);
     SDValue Lane = Op1.getOperand(1);
     EVT ResTy = N->getValueType(0);
[Lint: Pre-merge checks] clang-format: please reformat the code
	- isPow2Splat(SubVec, SplatVal, Negated) &&
	- SplatVal == 1 && !Negated &&
	+ isPow2Splat(SubVec, SplatVal, Negated) && SplatVal == 1 && !Negated &&
     EVT VecResTy;
     SDLoc DL(N);

     // The vector width should be 128 bits by the time we get here, even
     // if it started as 64 bits (the extract_vector handling will have
     // done so).
     assert(Vec.getValueSizeInBits() == 128 &&
            "unexpected vector size on extract_vector_elt!");
[2,522 lines not shown]

 // Optimize some simple tbz/tbnz cases. Returns the new operand and bit to test
 // as well as whether the test should be inverted. This code is required to
 // catch these cases (as opposed to standard dag combines) because
 // AArch64ISD::TBZ is matched during legalization.
 static SDValue getTestBitOperand(SDValue Op, unsigned &Bit, bool &Invert,
                                  SelectionDAG &DAG) {

   if (!Op->hasOneUse())
[Inline comment] david-arm: I think you also need to check for `&& !Negated` here. Alternatively, I think you could just do: APInt SplatVal; if (isAllActivePredicate(DAG, Pred) && LHS.getOpcode() == ISD::AND && ISD::isConstantSplatVector(LHS.getOperand(1), SplatVal) && SplatVal == 1) { The only additional value that `isPow2Splat` adds here is that it also checks for AArch64ISD::DUP nodes, but I imagine at the point we're doing the DAG combines here we haven't generated an AArch64 ISD node yet?
     return Op;

   // We don't handle undef/constant-fold cases below, as they should have
   // already been taken care of (e.g. and of 0, test of undefined shifted bits,
   // etc.)

[Inline comment] david-arm: It feels like we should have a more basic DAG combine here, i.e. something in either `DAGCombiner::visitINSERT_SUBVECTOR` or in `performInsertSubvectorCombine`, that basically combines <vscale x M x iXY> insert_subvector <vscale x M x iXY> undef, <N x iXY> <splat of iXY A> into <vscale x M x iXY> <splat of iXY A> when we know that vscale x M == N. If you implement such a DAG combine it might benefit other parts of the code too? It would then mean here you should only have to check if `Trunc` is a splat of 1, which may also catch more cases.
   // (tbz (trunc x), b) -> (tbz x, b)
   // This case is just here to enable more of the below cases to be caught.
[Inline comment] david-arm: Again, here I think you need to check `&& !TruncNegated`
   if (Op->getOpcode() == ISD::TRUNCATE &&
       Bit < Op->getValueType(0).getSizeInBits()) {
     return getTestBitOperand(Op->getOperand(0), Bit, Invert, DAG);
   }

   // (tbz (any_ext x), b) -> (tbz x, b) if we don't use the extended bits.
   if (Op->getOpcode() == ISD::ANY_EXTEND &&
       Bit < Op->getOperand(0).getValueSizeInBits()) {
[883 lines not shown] SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::ZERO_EXTEND:
   case ISD::SIGN_EXTEND:
     return performExtendCombine(N, DCI, DAG);
   case ISD::SIGN_EXTEND_INREG:
     return performSignExtendInRegCombine(N, DCI, DAG);
   case ISD::CONCAT_VECTORS:
     return performConcatVectorsCombine(N, DCI, DAG);
   case ISD::INSERT_SUBVECTOR:
-    return performInsertSubvectorCombine(N, DCI, DAG);
+    return performInsertSubvectorCombine(N, DCI, DAG, Subtarget);
   case ISD::SELECT:
     return performSelectCombine(N, DCI);
   case ISD::VSELECT:
     return performVSelectCombine(N, DCI.DAG);
   case ISD::SETCC:
     return performSETCCCombine(N, DAG);
   case ISD::LOAD:
     if (performTBISimplification(N->getOperand(1), DCI, DAG))
[2,307 lines not shown]

llvm/test/CodeGen/AArch64/sve-fixed-length-ptrue.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				define <vscale x 2 x i1> @ptest_v8i1() #0 {
				; CHECK-LABEL: ptest_v8i1:
				; CHECK: // %bb.0: // %L.entry
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: ret
				L.entry:
				%0 = call <vscale x 2 x i1> @llvm.experimental.vector.insert.nxv2i1.v8i1 (<vscale x 2 x i1> undef, <8 x i1> <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, i64 0)
				ret <vscale x 2 x i1> %0
				}

				define <vscale x 4 x i1> @ptest_v16i1() #0 {
				; CHECK-LABEL: ptest_v16i1:
				; CHECK: // %bb.0: // %L.entry
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ret
				L.entry:
				%0 = call <vscale x 4 x i1> @llvm.experimental.vector.insert.nxv4i1.v16i1 (<vscale x 4 x i1> undef, <16 x i1> <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, i64 0)
				ret <vscale x 4 x i1> %0
				}

				define <vscale x 8 x i1> @ptest_v32i1() #0 {
				; CHECK-LABEL: ptest_v32i1:
				; CHECK: // %bb.0: // %L.entry
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: ret
				L.entry:
				%0 = call <vscale x 8 x i1> @llvm.experimental.vector.insert.nxv8i1.v32i1 (<vscale x 8 x i1> undef, <32 x i1> <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, i64 0)
				ret <vscale x 8 x i1> %0
				}

				define <vscale x 16 x i1> @ptest_v64i1() #0 {
				; CHECK-LABEL: ptest_v64i1:
				; CHECK: // %bb.0: // %L.entry
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: ret
				L.entry:
				%0 = call <vscale x 16 x i1> @llvm.experimental.vector.insert.nxv16i1.v64i1 (<vscale x 16 x i1> undef, <64 x i1> <i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1, i1 1>, i64 0)
				ret <vscale x 16 x i1> %0
				}

				declare <vscale x 2 x i1> @llvm.experimental.vector.insert.nxv2i1.v8i1(<vscale x 2 x i1>, <8 x i1>, i64)
				declare <vscale x 4 x i1> @llvm.experimental.vector.insert.nxv4i1.v16i1(<vscale x 4 x i1>, <16 x i1>, i64)
				declare <vscale x 8 x i1> @llvm.experimental.vector.insert.nxv8i1.v32i1(<vscale x 8 x i1>, <32 x i1>, i64)
				declare <vscale x 16 x i1> @llvm.experimental.vector.insert.nxv16i1.v64i1(<vscale x 16 x i1>, <64 x i1>, i64)

				attributes #0 = { vscale_range(4,4) "target-features"="+sve" "target-cpu"="a64fx" }