Here's a patch to match VLS all-1s masks to PTRUE. There were a few places, before and after legalization, where this could be done, but I think SETCC_MERGE_ZERO combines are the best fit.
I may be misunderstanding, but this pattern is just a legalized truncate, e.g. ({1,1,1,1} & splat(1)) != splat(0). The VLS->VLA transition is creating a bunch of extra nodes that need to be matched.
Fix formatting for the Lint bots.
Notice the "sign_extend" -> "sext" change to fit in 80 columns. This no longer matches the ISD node naming scheme, so it's a little weird. I didn't see a better fix for it though.
Hi @cameron.mcinally, I really like what you're trying to do in this patch and the codegen indeed looks a lot better! I just had some suggestions about a possibly simpler, and more comprehensive approach that might give us more overall benefit.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Line 17030: I think you also need to check for `&& !Negated` here. Alternatively, I think you could just do:

    APInt SplatVal;
    if (isAllActivePredicate(DAG, Pred) && LHS.getOpcode() == ISD::AND &&
        ISD::isConstantSplatVector(LHS.getOperand(1), SplatVal) && SplatVal == 1) {

The only additional value that isPow2Splat adds here is that it also checks for AArch64ISD::DUP nodes, but I imagine at the point we're doing the DAG combines here we haven't generated an AArch64 ISD node yet? (A fleshed-out sketch of this check follows these comments.)

Line 17036: It feels like we should have a more basic DAG combine here, i.e. something in either DAGCombiner::visitINSERT_SUBVECTOR or in performInsertSubvectorCombine, that basically combines

    <vscale x M x iXY> insert_subvector <vscale x M x iXY> undef, <N x iXY> <splat of iXY A>

into

    <vscale x M x iXY> <splat of iXY A>

when we know that vscale x M == N. If you implement such a DAG combine it might benefit other parts of the code too. It would then mean here you should only have to check if Trunc is a splat of 1, which may also catch more cases.

Line 17038: Again, here I think you need to check `&& !TruncNegated`.
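To make the suggestion at line 17030 concrete, here is a minimal sketch of what such a check could look like inside a SETCC_MERGE_ZERO combine. It assumes the usual AArch64ISelLowering.cpp context, including the existing isAllActivePredicate and getPTrue static helpers; the function name is made up, and this is an illustration of the idea rather than the code in this diff:

    // Sketch only: fold
    //   setcc_merge_zero(all-active, (and splat(1), splat(1)), splat(0), setne)
    // into an all-active PTRUE.
    static SDValue combineAllOnesMaskToPTrue(SDNode *N, SelectionDAG &DAG) {
      assert(N->getOpcode() == AArch64ISD::SETCC_MERGE_ZERO);
      SDValue Pred = N->getOperand(0);
      SDValue LHS = N->getOperand(1);
      SDValue RHS = N->getOperand(2);
      ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(3))->get();
      EVT VT = N->getValueType(0);
      SDLoc DL(N);

      APInt SplatVal;
      if (CC == ISD::SETNE && isAllActivePredicate(DAG, Pred) &&
          ISD::isConstantSplatVectorAllZeros(RHS.getNode()) &&
          LHS.getOpcode() == ISD::AND &&
          ISD::isConstantSplatVector(LHS.getOperand(1).getNode(), SplatVal) &&
          SplatVal == 1) {
        // If the value being masked is itself a splat of 1 (the all-ones VLS
        // mask after an insert_subvector->splat fold), every lane compares
        // not-equal to zero, so the whole node is just an all-active ptrue.
        APInt MaskVal;
        if (ISD::isConstantSplatVector(LHS.getOperand(0).getNode(), MaskVal) &&
            MaskVal == 1)
          return getPTrue(DAG, DL, VT, AArch64SVEPredPattern::all);
      }
      return SDValue();
    }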
Hi @cameron.mcinally, sorry I mentioned @craig.topper in my previous comment, but I meant you. It's because I've also just reviewed one of Craig's patches so I got mixed up. :)
Does D120328 achieve the effect you're after @cameron.mcinally? I need to pull out and extend the LowerSPLAT_VECTOR related change but figured I'd push up my current work in case it helps.
Updated patch based on @david-arm's review.
@paulwalker-arm, D120328 looks good too. I'm happy to go with that one. But I notice that it will define bits that may otherwise be undef, e.g. when we're inserting a 1/4-width vector into a full-width vector, or when Idx != 0.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Line 14457: I'm not sure if getVScaleForTuning is the right way to go here, but it seemed like the cleanest solution. I also wonder if all insert_subvector(undef, splat(X), 0) -> splat(X) should be canonicalized here. I don't have a strong opinion on it though. (A sketch of this combine follows below.)
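For reference, a rough sketch of the insert_subvector(undef, splat(X), 0) -> splat(X) fold being discussed, assuming it sits alongside performInsertSubvectorCombine and that getVScaleForTuning accurately describes the runtime vector length (the caveat raised in the reply below); the function name is invented for illustration:

    // Sketch only: turn an insert of a fixed-length splat into an undef
    // scalable vector at index 0 into a scalable splat, provided the fixed
    // vector is believed to cover every lane of the scalable type.
    static SDValue combineInsertSubvectorOfSplat(SDNode *N, SelectionDAG &DAG,
                                                 const AArch64Subtarget *Subtarget) {
      SDValue Vec = N->getOperand(0);
      SDValue SubVec = N->getOperand(1);
      uint64_t Idx = N->getConstantOperandVal(2);
      EVT VT = N->getValueType(0);
      EVT SubVT = SubVec.getValueType();

      if (!VT.isScalableVector() || !SubVT.isFixedLengthVector() ||
          !Vec.isUndef() || Idx != 0)
        return SDValue();

      // Only sound if vscale * MinNumElts really equals the fixed element
      // count; getVScaleForTuning is just a tuning hint, which is the concern
      // raised in this thread.
      if (VT.getVectorMinNumElements() * Subtarget->getVScaleForTuning() !=
          SubVT.getVectorNumElements())
        return SDValue();

      if (SDValue SplatVal = DAG.getSplatValue(SubVec))
        return DAG.getSplatVector(VT, SDLoc(N), SplatVal);

      return SDValue();
    }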
I also wondered about defining bits that may otherwise be undef, but am unsure how much it really matters. I'll see if I can limit the combine (restrict when it fires, or perhaps just handle the constant case) to reduce any potential downsides and report back. I'll also note that I believe this patch suffers from the same problem, because getVScaleForTuning() is only a hint. You can use getMinSVEVectorSizeInBits and getMaxSVEVectorSizeInBits to see if the true size is known, but the downside is that you'll then only be optimising the cases where the fixed-length vector is the same size as its scalable equivalent, which means not all vectorised loops will see the benefit.
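A hedged sketch of that alternative guard, with a made-up helper name purely for illustration:

    // Hypothetical helper: the fold is only clearly safe when the exact SVE
    // register size is known (min == max, e.g. -msve-vector-bits=N) and the
    // fixed-length vector spans the whole register, so no undef lanes remain.
    static bool fixedVectorFillsSVERegister(const AArch64Subtarget *Subtarget,
                                            EVT FixedVT) {
      unsigned MinSVESize = Subtarget->getMinSVEVectorSizeInBits();
      unsigned MaxSVESize = Subtarget->getMaxSVEVectorSizeInBits();
      return MinSVESize != 0 && MinSVESize == MaxSVESize &&
             FixedVT.getFixedSizeInBits() == MinSVESize;
    }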
Good point. Replacing the lowered truncates with ptrue sounds like a win in the general case. Abandoning this Diff.