shuffle (tbl2, tbl2) can be folded into a single tbl4 if the shuffle selects
the first 8 elements of each tbl2 and the tbl2 masks can be merged.
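To make the merge condition concrete: a tbl2 indexes into 32 table bytes (two 16-byte source registers), while a tbl4 indexes into 64 bytes (four registers), so indices taken from the second tbl2's mask need a +32 offset to address what become the third and fourth table registers. Below is a minimal standalone sketch of that index arithmetic, assuming the shuffle takes the first 8 lanes of each tbl2 result; the array names and mask values are invented for illustration and are not part of the patch.

```cpp
// Standalone illustration of merging two tbl2 index vectors into one tbl4
// index vector, assuming the shuffle selects the first 8 lanes of each tbl2.
// Names and values are hypothetical, not taken from the patch.
#include <array>
#include <cstdint>
#include <cstdio>

int main() {
  // Two hypothetical tbl2 masks (16 lanes, each index in the range 0-31).
  std::array<uint8_t, 16> Tbl2MaskA = {0, 2, 4, 6, 8, 10, 12, 14,
                                       16, 18, 20, 22, 24, 26, 28, 30};
  std::array<uint8_t, 16> Tbl2MaskB = {1, 3, 5, 7, 9, 11, 13, 15,
                                       17, 19, 21, 23, 25, 27, 29, 31};

  // The shuffle picks lanes 0-7 of each tbl2 result, so the merged tbl4 mask
  // is: the first 8 indices of mask A unchanged, then the first 8 indices of
  // mask B shifted by 32 (they now address table registers 3 and 4).
  std::array<uint8_t, 16> Tbl4Mask;
  for (int I = 0; I < 8; ++I) {
    Tbl4Mask[I] = Tbl2MaskA[I];
    Tbl4Mask[I + 8] = static_cast<uint8_t>(Tbl2MaskB[I] + 32);
  }

  for (uint8_t Idx : Tbl4Mask)
    printf("%u ", static_cast<unsigned>(Idx));
  printf("\n");
  return 0;
}
```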
Event Timeline
At first glance this seems like a hyper-specific optimization; I take it there's some reasonably common idiom that motivates us to even bother?
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10702–10703 | Why do we care about this? It looks like we've already checked that the lanes being filled here are discarded by the shuffle. |
| 10707 | Won't this overflow if it's a tbl2 producing an <8 x i8>? |
| 10716 | Maybe default fill with SDValue()? We just overwrite all of them immediately afterwards anyway, so that'd signal early that the reader doesn't have to care about this line. |
| 10719 | Have we checked anywhere that the lower 8 operands are actually constant? |
Address latest comments, thanks!
It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10702–10703 | Good point, I removed the value checks and re-purposed the helper to check whether the first 8 elements are constants (see the sketch after this table). |
| 10707 | Yes, added extra tests in 9f2c39418b85 and limited this to v16i8 for now, to be extended further in a follow-up. |
| 10716 | Done, thanks! |
| 10719 | That's checked in the latest version and I added extra test cases in 9f2c39418b85. |
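For reference, the reworked check described above might look roughly like the following; the function name is hypothetical and the actual code in the patch may be organized differently:

```cpp
// Hypothetical sketch: return true if the first 8 elements of a tbl2 index
// vector (a BUILD_VECTOR node) are compile-time constants.
static bool areFirstEightMaskElementsConstant(SDValue Mask) {
  for (unsigned I = 0; I < 8; ++I)
    if (!isa<ConstantSDNode>(Mask.getOperand(I)))
      return false;
  return true;
}
```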
> It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.
I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.
Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

```cpp
if (ShuffleMask[I] < 16)
  TblMaskParts[I] = Mask1->getOperand(ShuffleMask[I]);
else {
  // Indices coming from the second tbl2 refer to its two source registers,
  // which become the third and fourth table registers of the tbl4, so shift
  // them up by 32 bytes.
  auto *C = cast<ConstantSDNode>(Mask2->getOperand(ShuffleMask[I] - 16));
  TblMaskParts[I] = DAG.getConstant(C->getSExtValue() + 32, dl, MVT::i32);
}
```
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10706 | This doesn't appear to capture anything, so maybe consider making it a static function instead? The difference from isExtractLowerHalfMask in the same patch seems a bit unmotivated. Not insisting on one style or the other (or even consistency), just making sure it's an active decision. |
| llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | |
|---|---|
| 10709 | Would also need to extend this check. Or perhaps better, only check the indices that are actually used by the shuffle, i.e. move this into the main loop later (rough sketch below). |
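One way to do that, sketched with the names from the snippet earlier in this review (not code from the patch), would be to test each mask element for constancy right where the shuffle consumes it:

```cpp
for (unsigned I = 0; I < 16; ++I) {
  // Pick the tbl2 mask element that the shuffle actually reads for lane I.
  SDValue MaskElt = ShuffleMask[I] < 16
                        ? Mask1->getOperand(ShuffleMask[I])
                        : Mask2->getOperand(ShuffleMask[I] - 16);
  // Give up on the combine if the used index is not a constant.
  if (!isa<ConstantSDNode>(MaskElt))
    return SDValue();
  // ... build the tbl4 index for lane I as in the earlier snippet ...
}
```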