This is an archive of the discontinued LLVM Phabricator instance.

Differential D133491

[AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4.
Closed, Public

Authored by fhahn on Sep 8 2022, 7:04 AM.

Details

Reviewers

dmgreen
t.p.northover
ab
samparker

Commits

rGac434afed8dd: [AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4.

Summary

shuffle (tbl2, tbl2) can be folded into a single tbl4 if the shuffle selects
the first 8 elements of each tbl2 and the tbl2 masks can be merged.
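
To make the summary concrete, here is a minimal sketch of the fold on the values used by the `shuffled_tbl2_to_tbl4` test. It assumes a plain software model of the table lookup (an index past the end of the tables yields 0) and of a two-input byte shuffle. The names `tblModel`, `shuffleModel`, `makeRegister` and `foldMatchesOnExample` are hypothetical and are not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec16 = std::array<std::uint8_t, 16>;

// Software model of a table lookup over 16-byte table registers: each index
// byte selects one byte of the concatenated tables, and an index past the end
// of the tables produces 0.
Vec16 tblModel(const std::vector<Vec16> &tables, const Vec16 &indices) {
  Vec16 result{};
  for (std::size_t lane = 0; lane < 16; ++lane) {
    std::size_t idx = indices[lane];
    if (idx < tables.size() * 16)
      result[lane] = tables[idx / 16][idx % 16];
  }
  return result;
}

// Model of a two-input 16-lane byte shuffle: 0..15 read lhs, 16..31 read rhs.
Vec16 shuffleModel(const Vec16 &lhs, const Vec16 &rhs,
                   const std::array<int, 16> &mask) {
  Vec16 result{};
  for (std::size_t lane = 0; lane < 16; ++lane)
    result[lane] = mask[lane] < 16 ? lhs[mask[lane]] : rhs[mask[lane] - 16];
  return result;
}

// Hypothetical helper: fills a register with distinct, recognizable bytes.
Vec16 makeRegister(std::uint8_t base) {
  Vec16 reg{};
  for (std::size_t i = 0; i < 16; ++i)
    reg[i] = static_cast<std::uint8_t>(base + i);
  return reg;
}

// Returns true if shuffle(tbl2(a, b), tbl2(c, d)) matches a single four-table
// lookup with the merged index vector, for the masks of the test case.
bool foldMatchesOnExample() {
  const Vec16 a = makeRegister(0x10), b = makeRegister(0x30),
              c = makeRegister(0x50), d = makeRegister(0x70);
  const Vec16 tbl2Mask = {0,   4,   8,   12,  16,  20,  24,  28,
                          255, 255, 255, 255, 255, 255, 255, 255};
  const std::array<int, 16> shuffleMask = {0,  1,  2,  3,  4,  5,  6,  7,
                                           16, 17, 18, 19, 20, 21, 22, 23};
  const Vec16 tbl4Mask = {0,  4,  8,  12, 16, 20, 24, 28,
                          32, 36, 40, 44, 48, 52, 56, 60};
  const Vec16 viaTbl2 = shuffleModel(tblModel({a, b}, tbl2Mask),
                                     tblModel({c, d}, tbl2Mask), shuffleMask);
  const Vec16 viaTbl4 = tblModel({a, b, c, d}, tbl4Mask);
  return viaTbl2 == viaTbl4;
}
```

The second half of the merged index vector is the second tbl2 mask offset by 32, because the tables of the second lookup become bytes 32 to 63 of the combined four-register table.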

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Sep 8 2022, 7:04 AM

Herald added a project: Restricted Project. Sep 8 2022, 7:04 AM

Herald added subscribers: hiraditya, kristof.beyls.

fhahn requested review of this revision. Sep 8 2022, 7:04 AM

Harbormaster completed remote builds in B185614: Diff 458726. Sep 8 2022, 7:40 AM

fhahn added a child revision: D120571: [CGP,AArch64] Replace zexts with shuffle that can be lowered using tbl. Sep 14 2022, 8:16 AM

At first glance this seems like a hyper-specific optimization, I take it there's some reasonably common idiom that motivates us even bothering?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10834–10835	Why do we care about this? It looks like we've already checked that lanes being filled by this check are discarded by the shuffle.
10839	Won't this overflow if it's a `tbl2` producing an `<8 x i8>`?
10848	Maybe default fill with `SDValue()`? We just overwrite all of them immediately afterwards anyway so that'd signal early that the reader doesn't have to care about this line.
10851	Have we checked anywhere that the lower 8 operands are actually constant?

fhahn mentioned this in rG9f2c39418b85: [AArch64] Add tests with 2 x tbl2 for v8i8 and nonconst masks. Sep 16 2022, 2:26 AM

Address latest comments, thanks!

In D133491#3791661, @t.p.northover wrote:

At first glance this seems like a hyper-specific optimization, I take it there's some reasonably common idiom that motivates us even bothering?

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.
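
As an illustration of where such index vectors can come from: the pattern 0, 4, 8, ... used in the tests of this revision picks the low byte of every 32-bit element, which amounts to a truncate from i32 to i8 on a little-endian layout. The following is a minimal sketch under that assumption, using a plain software model of the lookup rather than the actual lowering. All names in it (`toLittleEndianBytes`, `lookupModel`, `truncateViaLookup`) are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Table64 = std::array<std::uint8_t, 64>;
using Vec16 = std::array<std::uint8_t, 16>;

// Hypothetical helper: lays out sixteen 32-bit values as 64 little-endian
// bytes, i.e. the contents of four consecutive 128-bit table registers.
Table64 toLittleEndianBytes(const std::array<std::uint32_t, 16> &values) {
  Table64 bytes{};
  for (std::size_t i = 0; i < 16; ++i)
    for (std::size_t b = 0; b < 4; ++b)
      bytes[4 * i + b] = static_cast<std::uint8_t>(values[i] >> (8 * b));
  return bytes;
}

// Byte lookup over a 64-byte table; an out-of-range index yields 0.
Vec16 lookupModel(const Table64 &table, const Vec16 &indices) {
  Vec16 result{};
  for (std::size_t lane = 0; lane < 16; ++lane)
    if (indices[lane] < table.size())
      result[lane] = table[indices[lane]];
  return result;
}

// Truncates sixteen 32-bit values to bytes with one lookup whose index vector
// is 0, 4, 8, ..., 60 (the low byte of each element).
Vec16 truncateViaLookup(const std::array<std::uint32_t, 16> &values) {
  Vec16 indices{};
  for (std::size_t lane = 0; lane < 16; ++lane)
    indices[lane] = static_cast<std::uint8_t>(4 * lane);
  return lookupModel(toLittleEndianBytes(values), indices);
}
```

In this model a single four-table lookup covers all sixteen elements, whereas two separate two-table lookups each cover eight and need a further shuffle to combine the halves.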

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10834–10835	Good point, I removed the value checks and re-purposed the helper to check if the first 8 elements are constants.
10839	Yes, added extra tests in 9f2c39418b85 and limit this to v16i8 for now, to be extended further in a follow-up.
10848	Done, thanks!
10851	That's checked in the latest version and I added extra test cases in 9f2c39418b85.

Harbormaster completed remote builds in B187093: Diff 460683. Sep 16 2022, 2:32 AM

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.

I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.

Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

if (ShuffleMask[I] < 16)
  TblMaskParts[I] = Mask1->getOperand(ShuffleMask[I]);
else {
  auto *C = cast<ConstantSDNode>(Mask2->getOperand(ShuffleMask[I] - 16));
  TblMaskParts[I] = DAG.getConstant(C->getSExtValue() + 32, dl, MVT::i32);
}
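
The index mapping suggested above can be tried outside the compiler. The following is a minimal sketch for the all-constant case, assuming the masks are plain byte arrays instead of DAG nodes. `mergeConstantTbl2Masks`, `mergeMixedShuffleExample` and `kOutOfRange` are hypothetical names. Treating negative shuffle entries as undefined lanes and clamping indices of 32 or more to 0xff are choices made in this sketch, not something taken from the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ByteMask = std::array<std::uint8_t, 16>;

// Index that is out of range for a four-register table, so the lookup yields
// 0 for that lane.
constexpr std::uint8_t kOutOfRange = 0xff;

// Merges two constant tbl2 index vectors into one tbl4 index vector for a
// given shuffle mask. Lanes taken from the first tbl2 keep their index; lanes
// taken from the second tbl2 are rebased by 32 because its two table
// registers become registers 2 and 3 of the tbl4.
ByteMask mergeConstantTbl2Masks(const ByteMask &mask1, const ByteMask &mask2,
                                const std::array<int, 16> &shuffleMask) {
  ByteMask merged{};
  for (std::size_t lane = 0; lane < 16; ++lane) {
    const int sel = shuffleMask[lane];
    if (sel < 0) { // Assumption of this sketch: negative means undefined lane.
      merged[lane] = kOutOfRange;
      continue;
    }
    const bool fromFirst = sel < 16;
    const std::uint8_t idx = fromFirst ? mask1[sel] : mask2[sel - 16];
    if (idx >= 32) // A two-table lookup would have produced 0 here.
      merged[lane] = kOutOfRange;
    else
      merged[lane] = static_cast<std::uint8_t>(fromFirst ? idx : idx + 32);
  }
  return merged;
}

// Applies the merge to the values of the shuffled_tbl2_to_tbl4_mixed_shuffle
// test, where lane 2 of the shuffle reads from the second tbl2.
ByteMask mergeMixedShuffleExample() {
  const ByteMask tbl2Mask = {0,   4,   8,   12,  16,  20,  24,  28,
                             255, 255, 255, 255, 255, 255, 255, 255};
  const std::array<int, 16> shuffleMask = {0,  1,  21, 3,  4,  5,  6,  7,
                                           16, 17, 18, 19, 20, 21, 22, 23};
  return mergeConstantTbl2Masks(tbl2Mask, tbl2Mask, shuffleMask);
}
```

For these inputs, lane 2 reads entry 5 of the second mask (20) and becomes 52, which agrees with the `.byte 52` entry of `.LCPI14_0` in the updated test. The clamp to 0xff keeps the merged lookup at 0 for lanes whose original index was out of range; the code in the diff instead adds 32 to the sign-extended constant.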

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10838	This doesn't appear to capture anything, so maybe consider making it a static function instead? The difference from `isExtractLowerHalfMask` in the same patch seems a bit unmotivated. Not insisting on one style or the other (or even consistency), just making sure it's an active decision.

t.p.northover added inline comments. Sep 20 2022, 1:46 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10841	Would also need to extend this check. Or perhaps better only check the indices that are actually used by the shuffle, so move this into the main loop later.

In D133491#3802133, @t.p.northover wrote:

It is quite specific but this kind of pattern can be produced by loop-vectorization, in combination with the recent changes using tbl instructions for extends/truncates.

I think it's more restrictive than it needs to be, and we should drop the isExtractLowerHalfMask check entirely. Any constant shuffle of two constant tbl2 operations ought to be representable with a constant tbl4, and that check's only there because our index-mapping is too naive.

Instead I think we want something like this in the loop where we generate the new tbl indices (now running through all 16 lanes):

Thanks Tim! Generalized as suggested.

fhahn marked 2 inline comments as done. Sep 21 2022, 6:29 AM

fhahn added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
10838	This is not needed any longer, because it is checked in the loop below now.
10841	Done in the loop below, thanks!

Harbormaster completed remote builds in B187955: Diff 461877. Sep 21 2022, 6:29 AM

Thanks Florian. LGTM!

This revision is now accepted and ready to land. Sep 21 2022, 9:16 AM

This revision was landed with ongoing or failed builds. Sep 21 2022, 11:16 AM

Closed by commit rGac434afed8dd: [AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn marked 2 inline comments as done.

fhahn added a commit: rGac434afed8dd: [AArch64] Try to fold shuffle (tbl2, tbl2) to tbl4..

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

48 lines

test/

CodeGen/

AArch64/

arm64-tbl.ll

287 lines

fp-conversion-to-tbl.ll

153 lines

Diff 461958

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

if (DAG.getTargetLoweringInfo().isTypeLegal(NewVT)) {		if (DAG.getTargetLoweringInfo().isTypeLegal(NewVT)) {
return DAG.getBitcast(VT,		return DAG.getBitcast(VT,
DAG.getVectorShuffle(NewVT, DL, V0, V1, NewMask));		DAG.getVectorShuffle(NewVT, DL, V0, V1, NewMask));
}		}
}		}

return SDValue();		return SDValue();
}		}

		// Try to fold shuffle (tbl2, tbl2) into a single tbl4.
		static SDValue tryToConvertShuffleOfTbl2ToTbl4(SDValue Op,
		                                               ArrayRef<int> ShuffleMask,
		                                               SelectionDAG &DAG) {
		  SDValue Tbl1 = Op->getOperand(0);
		  SDValue Tbl2 = Op->getOperand(1);
		  SDLoc dl(Op);
		  SDValue Tbl2ID =
		      DAG.getTargetConstant(Intrinsic::aarch64_neon_tbl2, dl, MVT::i64);

		  EVT VT = Op.getValueType();
		  if (Tbl1->getOpcode() != ISD::INTRINSIC_WO_CHAIN ||
		      Tbl1->getOperand(0) != Tbl2ID ||
		      Tbl2->getOpcode() != ISD::INTRINSIC_WO_CHAIN ||
		      Tbl2->getOperand(0) != Tbl2ID)
		    return SDValue();

		  if (Tbl1->getValueType(0) != MVT::v16i8 ||
		      Tbl2->getValueType(0) != MVT::v16i8)
		    return SDValue();

		  SDValue Mask1 = Tbl1->getOperand(3);
		  SDValue Mask2 = Tbl2->getOperand(3);
		  SmallVector<SDValue, 16> TBLMaskParts(16, SDValue());
		  for (unsigned I = 0; I < 16; I++) {
		    if (ShuffleMask[I] < 16)
		      TBLMaskParts[I] = Mask1->getOperand(ShuffleMask[I]);
		    else {
		      auto *C =
		          dyn_cast<ConstantSDNode>(Mask2->getOperand(ShuffleMask[I] - 16));
		      if (!C)
		        return SDValue();
		      TBLMaskParts[I] = DAG.getConstant(C->getSExtValue() + 32, dl, MVT::i32);
		    }
		  }

		  SDValue TBLMask = DAG.getBuildVector(VT, dl, TBLMaskParts);
		  SDValue ID =
		      DAG.getTargetConstant(Intrinsic::aarch64_neon_tbl4, dl, MVT::i64);

		  return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, dl, MVT::v16i8,
		                     {ID, Tbl1->getOperand(1), Tbl1->getOperand(2),
		                      Tbl2->getOperand(1), Tbl2->getOperand(2), TBLMask});
		}

SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,		SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
SDLoc dl(Op);		SDLoc dl(Op);
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();

ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());		ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());

if (useSVEForFixedLengthVectorVT(VT))		if (useSVEForFixedLengthVectorVT(VT))
return LowerFixedLengthVECTOR_SHUFFLEToSVE(Op, DAG);		return LowerFixedLengthVECTOR_SHUFFLEToSVE(Op, DAG);

// Convert shuffles that are directly supported on NEON to target-specific		// Convert shuffles that are directly supported on NEON to target-specific
// DAG nodes, instead of keeping them as shuffles and matching them again		// DAG nodes, instead of keeping them as shuffles and matching them again
// during code selection. This is more efficient and avoids the possibility		// during code selection. This is more efficient and avoids the possibility
// of inconsistencies between legalization and selection.		// of inconsistencies between legalization and selection.
ArrayRef<int> ShuffleMask = SVN->getMask();		ArrayRef<int> ShuffleMask = SVN->getMask();

SDValue V1 = Op.getOperand(0);		SDValue V1 = Op.getOperand(0);
SDValue V2 = Op.getOperand(1);		SDValue V2 = Op.getOperand(1);

assert(V1.getValueType() == VT && "Unexpected VECTOR_SHUFFLE type!");		assert(V1.getValueType() == VT && "Unexpected VECTOR_SHUFFLE type!");
assert(ShuffleMask.size() == VT.getVectorNumElements() &&		assert(ShuffleMask.size() == VT.getVectorNumElements() &&
"Unexpected VECTOR_SHUFFLE mask size!");		"Unexpected VECTOR_SHUFFLE mask size!");

		if (SDValue Res = tryToConvertShuffleOfTbl2ToTbl4(Op, ShuffleMask, DAG))
		return Res;

if (SVN->isSplat()) {		if (SVN->isSplat()) {
int Lane = SVN->getSplatIndex();		int Lane = SVN->getSplatIndex();
// If this is undef splat, generate it via "just" vdup, if possible.		// If this is undef splat, generate it via "just" vdup, if possible.
if (Lane == -1)		if (Lane == -1)
Lane = 0;		Lane = 0;

if (Lane == 0 && V1.getOpcode() == ISD::SCALAR_TO_VECTOR)		if (Lane == 0 && V1.getOpcode() == ISD::SCALAR_TO_VECTOR)
return DAG.getNode(AArch64ISD::DUP, dl, V1.getValueType(),		return DAG.getNode(AArch64ISD::DUP, dl, V1.getValueType(),

llvm/test/CodeGen/AArch64/arm64-tbl.ll

; CHECK-NEXT: .byte 0 // 0x0		; CHECK-NEXT: .byte 0 // 0x0
; CHECK-NEXT: .byte 4 // 0x4		; CHECK-NEXT: .byte 4 // 0x4
; CHECK-NEXT: .byte 8 // 0x8		; CHECK-NEXT: .byte 8 // 0x8
; CHECK-NEXT: .byte 12 // 0xc		; CHECK-NEXT: .byte 12 // 0xc
; CHECK-NEXT: .byte 16 // 0x10		; CHECK-NEXT: .byte 16 // 0x10
; CHECK-NEXT: .byte 20 // 0x14		; CHECK-NEXT: .byte 20 // 0x14
; CHECK-NEXT: .byte 24 // 0x18		; CHECK-NEXT: .byte 24 // 0x18
; CHECK-NEXT: .byte 28 // 0x1c		; CHECK-NEXT: .byte 28 // 0x1c
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 32 // 0x20
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 36 // 0x24
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 40 // 0x28
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 44 // 0x2c
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 48 // 0x30
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 52 // 0x34
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 56 // 0x38
; CHECK-NEXT: .byte 255 // 0xff		; CHECK-NEXT: .byte 60 // 0x3c

define <16 x i8> @shuffled_tbl2_to_tbl4(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {		define <16 x i8> @shuffled_tbl2_to_tbl4(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {
; CHECK-LABEL: shuffled_tbl2_to_tbl4:		; CHECK-LABEL: shuffled_tbl2_to_tbl4:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI9_0		; CHECK-NEXT: adrp x8, .LCPI9_0
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3
; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI9_0]		; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI9_0]
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v4		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: tbl.16b v1, { v2, v3 }, v4		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
; CHECK-NEXT: mov.d v0[1], v1[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}

define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_first_mask(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {		define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_first_mask(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {
; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_first_mask:		; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_first_mask:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi.2d v4, #0xffffffffffffffff		; CHECK-NEXT: fmov s4, w0
; CHECK-NEXT: adrp x8, .LCPI10_0		; CHECK-NEXT: mov w8, #32
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ldr q5, [x8, :lo12:.LCPI10_0]
; CHECK-NEXT: mov.b v4[0], w0
; CHECK-NEXT: tbl.16b v2, { v2, v3 }, v5
; CHECK-NEXT: mov.b v4[1], w0		; CHECK-NEXT: mov.b v4[1], w0
		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: mov.b v4[2], w0		; CHECK-NEXT: mov.b v4[2], w0
; CHECK-NEXT: mov.b v4[3], w0		; CHECK-NEXT: mov.b v4[3], w0
; CHECK-NEXT: mov.b v4[4], w0		; CHECK-NEXT: mov.b v4[4], w0
; CHECK-NEXT: mov.b v4[5], w0		; CHECK-NEXT: mov.b v4[5], w0
; CHECK-NEXT: mov.b v4[6], w0		; CHECK-NEXT: mov.b v4[6], w0
; CHECK-NEXT: mov.b v4[7], w0		; CHECK-NEXT: mov.b v4[7], w0
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v4		; CHECK-NEXT: mov.b v4[8], w8
; CHECK-NEXT: mov.d v0[1], v2[0]		; CHECK-NEXT: mov w8, #36
		; CHECK-NEXT: mov.b v4[9], w8
		; CHECK-NEXT: mov w8, #40
		; CHECK-NEXT: mov.b v4[10], w8
		; CHECK-NEXT: mov w8, #44
		; CHECK-NEXT: mov.b v4[11], w8
		; CHECK-NEXT: mov w8, #48
		; CHECK-NEXT: mov.b v4[12], w8
		; CHECK-NEXT: mov w8, #52
		; CHECK-NEXT: mov.b v4[13], w8
		; CHECK-NEXT: mov w8, #56
		; CHECK-NEXT: mov.b v4[14], w8
		; CHECK-NEXT: mov w8, #60
		; CHECK-NEXT: mov.b v4[15], w8
		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%ins.0 = insertelement <16 x i8> poison, i8 %v, i32 0		%ins.0 = insertelement <16 x i8> poison, i8 %v, i32 0
%ins.1 = insertelement <16 x i8> %ins.0, i8 %v, i32 1		%ins.1 = insertelement <16 x i8> %ins.0, i8 %v, i32 1
%ins.2 = insertelement <16 x i8> %ins.1, i8 %v, i32 2		%ins.2 = insertelement <16 x i8> %ins.1, i8 %v, i32 2
%ins.3 = insertelement <16 x i8> %ins.2, i8 %v, i32 3		%ins.3 = insertelement <16 x i8> %ins.2, i8 %v, i32 3
%ins.4 = insertelement <16 x i8> %ins.3, i8 %v, i32 4		%ins.4 = insertelement <16 x i8> %ins.3, i8 %v, i32 4
%ins.5 = insertelement <16 x i8> %ins.4, i8 %v, i32 5		%ins.5 = insertelement <16 x i8> %ins.4, i8 %v, i32 5
%ins.6 = insertelement <16 x i8> %ins.5, i8 %v, i32 6		%ins.6 = insertelement <16 x i8> %ins.5, i8 %v, i32 6
%ins.7 = insertelement <16 x i8> %ins.6, i8 %v, i32 7		%ins.7 = insertelement <16 x i8> %ins.6, i8 %v, i32 7
%ins.8 = insertelement <16 x i8> %ins.7, i8 -1, i32 8		%ins.8 = insertelement <16 x i8> %ins.7, i8 -1, i32 8
%ins.9 = insertelement <16 x i8> %ins.8, i8 -1, i32 9		%ins.9 = insertelement <16 x i8> %ins.8, i8 -1, i32 9
%ins.10 = insertelement <16 x i8> %ins.9, i8 -1, i32 10		%ins.10 = insertelement <16 x i8> %ins.9, i8 -1, i32 10
%ins.11 = insertelement <16 x i8> %ins.10, i8 -1, i32 11		%ins.11 = insertelement <16 x i8> %ins.10, i8 -1, i32 11
%ins.12 = insertelement <16 x i8> %ins.11, i8 -1, i32 12		%ins.12 = insertelement <16 x i8> %ins.11, i8 -1, i32 12
%ins.13 = insertelement <16 x i8> %ins.12, i8 -1, i32 13		%ins.13 = insertelement <16 x i8> %ins.12, i8 -1, i32 13
%ins.14 = insertelement <16 x i8> %ins.13, i8 -1, i32 14		%ins.14 = insertelement <16 x i8> %ins.13, i8 -1, i32 14
%ins.15 = insertelement <16 x i8> %ins.14, i8 -1, i32 15		%ins.15 = insertelement <16 x i8> %ins.14, i8 -1, i32 15
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}

		define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_first_mask2(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {
		; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_first_mask2:
		; CHECK: // %bb.0:
		; CHECK-NEXT: mov w8, #1
		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: fmov s4, w8
		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: mov.b v4[1], w8
		; CHECK-NEXT: mov.b v4[2], w8
		; CHECK-NEXT: mov.b v4[3], w8
		; CHECK-NEXT: mov.b v4[4], w8
		; CHECK-NEXT: mov.b v4[5], w8
		; CHECK-NEXT: mov.b v4[6], w8
		; CHECK-NEXT: mov w8, #32
		; CHECK-NEXT: mov.b v4[7], w0
		; CHECK-NEXT: mov.b v4[8], w8
		; CHECK-NEXT: mov w8, #36
		; CHECK-NEXT: mov.b v4[9], w8
		; CHECK-NEXT: mov w8, #40
		; CHECK-NEXT: mov.b v4[10], w8
		; CHECK-NEXT: mov w8, #44
		; CHECK-NEXT: mov.b v4[11], w8
		; CHECK-NEXT: mov w8, #48
		; CHECK-NEXT: mov.b v4[12], w8
		; CHECK-NEXT: mov w8, #52
		; CHECK-NEXT: mov.b v4[13], w8
		; CHECK-NEXT: mov w8, #56
		; CHECK-NEXT: mov.b v4[14], w8
		; CHECK-NEXT: mov w8, #31
		; CHECK-NEXT: mov.b v4[15], w8
		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
		; CHECK-NEXT: ret
		%ins.0 = insertelement <16 x i8> poison, i8 1, i32 0
		%ins.1 = insertelement <16 x i8> %ins.0, i8 1, i32 1
		%ins.2 = insertelement <16 x i8> %ins.1, i8 1, i32 2
		%ins.3 = insertelement <16 x i8> %ins.2, i8 1, i32 3
		%ins.4 = insertelement <16 x i8> %ins.3, i8 1, i32 4
		%ins.5 = insertelement <16 x i8> %ins.4, i8 1, i32 5
		%ins.6 = insertelement <16 x i8> %ins.5, i8 1, i32 6
		%ins.7 = insertelement <16 x i8> %ins.6, i8 1, i32 7
		%ins.8 = insertelement <16 x i8> %ins.7, i8 -1, i32 8
		%ins.9 = insertelement <16 x i8> %ins.8, i8 -1, i32 9
		%ins.10 = insertelement <16 x i8> %ins.9, i8 -1, i32 10
		%ins.11 = insertelement <16 x i8> %ins.10, i8 -1, i32 11
		%ins.12 = insertelement <16 x i8> %ins.11, i8 %v, i32 12
		%ins.13 = insertelement <16 x i8> %ins.12, i8 %v, i32 13
		%ins.14 = insertelement <16 x i8> %ins.13, i8 -1, i32 14
		%ins.15 = insertelement <16 x i8> %ins.14, i8 %v, i32 15
		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)
		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 31>
		ret <16 x i8> %s
		}

define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_second_mask(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {		define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_second_mask(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {
; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_second_mask:		; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_second_mask:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi.2d v4, #0xffffffffffffffff		; CHECK-NEXT: movi.2d v4, #0xffffffffffffffff
; CHECK-NEXT: adrp x8, .LCPI11_0		; CHECK-NEXT: adrp x8, .LCPI12_0
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ldr q5, [x8, :lo12:.LCPI11_0]		; CHECK-NEXT: ldr q5, [x8, :lo12:.LCPI12_0]
; CHECK-NEXT: mov.b v4[0], w0		; CHECK-NEXT: mov.b v4[0], w0
; CHECK-NEXT: tbl.16b v2, { v2, v3 }, v5		; CHECK-NEXT: tbl.16b v2, { v2, v3 }, v5
; CHECK-NEXT: mov.b v4[1], w0		; CHECK-NEXT: mov.b v4[1], w0
; CHECK-NEXT: mov.b v4[2], w0		; CHECK-NEXT: mov.b v4[2], w0
; CHECK-NEXT: mov.b v4[3], w0		; CHECK-NEXT: mov.b v4[3], w0
; CHECK-NEXT: mov.b v4[4], w0		; CHECK-NEXT: mov.b v4[4], w0
; CHECK-NEXT: mov.b v4[5], w0		; CHECK-NEXT: mov.b v4[5], w0
; CHECK-NEXT: mov.b v4[6], w0		; CHECK-NEXT: mov.b v4[6], w0
Show All 19 Lines	; CHECK-NEXT: ret
%ins.14 = insertelement <16 x i8> %ins.13, i8 -1, i32 14		%ins.14 = insertelement <16 x i8> %ins.13, i8 -1, i32 14
%ins.15 = insertelement <16 x i8> %ins.14, i8 -1, i32 15		%ins.15 = insertelement <16 x i8> %ins.14, i8 -1, i32 15
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}

define <16 x i8> @shuffled_tbl2_to_tbl4_incompatible_shuffle(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {		define <16 x i8> @shuffled_tbl2_to_tbl4_nonconst_second_mask2(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d, i8 %v) {
; CHECK-LABEL: shuffled_tbl2_to_tbl4_incompatible_shuffle:		; CHECK-LABEL: shuffled_tbl2_to_tbl4_nonconst_second_mask2:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI12_0		; CHECK-NEXT: mov w8, #255
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: dup.16b v4, w0
		; CHECK-NEXT: adrp x9, .LCPI13_0
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3
; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI12_0]		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1
; CHECK-NEXT: adrp x8, .LCPI12_1		; CHECK-NEXT: mov.b v4[8], w8
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v4		; CHECK-NEXT: ldr q5, [x9, :lo12:.LCPI13_0]
; CHECK-NEXT: tbl.16b v1, { v2, v3 }, v4		; CHECK-NEXT: mov.b v4[9], w8
; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI12_1]		; CHECK-NEXT: tbl.16b v2, { v2, v3 }, v5
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v2		; CHECK-NEXT: mov.b v4[10], w8
		; CHECK-NEXT: mov.b v4[11], w8
		; CHECK-NEXT: mov.b v4[12], w8
		; CHECK-NEXT: mov.b v4[13], w8
		; CHECK-NEXT: adrp x8, .LCPI13_1
		; CHECK-NEXT: tbl.16b v3, { v0, v1 }, v4
		; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI13_1]
		; CHECK-NEXT: tbl.16b v0, { v2, v3 }, v0
		; CHECK-NEXT: ret
		%ins.0 = insertelement <16 x i8> poison, i8 %v, i32 0
		%ins.1 = insertelement <16 x i8> %ins.0, i8 %v, i32 1
		%ins.2 = insertelement <16 x i8> %ins.1, i8 %v, i32 2
		%ins.3 = insertelement <16 x i8> %ins.2, i8 %v, i32 3
		%ins.4 = insertelement <16 x i8> %ins.3, i8 %v, i32 4
		%ins.5 = insertelement <16 x i8> %ins.4, i8 %v, i32 5
		%ins.6 = insertelement <16 x i8> %ins.5, i8 %v, i32 6
		%ins.7 = insertelement <16 x i8> %ins.6, i8 %v, i32 7
		%ins.8 = insertelement <16 x i8> %ins.7, i8 -1, i32 8
		%ins.9 = insertelement <16 x i8> %ins.8, i8 -1, i32 9
		%ins.10 = insertelement <16 x i8> %ins.9, i8 -1, i32 10
		%ins.11 = insertelement <16 x i8> %ins.10, i8 -1, i32 11
		%ins.12 = insertelement <16 x i8> %ins.11, i8 -1, i32 12
		%ins.13 = insertelement <16 x i8> %ins.12, i8 -1, i32 13
		%ins.14 = insertelement <16 x i8> %ins.13, i8 %v, i32 14
		%ins.15 = insertelement <16 x i8> %ins.14, i8 %v, i32 15
		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> %ins.15)
		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 30, i32 31>
		ret <16 x i8> %s
		}


		; CHECK-LABEL: .LCPI14_0:
		; CHECK-NEXT: .byte 0 // 0x0
		; CHECK-NEXT: .byte 4 // 0x4
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 12 // 0xc
		; CHECK-NEXT: .byte 16 // 0x10
		; CHECK-NEXT: .byte 20 // 0x14
		; CHECK-NEXT: .byte 24 // 0x18
		; CHECK-NEXT: .byte 28 // 0x1c
		; CHECK-NEXT: .byte 32 // 0x20
		; CHECK-NEXT: .byte 36 // 0x24
		; CHECK-NEXT: .byte 40 // 0x28
		; CHECK-NEXT: .byte 44 // 0x2c
		; CHECK-NEXT: .byte 48 // 0x30
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 56 // 0x38
		; CHECK-NEXT: .byte 60 // 0x3c

		define <16 x i8> @shuffled_tbl2_to_tbl4_mixed_shuffle(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {
		; CHECK-LABEL: shuffled_tbl2_to_tbl4_mixed_shuffle:
		; CHECK: // %bb.0:
		; CHECK-NEXT: adrp x8, .LCPI14_0
		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI14_0]
		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}
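
As a rough illustration of what "merging" the masks means for the test above, here is a minimal C++ sketch. It is not the code from AArch64ISelLowering.cpp; `mergeTbl2Masks`, `Mask16` and `Shuffle16` are hypothetical names, and the sketch assumes constant masks and a shuffle whose indices are all in the range 0..31 (no undef lanes). Each output lane takes the tbl2 index the shuffle selects; indices from the second tbl2 are offset by 32 because its two tables become tables 2 and 3 of the tbl4. Indices that were out of range for a tbl2 (32 or above, producing 0) are mapped to 255 so that they stay out of range for the tbl4.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Mask16 = std::array<std::uint8_t, 16>;
using Shuffle16 = std::array<int, 16>;

// Hypothetical helper: compute a single tbl4 mask equivalent to
//   shuffle(tbl2(a, b, mask1), tbl2(c, d, mask2), shuffle)
// Assumes every shuffle index is in [0, 31].
Mask16 mergeTbl2Masks(const Mask16 &mask1, const Mask16 &mask2,
                      const Shuffle16 &shuffle) {
  Mask16 merged{};
  for (std::size_t lane = 0; lane < merged.size(); ++lane) {
    const int sel = shuffle[lane];
    // Shuffle indices 0..15 pick a lane of the first tbl2, 16..31 of the second.
    const bool fromSecond = sel >= 16;
    const std::uint8_t idx = fromSecond ? mask2[sel - 16] : mask1[sel];
    if (idx >= 32) {
      // Out of range for a tbl2 (result byte is 0); keep it out of range
      // for the 64-byte tbl4 table as well.
      merged[lane] = 255;
    } else {
      // The second tbl2's tables {c, d} start at byte 32 of {a, b, c, d}.
      merged[lane] = static_cast<std::uint8_t>(idx + (fromSecond ? 32 : 0));
    }
  }
  return merged;
}
```

Feeding in the two tbl2 masks and the shuffle mask from `shuffled_tbl2_to_tbl4_mixed_shuffle` should reproduce the 16 `.byte` values checked in `.LCPI14_0`, including the `52` in lane 2 that comes from shuffle index 21.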

define <16 x i8> @shuffled_tbl2_to_tbl4_incompatible_tbl2_mask1(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {		; CHECK-LABEL: .LCPI15_0:
; CHECK-LABEL: shuffled_tbl2_to_tbl4_incompatible_tbl2_mask1:		; CHECK-NEXT: .byte 0 // 0x0
		; CHECK-NEXT: .byte 4 // 0x4
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 12 // 0xc
		; CHECK-NEXT: .byte 16 // 0x10
		; CHECK-NEXT: .byte 20 // 0x14
		; CHECK-NEXT: .byte 24 // 0x18
		; CHECK-NEXT: .byte 28 // 0x1c
		; CHECK-NEXT: .byte 32 // 0x20
		; CHECK-NEXT: .byte 36 // 0x24
		; CHECK-NEXT: .byte 40 // 0x28
		; CHECK-NEXT: .byte 44 // 0x2c
		; CHECK-NEXT: .byte 48 // 0x30
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 56 // 0x38
		; CHECK-NEXT: .byte 60 // 0x3c

		define <16 x i8> @shuffled_tbl2_to_tbl4_mixed_tbl2_mask1(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {
		; CHECK-LABEL: shuffled_tbl2_to_tbl4_mixed_tbl2_mask1:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI13_0		; CHECK-NEXT: adrp x8, .LCPI15_0
; CHECK-NEXT: adrp x9, .LCPI13_1		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1		; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI15_0]
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI13_0]		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
; CHECK-NEXT: adrp x8, .LCPI13_2
; CHECK-NEXT: ldr q5, [x9, :lo12:.LCPI13_1]
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v4
; CHECK-NEXT: tbl.16b v1, { v2, v3 }, v5
; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI13_2]
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v2
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}

define <16 x i8> @shuffled_tbl2_to_tbl4_incompatible_tbl2_mask2(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {		; CHECK-LABEL: .LCPI16_0:
; CHECK-LABEL: shuffled_tbl2_to_tbl4_incompatible_tbl2_mask2:		; CHECK-NEXT: .byte 0 // 0x0
		; CHECK-NEXT: .byte 4 // 0x4
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 12 // 0xc
		; CHECK-NEXT: .byte 16 // 0x10
		; CHECK-NEXT: .byte 20 // 0x14
		; CHECK-NEXT: .byte 24 // 0x18
		; CHECK-NEXT: .byte 28 // 0x1c
		; CHECK-NEXT: .byte 32 // 0x20
		; CHECK-NEXT: .byte 36 // 0x24
		; CHECK-NEXT: .byte 40 // 0x28
		; CHECK-NEXT: .byte 44 // 0x2c
		; CHECK-NEXT: .byte 48 // 0x30
		; CHECK-NEXT: .byte 52 // 0x34
		; CHECK-NEXT: .byte 56 // 0x38
		; CHECK-NEXT: .byte 60 // 0x3c

		define <16 x i8> @shuffled_tbl2_to_tbl4_mixed_tbl2_mask2(<16 x i8> %a, <16 x i8> %b, <16 x i8> %c, <16 x i8> %d) {
		; CHECK-LABEL: shuffled_tbl2_to_tbl4_mixed_tbl2_mask2:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI14_0		; CHECK-NEXT: adrp x8, .LCPI16_0
; CHECK-NEXT: adrp x9, .LCPI14_1		; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1		; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q3 killed $q3 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1		; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI16_0]
; CHECK-NEXT: // kill: def $q2 killed $q2 killed $q2_q3 def $q2_q3		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
; CHECK-NEXT: ldr q4, [x8, :lo12:.LCPI14_0]		; CHECK-NEXT: tbl.16b v0, { v0, v1, v2, v3 }, v4
; CHECK-NEXT: adrp x8, .LCPI14_2
; CHECK-NEXT: ldr q5, [x9, :lo12:.LCPI14_1]
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v4
; CHECK-NEXT: tbl.16b v1, { v2, v3 }, v5
; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI14_2]
; CHECK-NEXT: tbl.16b v0, { v0, v1 }, v2
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t1 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %a, <16 x i8> %b, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)		%t2 = call <16 x i8> @llvm.aarch64.neon.tbl2.v16i8(<16 x i8> %c, <16 x i8> %d, <16 x i8> <i8 0, i8 4, i8 8, i8 12, i8 16, i8 20, i8 24, i8 28, i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>)
%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>		%s = shufflevector <16 x i8> %t1, <16 x i8> %t2, <16 x i32> <i32 0, i32 1, i32 21, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
ret <16 x i8> %s		ret <16 x i8> %s
}		}

declare <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8>, <8 x i8>) nounwind readnone		declare <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8>, <8 x i8>) nounwind readnone
[... 107 lines elided ...]

llvm/test/CodeGen/AArch64/fp-conversion-to-tbl.ll

[... 88 lines elided ...]
; CHECK-NEXT: .byte 0 ; 0x0		; CHECK-NEXT: .byte 0 ; 0x0
; CHECK-NEXT: .byte 4 ; 0x4		; CHECK-NEXT: .byte 4 ; 0x4
; CHECK-NEXT: .byte 8 ; 0x8		; CHECK-NEXT: .byte 8 ; 0x8
; CHECK-NEXT: .byte 12 ; 0xc		; CHECK-NEXT: .byte 12 ; 0xc
; CHECK-NEXT: .byte 16 ; 0x10		; CHECK-NEXT: .byte 16 ; 0x10
; CHECK-NEXT: .byte 20 ; 0x14		; CHECK-NEXT: .byte 20 ; 0x14
; CHECK-NEXT: .byte 24 ; 0x18		; CHECK-NEXT: .byte 24 ; 0x18
; CHECK-NEXT: .byte 28 ; 0x1c		; CHECK-NEXT: .byte 28 ; 0x1c
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 32 ; 0x20
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 36 ; 0x24
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 40 ; 0x28
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 44 ; 0x2c
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 48 ; 0x30
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 52 ; 0x34
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 56 ; 0x38
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 60 ; 0x3c

; Tbl can also be used when combining multiple fptoui using a shuffle. The loop		; Tbl can also be used when combining multiple fptoui using a shuffle. The loop
; vectorizer may create such patterns.		; vectorizer may create such patterns.
define void @fptoui_2x_v8f32_to_v8i8_in_loop(ptr %A, ptr %B, ptr %dst) {		define void @fptoui_2x_v8f32_to_v8i8_in_loop(ptr %A, ptr %B, ptr %dst) {
; CHECK-LABEL: fptoui_2x_v8f32_to_v8i8_in_loop:		; CHECK-LABEL: fptoui_2x_v8f32_to_v8i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh2:		; CHECK-NEXT: Lloh2:
; CHECK-NEXT: adrp x9, lCPI2_0@PAGE		; CHECK-NEXT: adrp x9, lCPI2_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh3:		; CHECK-NEXT: Lloh3:
; CHECK-NEXT: ldr q0, [x9, lCPI2_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI2_0@PAGEOFF]
; CHECK-NEXT: LBB2_1: ; %loop		; CHECK-NEXT: LBB2_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: lsl x9, x8, #5		; CHECK-NEXT: lsl x9, x8, #5
; CHECK-NEXT: add x10, x0, x9		; CHECK-NEXT: add x10, x0, x9
; CHECK-NEXT: add x9, x1, x9		; CHECK-NEXT: add x9, x1, x9
; CHECK-NEXT: ldp q1, q2, [x10]		; CHECK-NEXT: ldp q2, q1, [x10]
; CHECK-NEXT: ldp q4, q3, [x9]		; CHECK-NEXT: ldp q4, q3, [x9]
; CHECK-NEXT: fcvtzu.4s v6, v2		; CHECK-NEXT: fcvtzu.4s v17, v1
; CHECK-NEXT: fcvtzu.4s v5, v1		; CHECK-NEXT: fcvtzu.4s v16, v2
; CHECK-NEXT: fcvtzu.4s v2, v3		; CHECK-NEXT: fcvtzu.4s v19, v3
; CHECK-NEXT: fcvtzu.4s v1, v4		; CHECK-NEXT: fcvtzu.4s v18, v4
; CHECK-NEXT: tbl.16b v3, { v5, v6 }, v0		; CHECK-NEXT: tbl.16b v1, { v16, v17, v18, v19 }, v0
; CHECK-NEXT: tbl.16b v1, { v1, v2 }, v0		; CHECK-NEXT: str q1, [x2, x8, lsl #4]
; CHECK-NEXT: mov.d v3[1], v1[0]
; CHECK-NEXT: str q3, [x2, x8, lsl #4]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB2_1		; CHECK-NEXT: b.eq LBB2_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh2, Lloh3		; CHECK-NEXT: .loh AdrpLdr Lloh2, Lloh3
entry:		entry:
br label %loop		br label %loop
[... 13 lines elided ...]
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

; CHECK-LABEL: lCPI3_0:		; CHECK-LABEL: lCPI3_0:
; CHECK-NEXT: .byte 0 ; 0x0		; CHECK-NEXT: .byte 0 ; 0x0
; CHECK-NEXT: .byte 4 ; 0x4		; CHECK-NEXT: .byte 36 ; 0x24
; CHECK-NEXT: .byte 8 ; 0x8		; CHECK-NEXT: .byte 8 ; 0x8
; CHECK-NEXT: .byte 12 ; 0xc		; CHECK-NEXT: .byte 12 ; 0xc
; CHECK-NEXT: .byte 16 ; 0x10		; CHECK-NEXT: .byte 16 ; 0x10
; CHECK-NEXT: .byte 20 ; 0x14		; CHECK-NEXT: .byte 20 ; 0x14
; CHECK-NEXT: .byte 24 ; 0x18		; CHECK-NEXT: .byte 24 ; 0x18
; CHECK-NEXT: .byte 28 ; 0x1c		; CHECK-NEXT: .byte 44 ; 0x2c
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 32 ; 0x20
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 36 ; 0x24
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 40 ; 0x28
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 44 ; 0x2c
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 48 ; 0x30
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 12 ; 0xc
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 56 ; 0x38
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 60 ; 0x3c
; CHECK-NEXT: lCPI3_1:
; CHECK-NEXT: .byte 0 ; 0x0
; CHECK-NEXT: .byte 17 ; 0x11
; CHECK-NEXT: .byte 2 ; 0x2
; CHECK-NEXT: .byte 3 ; 0x3
; CHECK-NEXT: .byte 4 ; 0x4
; CHECK-NEXT: .byte 5 ; 0x5
; CHECK-NEXT: .byte 6 ; 0x6
; CHECK-NEXT: .byte 19 ; 0x13
; CHECK-NEXT: .byte 16 ; 0x10
; CHECK-NEXT: .byte 17 ; 0x11
; CHECK-NEXT: .byte 18 ; 0x12
; CHECK-NEXT: .byte 19 ; 0x13
; CHECK-NEXT: .byte 20 ; 0x14
; CHECK-NEXT: .byte 3 ; 0x3
; CHECK-NEXT: .byte 22 ; 0x16
; CHECK-NEXT: .byte 23 ; 0x17

; We need multiple tbl for the shuffle.
define void @fptoui_2x_v8f32_to_v8i8_in_loop_no_concat_shuffle(ptr %A, ptr %B, ptr %dst) {		define void @fptoui_2x_v8f32_to_v8i8_in_loop_no_concat_shuffle(ptr %A, ptr %B, ptr %dst) {
; CHECK-LABEL: fptoui_2x_v8f32_to_v8i8_in_loop_no_concat_shuffle:		; CHECK-LABEL: fptoui_2x_v8f32_to_v8i8_in_loop_no_concat_shuffle:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh4:		; CHECK-NEXT: Lloh4:
; CHECK-NEXT: adrp x9, lCPI3_0@PAGE		; CHECK-NEXT: adrp x9, lCPI3_0@PAGE
; CHECK-NEXT: Lloh5:
; CHECK-NEXT: adrp x10, lCPI3_1@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh6:		; CHECK-NEXT: Lloh5:
; CHECK-NEXT: ldr q0, [x9, lCPI3_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI3_0@PAGEOFF]
; CHECK-NEXT: Lloh7:
; CHECK-NEXT: ldr q1, [x10, lCPI3_1@PAGEOFF]
; CHECK-NEXT: LBB3_1: ; %loop		; CHECK-NEXT: LBB3_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: lsl x9, x8, #5		; CHECK-NEXT: lsl x9, x8, #5
; CHECK-NEXT: add x10, x0, x9		; CHECK-NEXT: add x10, x0, x9
; CHECK-NEXT: add x9, x1, x9		; CHECK-NEXT: add x9, x1, x9
; CHECK-NEXT: ldp q2, q3, [x10]		; CHECK-NEXT: ldp q2, q1, [x10]
; CHECK-NEXT: ldp q5, q4, [x9]		; CHECK-NEXT: ldp q4, q3, [x9]
; CHECK-NEXT: fcvtzu.4s v7, v3		; CHECK-NEXT: fcvtzu.4s v17, v1
; CHECK-NEXT: fcvtzu.4s v6, v2		; CHECK-NEXT: fcvtzu.4s v16, v2
; CHECK-NEXT: fcvtzu.4s v3, v4		; CHECK-NEXT: fcvtzu.4s v19, v3
; CHECK-NEXT: fcvtzu.4s v2, v5		; CHECK-NEXT: fcvtzu.4s v18, v4
; CHECK-NEXT: tbl.16b v4, { v6, v7 }, v0		; CHECK-NEXT: tbl.16b v1, { v16, v17, v18, v19 }, v0
; CHECK-NEXT: tbl.16b v5, { v2, v3 }, v0		; CHECK-NEXT: str q1, [x2, x8, lsl #4]
; CHECK-NEXT: tbl.16b v2, { v4, v5 }, v1
; CHECK-NEXT: str q2, [x2, x8, lsl #4]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB3_1		; CHECK-NEXT: b.eq LBB3_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh5, Lloh7		; CHECK-NEXT: .loh AdrpLdr Lloh4, Lloh5
; CHECK-NEXT: .loh AdrpLdr Lloh4, Lloh6
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <8 x float>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <8 x float>, ptr %A, i64 %iv
%gep.B = getelementptr inbounds <8x float>, ptr %B, i64 %iv		%gep.B = getelementptr inbounds <8x float>, ptr %B, i64 %iv
%l.A = load <8 x float>, ptr %gep.A		%l.A = load <8 x float>, ptr %gep.A
[... 27 lines elided ...]
; CHECK-NEXT: .byte 48 ; 0x30		; CHECK-NEXT: .byte 48 ; 0x30
; CHECK-NEXT: .byte 52 ; 0x34		; CHECK-NEXT: .byte 52 ; 0x34
; CHECK-NEXT: .byte 56 ; 0x38		; CHECK-NEXT: .byte 56 ; 0x38
; CHECK-NEXT: .byte 60 ; 0x3c		; CHECK-NEXT: .byte 60 ; 0x3c

define void @fptoui_v16f32_to_v16i8_in_loop(ptr %A, ptr %dst) {		define void @fptoui_v16f32_to_v16i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: fptoui_v16f32_to_v16i8_in_loop:		; CHECK-LABEL: fptoui_v16f32_to_v16i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh8:		; CHECK-NEXT: Lloh6:
; CHECK-NEXT: adrp x9, lCPI4_0@PAGE		; CHECK-NEXT: adrp x9, lCPI4_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh9:		; CHECK-NEXT: Lloh7:
; CHECK-NEXT: ldr q0, [x9, lCPI4_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI4_0@PAGEOFF]
; CHECK-NEXT: LBB4_1: ; %loop		; CHECK-NEXT: LBB4_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #6		; CHECK-NEXT: add x9, x0, x8, lsl #6
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: ldp q2, q1, [x9, #32]		; CHECK-NEXT: ldp q2, q1, [x9, #32]
; CHECK-NEXT: ldp q4, q3, [x9]		; CHECK-NEXT: ldp q4, q3, [x9]
; CHECK-NEXT: fcvtzu.4s v19, v1		; CHECK-NEXT: fcvtzu.4s v19, v1
; CHECK-NEXT: fcvtzu.4s v18, v2		; CHECK-NEXT: fcvtzu.4s v18, v2
; CHECK-NEXT: fcvtzu.4s v17, v3		; CHECK-NEXT: fcvtzu.4s v17, v3
; CHECK-NEXT: fcvtzu.4s v16, v4		; CHECK-NEXT: fcvtzu.4s v16, v4
; CHECK-NEXT: tbl.16b v1, { v16, v17, v18, v19 }, v0		; CHECK-NEXT: tbl.16b v1, { v16, v17, v18, v19 }, v0
; CHECK-NEXT: str q1, [x1], #32		; CHECK-NEXT: str q1, [x1], #32
; CHECK-NEXT: b.eq LBB4_1		; CHECK-NEXT: b.eq LBB4_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh8, Lloh9		; CHECK-NEXT: .loh AdrpLdr Lloh6, Lloh7
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <16 x float>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <16 x float>, ptr %A, i64 %iv
%l.A = load <16 x float>, ptr %gep.A		%l.A = load <16 x float>, ptr %gep.A
%c = fptoui <16 x float> %l.A to <16 x i8>		%c = fptoui <16 x float> %l.A to <16 x i8>
[... 23 lines elided ...]
; CHECK-NEXT: .byte 48 ; 0x30		; CHECK-NEXT: .byte 48 ; 0x30
; CHECK-NEXT: .byte 52 ; 0x34		; CHECK-NEXT: .byte 52 ; 0x34
; CHECK-NEXT: .byte 56 ; 0x38		; CHECK-NEXT: .byte 56 ; 0x38
; CHECK-NEXT: .byte 60 ; 0x3c		; CHECK-NEXT: .byte 60 ; 0x3c

define void @fptoui_2x_v16f32_to_v16i8_in_loop(ptr %A, ptr %B, ptr %dst) {		define void @fptoui_2x_v16f32_to_v16i8_in_loop(ptr %A, ptr %B, ptr %dst) {
; CHECK-LABEL: fptoui_2x_v16f32_to_v16i8_in_loop:		; CHECK-LABEL: fptoui_2x_v16f32_to_v16i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh10:		; CHECK-NEXT: Lloh8:
; CHECK-NEXT: adrp x9, lCPI5_0@PAGE		; CHECK-NEXT: adrp x9, lCPI5_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh11:		; CHECK-NEXT: Lloh9:
; CHECK-NEXT: ldr q0, [x9, lCPI5_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI5_0@PAGEOFF]
; CHECK-NEXT: LBB5_1: ; %loop		; CHECK-NEXT: LBB5_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: lsl x9, x8, #6		; CHECK-NEXT: lsl x9, x8, #6
; CHECK-NEXT: add x10, x0, x9		; CHECK-NEXT: add x10, x0, x9
; CHECK-NEXT: add x9, x1, x9		; CHECK-NEXT: add x9, x1, x9
; CHECK-NEXT: ldp q1, q2, [x10, #32]		; CHECK-NEXT: ldp q1, q2, [x10, #32]
; CHECK-NEXT: ldp q3, q4, [x9, #32]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
[... 11 lines elided ...]
; CHECK-NEXT: fcvtzu.4s v22, v16		; CHECK-NEXT: fcvtzu.4s v22, v16
; CHECK-NEXT: fcvtzu.4s v21, v7		; CHECK-NEXT: fcvtzu.4s v21, v7
; CHECK-NEXT: tbl.16b v1, { v17, v18, v19, v20 }, v0		; CHECK-NEXT: tbl.16b v1, { v17, v18, v19, v20 }, v0
; CHECK-NEXT: tbl.16b v2, { v21, v22, v23, v24 }, v0		; CHECK-NEXT: tbl.16b v2, { v21, v22, v23, v24 }, v0
; CHECK-NEXT: stp q2, q1, [x9]		; CHECK-NEXT: stp q2, q1, [x9]
; CHECK-NEXT: b.eq LBB5_1		; CHECK-NEXT: b.eq LBB5_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh10, Lloh11		; CHECK-NEXT: .loh AdrpLdr Lloh8, Lloh9
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <16 x float>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <16 x float>, ptr %A, i64 %iv
%gep.B = getelementptr inbounds <16 x float>, ptr %B, i64 %iv		%gep.B = getelementptr inbounds <16 x float>, ptr %B, i64 %iv
%l.A = load <16 x float>, ptr %gep.A		%l.A = load <16 x float>, ptr %gep.A
[... 125 lines elided ...]
; CHECK-NEXT: .byte 3 ; 0x3		; CHECK-NEXT: .byte 3 ; 0x3
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff

define void @uitofp_v8i8_to_v8f32(ptr %src, ptr %dst) {		define void @uitofp_v8i8_to_v8f32(ptr %src, ptr %dst) {
; CHECK-LABEL: uitofp_v8i8_to_v8f32:		; CHECK-LABEL: uitofp_v8i8_to_v8f32:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh12:		; CHECK-NEXT: Lloh10:
; CHECK-NEXT: adrp x9, lCPI8_0@PAGE		; CHECK-NEXT: adrp x9, lCPI8_0@PAGE
; CHECK-NEXT: Lloh13:		; CHECK-NEXT: Lloh11:
; CHECK-NEXT: adrp x10, lCPI8_1@PAGE		; CHECK-NEXT: adrp x10, lCPI8_1@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh14:		; CHECK-NEXT: Lloh12:
; CHECK-NEXT: ldr q0, [x9, lCPI8_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI8_0@PAGEOFF]
; CHECK-NEXT: Lloh15:		; CHECK-NEXT: Lloh13:
; CHECK-NEXT: ldr q1, [x10, lCPI8_1@PAGEOFF]		; CHECK-NEXT: ldr q1, [x10, lCPI8_1@PAGEOFF]
; CHECK-NEXT: LBB8_1: ; %loop		; CHECK-NEXT: LBB8_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldr d2, [x0, x8, lsl #3]		; CHECK-NEXT: ldr d2, [x0, x8, lsl #3]
; CHECK-NEXT: add x9, x1, x8, lsl #5		; CHECK-NEXT: add x9, x1, x8, lsl #5
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: tbl.16b v3, { v2 }, v0		; CHECK-NEXT: tbl.16b v3, { v2 }, v0
; CHECK-NEXT: tbl.16b v2, { v2 }, v1		; CHECK-NEXT: tbl.16b v2, { v2 }, v1
; CHECK-NEXT: ucvtf.4s v3, v3		; CHECK-NEXT: ucvtf.4s v3, v3
; CHECK-NEXT: ucvtf.4s v2, v2		; CHECK-NEXT: ucvtf.4s v2, v2
; CHECK-NEXT: stp q2, q3, [x9]		; CHECK-NEXT: stp q2, q3, [x9]
; CHECK-NEXT: b.eq LBB8_1		; CHECK-NEXT: b.eq LBB8_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh13, Lloh15		; CHECK-NEXT: .loh AdrpLdr Lloh11, Lloh13
; CHECK-NEXT: .loh AdrpLdr Lloh12, Lloh14		; CHECK-NEXT: .loh AdrpLdr Lloh10, Lloh12
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.src = getelementptr inbounds <8 x i8>, ptr %src, i64 %iv		%gep.src = getelementptr inbounds <8 x i8>, ptr %src, i64 %iv
%l = load <8 x i8>, ptr %gep.src		%l = load <8 x i8>, ptr %gep.src
%conv = uitofp <8 x i8> %l to <8 x float>		%conv = uitofp <8 x i8> %l to <8 x float>
[... 74 lines elided ...]
; CHECK-NEXT: .byte 3 ; 0x3		; CHECK-NEXT: .byte 3 ; 0x3
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff
; CHECK-NEXT: .byte 255 ; 0xff		; CHECK-NEXT: .byte 255 ; 0xff

define void @uitofp_v16i8_to_v16f32(ptr %src, ptr %dst) {		define void @uitofp_v16i8_to_v16f32(ptr %src, ptr %dst) {
; CHECK-LABEL: uitofp_v16i8_to_v16f32:		; CHECK-LABEL: uitofp_v16i8_to_v16f32:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: Lloh16:		; CHECK-NEXT: Lloh14:
; CHECK-NEXT: adrp x9, lCPI9_0@PAGE		; CHECK-NEXT: adrp x9, lCPI9_0@PAGE
; CHECK-NEXT: Lloh17:		; CHECK-NEXT: Lloh15:
; CHECK-NEXT: adrp x10, lCPI9_1@PAGE		; CHECK-NEXT: adrp x10, lCPI9_1@PAGE
; CHECK-NEXT: Lloh18:		; CHECK-NEXT: Lloh16:
; CHECK-NEXT: adrp x11, lCPI9_2@PAGE		; CHECK-NEXT: adrp x11, lCPI9_2@PAGE
; CHECK-NEXT: Lloh19:		; CHECK-NEXT: Lloh17:
; CHECK-NEXT: adrp x12, lCPI9_3@PAGE		; CHECK-NEXT: adrp x12, lCPI9_3@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: Lloh20:		; CHECK-NEXT: Lloh18:
; CHECK-NEXT: ldr q0, [x9, lCPI9_0@PAGEOFF]		; CHECK-NEXT: ldr q0, [x9, lCPI9_0@PAGEOFF]
; CHECK-NEXT: Lloh21:		; CHECK-NEXT: Lloh19:
; CHECK-NEXT: ldr q1, [x10, lCPI9_1@PAGEOFF]		; CHECK-NEXT: ldr q1, [x10, lCPI9_1@PAGEOFF]
; CHECK-NEXT: Lloh22:		; CHECK-NEXT: Lloh20:
; CHECK-NEXT: ldr q2, [x11, lCPI9_2@PAGEOFF]		; CHECK-NEXT: ldr q2, [x11, lCPI9_2@PAGEOFF]
; CHECK-NEXT: Lloh23:		; CHECK-NEXT: Lloh21:
; CHECK-NEXT: ldr q3, [x12, lCPI9_3@PAGEOFF]		; CHECK-NEXT: ldr q3, [x12, lCPI9_3@PAGEOFF]
; CHECK-NEXT: LBB9_1: ; %loop		; CHECK-NEXT: LBB9_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldr q4, [x0, x8, lsl #4]		; CHECK-NEXT: ldr q4, [x0, x8, lsl #4]
; CHECK-NEXT: add x9, x1, x8, lsl #6		; CHECK-NEXT: add x9, x1, x8, lsl #6
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: tbl.16b v5, { v4 }, v0		; CHECK-NEXT: tbl.16b v5, { v4 }, v0
; CHECK-NEXT: tbl.16b v6, { v4 }, v1		; CHECK-NEXT: tbl.16b v6, { v4 }, v1
; CHECK-NEXT: tbl.16b v7, { v4 }, v2		; CHECK-NEXT: tbl.16b v7, { v4 }, v2
; CHECK-NEXT: tbl.16b v4, { v4 }, v3		; CHECK-NEXT: tbl.16b v4, { v4 }, v3
; CHECK-NEXT: ucvtf.4s v5, v5		; CHECK-NEXT: ucvtf.4s v5, v5
; CHECK-NEXT: ucvtf.4s v6, v6		; CHECK-NEXT: ucvtf.4s v6, v6
; CHECK-NEXT: ucvtf.4s v7, v7		; CHECK-NEXT: ucvtf.4s v7, v7
; CHECK-NEXT: ucvtf.4s v4, v4		; CHECK-NEXT: ucvtf.4s v4, v4
; CHECK-NEXT: stp q6, q5, [x9, #32]		; CHECK-NEXT: stp q6, q5, [x9, #32]
; CHECK-NEXT: stp q4, q7, [x9]		; CHECK-NEXT: stp q4, q7, [x9]
; CHECK-NEXT: b.eq LBB9_1		; CHECK-NEXT: b.eq LBB9_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
; CHECK-NEXT: .loh AdrpLdr Lloh19, Lloh23
; CHECK-NEXT: .loh AdrpLdr Lloh18, Lloh22
; CHECK-NEXT: .loh AdrpLdr Lloh17, Lloh21		; CHECK-NEXT: .loh AdrpLdr Lloh17, Lloh21
; CHECK-NEXT: .loh AdrpLdr Lloh16, Lloh20		; CHECK-NEXT: .loh AdrpLdr Lloh16, Lloh20
		; CHECK-NEXT: .loh AdrpLdr Lloh15, Lloh19
		; CHECK-NEXT: .loh AdrpLdr Lloh14, Lloh18
entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.src = getelementptr inbounds <16 x i8>, ptr %src, i64 %iv		%gep.src = getelementptr inbounds <16 x i8>, ptr %src, i64 %iv
%l = load <16 x i8>, ptr %gep.src		%l = load <16 x i8>, ptr %gep.src
%conv = uitofp <16 x i8> %l to <16 x float>		%conv = uitofp <16 x i8> %l to <16 x float>
[... 9 lines elided ...]