This is an archive of the discontinued LLVM Phabricator instance.


Differential D129758

[AArch64][SVE] Lower DUPLANE128 to LD1RQD
Abandoned, Public

Authored by MattDevereau on Jul 14 2022, 4:57 AM.


Details

Reviewers

peterwaller-arm
paulwalker-arm
c-rhodes
bsmith
dtemirbulatov
david-arm
efriedma

Summary

Following on from https://reviews.llvm.org/D128902, lower DUPLANE128 to LD1RQD. This also introduces DAGCombine logic that moves bitcasts out of the load sequence, so that fewer logically redundant patterns need to be added to instruction selection.
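As a rough illustration of why this lowering is valid, the sketch below models an SVE register as a plain byte vector and compares "insert a 128-bit value at index 0, then duplicate quadword lane 0" with "load one quadword and replicate it". The type `SveReg` and both function names are hypothetical and exist only for this example; the model assumes a little-endian target and an all-true predicate.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model: an SVE register is a byte vector whose length is a
// multiple of 16 bytes. Real hardware fixes the vector length; here it is
// just a parameter.
using SveReg = std::vector<std::uint8_t>;

// Models "load one 128-bit quadword and replicate it across the register",
// i.e. the behaviour expected from ld1rq* with an all-true predicate.
SveReg loadAndReplicateQuadword(const std::uint8_t *mem, std::size_t vlBytes) {
  SveReg reg(vlBytes);
  for (std::size_t off = 0; off + 16 <= vlBytes; off += 16)
    std::memcpy(reg.data() + off, mem, 16);
  return reg;
}

// Models the unfused sequence: place 128 bits at the bottom of an otherwise
// undefined register, then duplicate quadword lane 0 into every quadword.
SveReg insertThenDupLane0(const std::uint8_t *mem, std::size_t vlBytes) {
  SveReg tmp(vlBytes, 0xAA);        // 0xAA stands in for undef contents
  std::memcpy(tmp.data(), mem, 16); // insert_subvector at index 0
  SveReg out(vlBytes);
  for (std::size_t off = 0; off + 16 <= vlBytes; off += 16)
    std::memcpy(out.data() + off, tmp.data(), 16); // dup of quadword lane 0
  return out;
}
```

In this model both functions return the same bytes for any vector length that is a multiple of 16, which is the equivalence the isel pattern relies on.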

Diff Detail

Unit Tests: Failed

	Time	Test
	60,120 ms	x64 debian > AddressSanitizer-x86_64-linux-dynamic.TestCases::scariness_score_test.cpp
	60,090 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
	60,050 ms	x64 debian > Clang.Driver::emit-reproducer.c
	60,400 ms	x64 debian > Clang.Driver::fsanitize.c
	60,740 ms	x64 debian > Clang.OpenMP::target_update_codegen.cpp

Event Timeline

MattDevereau created this revision. Jul 14 2022, 4:57 AM

Herald added a reviewer: efriedma. Jul 14 2022, 4:57 AM

Herald added a project: Restricted Project.

Herald added subscribers: steven.zhang, psnobl, hiraditya and 2 others.

MattDevereau requested review of this revision. Jul 14 2022, 4:57 AM

Herald added a project: Restricted Project. Jul 14 2022, 4:57 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B175359: Diff 444600. Jul 14 2022, 6:30 AM

Following an offline discussion: the choice between ld1rqb/h/w/d needs to respect the element width of the original load type, because of big-endian targets.
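The big-endian concern can be sketched with a small model. It assumes that an element-wise vector load on a big-endian target byte-swaps within each element, so the same 16 bytes of memory produce a different register image depending on the element width chosen. `Quad` and `loadElements` are hypothetical names used only for illustration; this is not LLVM or hardware-accurate code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Quad = std::array<std::uint8_t, 16>;

// Returns the register byte image after loading 16 bytes as elements of
// `elemBytes` bytes each. Register byte 0 is the least significant byte of
// lane 0. On a big-endian target the first byte in memory is the most
// significant byte of its element, so bytes are reversed within each element.
Quad loadElements(const Quad &mem, std::size_t elemBytes, bool bigEndian) {
  Quad reg{};
  for (std::size_t base = 0; base < mem.size(); base += elemBytes)
    for (std::size_t k = 0; k < elemBytes; ++k)
      reg[base + k] = bigEndian ? mem[base + elemBytes - 1 - k] : mem[base + k];
  return reg;
}
```

In this model a little-endian load gives the same image for every element width, so treating all 128-bit types as v2i64 would be harmless there. With `bigEndian` set, a 4-byte-element load and an 8-byte-element load of the same memory differ, which is why the selected instruction has to match the original element width.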

Harbormaster completed remote builds in B175396: Diff 444647. Jul 14 2022, 9:41 AM

Matt added a subscriber: Matt. Jul 14 2022, 10:50 AM

c-rhodes added inline comments. Jul 15 2022, 2:35 AM

llvm/include/llvm/Target/TargetSelectionDAG.td
709	I know you copied this from extract above, but I don't get why the operands are in reverse order?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19195	nit: this can be moved closer to its use (`NewInsert`)
llvm/lib/Target/AArch64/SVEInstrFormats.td
6910	nit: align with above pattern

paulwalker-arm added inline comments. Jul 15 2022, 3:34 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19186–19193	`NewVT = VT.changeTypeToInteger()`?
19205–19207	Do you need to care what `Bitcast.getOperand(0)` is? I think we're just simplifying the DAG to remove redundant bitcasts to aid isel.

MattDevereau updated this revision to Diff 444953. Jul 15 2022, 5:43 AM

MattDevereau retitled this revision from [AArch64][SVE] Lower DUPELANE128 to LD1RQD to [AArch64][SVE] Lower DUPLANE128 to LD1RQD.

MattDevereau set the repository for this revision to rG LLVM Github Monorepo.

Harbormaster completed remote builds in B175617: Diff 444953. Jul 15 2022, 7:12 AM

Sorry, I rushed to suggest code improvements without first verifying the correctness of the patch's intent.

llvm/include/llvm/Target/TargetSelectionDAG.td
709	Can this be `SDTCisSameAs<0, 1>`?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19183	The goal here is to replace `duplane128(insert_subvector(x, bitcast(y), idx1), idx2)` with `bitcast(duplane128(insert_subvector(new_x, y, idx), idx))`, but that is only safe for specific instances of `insert_subvector`. It's critical that `y` is a 128-bit fixed-length vector and `idx1==idx2`. Given the use of `DAG.getUNDEF` you also require `x` to be `undef`. With these requirements in place I think `NewVT` becomes `getPackedSVEVectorVT(y->getValueType()->getVectorElementType())`.
19191–19192	I still don't know why this matters?
llvm/lib/Target/AArch64/SVEInstrFormats.td
6909–6910	Loads and stores are the exception to the rule when it comes to adding patterns to the multiclass. You'll see this with the scalar versions of ld1r. The reason is that the B, H, S, D forms are not hidden by the multiclass like they are for, say, the arithmetic instructions. Having the patterns outside (i.e. within AArch64InstrInfo.td) means we can handle the floating-point operations as well as make optimal use of the addressing modes.
llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll
592–594	Based on the patch I think you should simplify all `ld1rq#` tests by removing the constants and just have the test load the data explicitly. This will also help in the future if there turns out to be a better way to compute the constant vectors these tests use. So for example:

	define dso_local <vscale x 2 x double> @dupq_ld1rqd_f64(ptr %a) {
	  %1 = load <2 x double>, ptr %a
	  %2 = tail call fast <vscale x 2 x double> @llvm.vector.insert.nxv2f64.v2f64(<vscale x 2 x double> undef, <2 x double> %1, i64 0)
	  %3 = tail call fast <vscale x 2 x double> @llvm.aarch64.sve.dupq.lane.nxv2f64(<vscale x 2 x double> %2, i64 0)
	  ret <vscale x 2 x double> %3
	}

I also believe the tests are better placed in `sve-ld1r.ll`.
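The safety conditions listed in the comment on line 19183 of AArch64ISelLowering.cpp can be sketched as a standalone check. This is a toy model, not LLVM code: `Op`, `Node` and `canPushBitcastAfterDupLane128` are hypothetical stand-ins for SelectionDAG concepts, and the checks simply mirror the conditions as the reviewer states them (`x` is undef, `idx1 == idx2`, `y` is a 128-bit fixed-length vector).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for SelectionDAG nodes.
enum class Op { Undef, Load, Bitcast, InsertSubvector, DupLane128, Other };

struct Node {
  Op op;
  std::vector<const Node *> operands; // value operands only
  std::uint64_t index;                // insert index or lane index, if any
  unsigned fixedBits;                 // size of a fixed-length vector, else 0
};

// Returns true when
//   duplane128(insert_subvector(x, bitcast(y), idx1), idx2)
// matches the shape for which the bitcast may be moved after the duplane128.
bool canPushBitcastAfterDupLane128(const Node &dup) {
  if (dup.op != Op::DupLane128 || dup.operands.size() != 1)
    return false;
  const Node &ins = *dup.operands[0];
  if (ins.op != Op::InsertSubvector || ins.operands.size() != 2)
    return false;
  const Node &x = *ins.operands[0];
  const Node &cast = *ins.operands[1];
  if (x.op != Op::Undef)      // the rewrite rebuilds x as undef
    return false;
  if (ins.index != dup.index) // idx1 == idx2
    return false;
  if (cast.op != Op::Bitcast || cast.operands.size() != 1)
    return false;
  const Node &y = *cast.operands[0];
  return y.fixedBits == 128;  // y is a 128-bit fixed-length vector
}
```

Note that the check does not inspect what produces `y`, in line with the earlier question about whether `Bitcast.getOperand(0)` needs to be a load at all.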

paulwalker-arm added inline comments. Jul 15 2022, 12:27 PM

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll
592–594	I'd like to backtrack slightly. I still believe we want simpler tests added to sve-ld1r.ll for the new isel patterns. Then I guess these existing tests are required to show the need for the DAG combine. For this reason I think you want two patches, one for the isel then a second for the DAG combine

This is being split into two patches, the first of which is https://reviews.llvm.org/D130010

MattDevereau mentioned this in D130013: [AArch64][SVE] Add DAG-Combine to push bitcasts from floating point loads after DUPLANE128. Jul 19 2022, 8:32 AM

Revision Contents

	Path	Size
	llvm/include/llvm/Target/TargetSelectionDAG.td	3 lines
	llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	42 lines
	llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td	8 lines
	llvm/lib/Target/AArch64/SVEInstrFormats.td	5 lines
	llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll	52 lines

Diff 444647

llvm/include/llvm/Target/TargetSelectionDAG.td

	def concat_vectors : SDNode<"ISD::CONCAT_VECTORS",			def concat_vectors : SDNode<"ISD::CONCAT_VECTORS",
	SDTypeProfile<1, 2, [SDTCisSubVecOfVec<1, 0>, SDTCisSameAs<1, 2>]>,[]>;			SDTypeProfile<1, 2, [SDTCisSubVecOfVec<1, 0>, SDTCisSameAs<1, 2>]>,[]>;

	// This operator does not do subvector type checking. The ARM			// This operator does not do subvector type checking. The ARM
	// backend, at least, needs it.			// backend, at least, needs it.
	def vector_extract_subvec : SDNode<"ISD::EXTRACT_SUBVECTOR",			def vector_extract_subvec : SDNode<"ISD::EXTRACT_SUBVECTOR",
	SDTypeProfile<1, 2, [SDTCisInt<2>, SDTCisVec<1>, SDTCisVec<0>]>,			SDTypeProfile<1, 2, [SDTCisInt<2>, SDTCisVec<1>, SDTCisVec<0>]>,
	[]>;			[]>;
				def vector_insert_subvec : SDNode<"ISD::INSERT_SUBVECTOR",
				SDTypeProfile<1, 3, [SDTCisInt<3>, SDTCisVec<2>, SDTCisVec<1>, SDTCisVec<0>]>,
				[]>;

	// This operator does subvector type checking.			// This operator does subvector type checking.
	def extract_subvector : SDNode<"ISD::EXTRACT_SUBVECTOR", SDTSubVecExtract, []>;			def extract_subvector : SDNode<"ISD::EXTRACT_SUBVECTOR", SDTSubVecExtract, []>;
	def insert_subvector : SDNode<"ISD::INSERT_SUBVECTOR", SDTSubVecInsert, []>;			def insert_subvector : SDNode<"ISD::INSERT_SUBVECTOR", SDTSubVecInsert, []>;

	// Nodes for intrinsics, you should use the intrinsic itself and let tblgen use			// Nodes for intrinsics, you should use the intrinsic itself and let tblgen use
	// these internally. Don't reference these directly.			// these internally. Don't reference these directly.
	def intrinsic_void : SDNode<"ISD::INTRINSIC_VOID",			def intrinsic_void : SDNode<"ISD::INTRINSIC_VOID",

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


DCI.CombineTo(N0.getNode(),		DCI.CombineTo(N0.getNode(),
ExtLoad, DAG.getIntPtrConstant(1, SDLoc(N0))),		ExtLoad, DAG.getIntPtrConstant(1, SDLoc(N0))),
ExtLoad.getValue(1));		ExtLoad.getValue(1));
return SDValue(N, 0); // Return N so it doesn't get rechecked!		return SDValue(N, 0); // Return N so it doesn't get rechecked!
}		}

return SDValue();		return SDValue();
}		}

		// Loading floating point literals from the constant-pool results in bitcasts
		// to floats from integer loads. Instead, the whole duplane128 intrinsic can
		// be treated as a v2i64 load as 128 bits are always loaded as integers,
		// and the bitcast can be pushed after the duplane128.
		// Treating all 128 bit combinations (e.g. v4i32) of types as v2i64 results in
		// simpler pattern matching for Instruction Selection to LD1RQD
		static SDValue performDupLane128Combine(SDNode *N, SelectionDAG &DAG) {
		EVT VT = N->getValueType(0);
		EVT NewVT;
		if (VT == MVT::nxv2f64)
		NewVT = MVT::nxv2i64;
		else if (VT == MVT::nxv4f32)
		NewVT = MVT::nxv4i32;
		else if (VT == MVT::nxv8f16 || VT == MVT::nxv8bf16)
		NewVT = MVT::nxv8i16;
		else
		return SDValue();

		SDLoc DL(N);

		SDValue Insert = N->getOperand(0);
		if (Insert.getOpcode() != ISD::INSERT_SUBVECTOR)
		return SDValue();

		SDValue Bitcast = Insert.getOperand(1);
		if (Bitcast.getOpcode() != ISD::BITCAST)
		return SDValue();

		SDValue Load = Bitcast.getOperand(0);
		if (Load.getOpcode() != ISD::LOAD)
		return SDValue();

		SDValue NewInsert =
		DAG.getNode(ISD::INSERT_SUBVECTOR, DL, NewVT, DAG.getUNDEF(NewVT), Load,
		Insert->getOperand(2));
		SDValue NewDuplane128 = DAG.getNode(AArch64ISD::DUPLANE128, DL, NewVT,
		NewInsert, N->getOperand(1));
		return DAG.getNode(ISD::BITCAST, DL, VT, NewDuplane128);
		}

static SDValue performBSPExpandForSVE(SDNode *N, SelectionDAG &DAG,		static SDValue performBSPExpandForSVE(SDNode *N, SelectionDAG &DAG,
const AArch64Subtarget *Subtarget,		const AArch64Subtarget *Subtarget,
bool fixedSVEVectorVT) {		bool fixedSVEVectorVT) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// Don't expand for SVE2		// Don't expand for SVE2
if (!VT.isScalableVector() || Subtarget->hasSVE2() || Subtarget->hasSME())		if (!VT.isScalableVector() || Subtarget->hasSVE2() || Subtarget->hasSME())
return SDValue();		return SDValue();
[...]
case ISD::STORE:		case ISD::STORE:
return performSTORECombine(N, DCI, DAG, Subtarget);		return performSTORECombine(N, DCI, DAG, Subtarget);
case ISD::MGATHER:		case ISD::MGATHER:
case ISD::MSCATTER:		case ISD::MSCATTER:
return performMaskedGatherScatterCombine(N, DCI, DAG);		return performMaskedGatherScatterCombine(N, DCI, DAG);
case ISD::VECTOR_SPLICE:		case ISD::VECTOR_SPLICE:
return performSVESpliceCombine(N, DAG);		return performSVESpliceCombine(N, DAG);
case ISD::FP_EXTEND:		case ISD::FP_EXTEND:
return performFPExtendCombine(N, DAG, DCI, Subtarget);		return performFPExtendCombine(N, DAG, DCI, Subtarget);
		case AArch64ISD::DUPLANE128:
		return performDupLane128Combine(N, DAG);
case AArch64ISD::BRCOND:		case AArch64ISD::BRCOND:
return performBRCONDCombine(N, DCI, DAG);		return performBRCONDCombine(N, DCI, DAG);
case AArch64ISD::TBNZ:		case AArch64ISD::TBNZ:
case AArch64ISD::TBZ:		case AArch64ISD::TBZ:
return performTBZCombine(N, DCI, DAG);		return performTBZCombine(N, DCI, DAG);
case AArch64ISD::CSEL:		case AArch64ISD::CSEL:
return performCSELCombine(N, DCI, DAG);		return performCSELCombine(N, DCI, DAG);
case AArch64ISD::DUP:		case AArch64ISD::DUP:

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td

let Predicates = [HasSVEorSME] in {		let Predicates = [HasSVEorSME] in {
defm LD1RW_IMM : sve_mem_ld_dup<0b10, 0b10, "ld1rw", Z_s, ZPR32, uimm6s4>;		defm LD1RW_IMM : sve_mem_ld_dup<0b10, 0b10, "ld1rw", Z_s, ZPR32, uimm6s4>;
defm LD1RW_D_IMM : sve_mem_ld_dup<0b10, 0b11, "ld1rw", Z_d, ZPR64, uimm6s4>;		defm LD1RW_D_IMM : sve_mem_ld_dup<0b10, 0b11, "ld1rw", Z_d, ZPR64, uimm6s4>;
defm LD1RSB_D_IMM : sve_mem_ld_dup<0b11, 0b00, "ld1rsb", Z_d, ZPR64, uimm6s1>;		defm LD1RSB_D_IMM : sve_mem_ld_dup<0b11, 0b00, "ld1rsb", Z_d, ZPR64, uimm6s1>;
defm LD1RSB_S_IMM : sve_mem_ld_dup<0b11, 0b01, "ld1rsb", Z_s, ZPR32, uimm6s1>;		defm LD1RSB_S_IMM : sve_mem_ld_dup<0b11, 0b01, "ld1rsb", Z_s, ZPR32, uimm6s1>;
defm LD1RSB_H_IMM : sve_mem_ld_dup<0b11, 0b10, "ld1rsb", Z_h, ZPR16, uimm6s1>;		defm LD1RSB_H_IMM : sve_mem_ld_dup<0b11, 0b10, "ld1rsb", Z_h, ZPR16, uimm6s1>;
defm LD1RD_IMM : sve_mem_ld_dup<0b11, 0b11, "ld1rd", Z_d, ZPR64, uimm6s8>;		defm LD1RD_IMM : sve_mem_ld_dup<0b11, 0b11, "ld1rd", Z_d, ZPR64, uimm6s8>;

// LD1RQ loads (load quadword-vector and splat to scalable vector)		// LD1RQ loads (load quadword-vector and splat to scalable vector)
defm LD1RQ_B_IMM : sve_mem_ldqr_si<0b00, "ld1rqb", Z_b, ZPR8>;		defm LD1RQ_B_IMM : sve_mem_ldqr_si<0b00, "ld1rqb", Z_b, ZPR8, nxv16i8, v16i8, PTRUE_B>;
defm LD1RQ_H_IMM : sve_mem_ldqr_si<0b01, "ld1rqh", Z_h, ZPR16>;		defm LD1RQ_H_IMM : sve_mem_ldqr_si<0b01, "ld1rqh", Z_h, ZPR16, nxv8i16, v8i16, PTRUE_H>;
defm LD1RQ_W_IMM : sve_mem_ldqr_si<0b10, "ld1rqw", Z_s, ZPR32>;		defm LD1RQ_W_IMM : sve_mem_ldqr_si<0b10, "ld1rqw", Z_s, ZPR32, nxv4i32, v4i32, PTRUE_S>;
defm LD1RQ_D_IMM : sve_mem_ldqr_si<0b11, "ld1rqd", Z_d, ZPR64>;		defm LD1RQ_D_IMM : sve_mem_ldqr_si<0b11, "ld1rqd", Z_d, ZPR64, nxv2i64, v2i64, PTRUE_D>;
defm LD1RQ_B : sve_mem_ldqr_ss<0b00, "ld1rqb", Z_b, ZPR8, GPR64NoXZRshifted8>;		defm LD1RQ_B : sve_mem_ldqr_ss<0b00, "ld1rqb", Z_b, ZPR8, GPR64NoXZRshifted8>;
defm LD1RQ_H : sve_mem_ldqr_ss<0b01, "ld1rqh", Z_h, ZPR16, GPR64NoXZRshifted16>;		defm LD1RQ_H : sve_mem_ldqr_ss<0b01, "ld1rqh", Z_h, ZPR16, GPR64NoXZRshifted16>;
defm LD1RQ_W : sve_mem_ldqr_ss<0b10, "ld1rqw", Z_s, ZPR32, GPR64NoXZRshifted32>;		defm LD1RQ_W : sve_mem_ldqr_ss<0b10, "ld1rqw", Z_s, ZPR32, GPR64NoXZRshifted32>;
defm LD1RQ_D : sve_mem_ldqr_ss<0b11, "ld1rqd", Z_d, ZPR64, GPR64NoXZRshifted64>;		defm LD1RQ_D : sve_mem_ldqr_ss<0b11, "ld1rqd", Z_d, ZPR64, GPR64NoXZRshifted64>;

// continuous load with reg+reg addressing.		// continuous load with reg+reg addressing.
defm LD1B : sve_mem_cld_ss<0b0000, "ld1b", Z_b, ZPR8, GPR64NoXZRshifted8>;		defm LD1B : sve_mem_cld_ss<0b0000, "ld1b", Z_b, ZPR8, GPR64NoXZRshifted8>;
defm LD1B_H : sve_mem_cld_ss<0b0001, "ld1b", Z_h, ZPR16, GPR64NoXZRshifted8>;		defm LD1B_H : sve_mem_cld_ss<0b0001, "ld1b", Z_h, ZPR16, GPR64NoXZRshifted8>;

llvm/lib/Target/AArch64/SVEInstrFormats.td


: I<(outs VecList:$Zt), (ins PPR3bAny:$Pg, GPR64sp:$Rn, simm4s16:$imm4),		: I<(outs VecList:$Zt), (ins PPR3bAny:$Pg, GPR64sp:$Rn, simm4s16:$imm4),
let Inst{12-10} = Pg;		let Inst{12-10} = Pg;
let Inst{9-5} = Rn;		let Inst{9-5} = Rn;
let Inst{4-0} = Zt;		let Inst{4-0} = Zt;

let mayLoad = 1;		let mayLoad = 1;
}		}

multiclass sve_mem_ldqr_si<bits<2> sz, string asm, RegisterOperand listty,		multiclass sve_mem_ldqr_si<bits<2> sz, string asm, RegisterOperand listty,
ZPRRegOp zprty> {		ZPRRegOp zprty, ValueType vt1, ValueType vt2, sve_int_ptrue pred> {
def NAME : sve_mem_ldqr_si<sz, asm, listty>;		def NAME : sve_mem_ldqr_si<sz, asm, listty>;
def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn]",		def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn]",
(!cast<Instruction>(NAME) listty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, 0), 1>;		(!cast<Instruction>(NAME) listty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, 0), 1>;
def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn]",		def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn]",
(!cast<Instruction>(NAME) zprty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, 0), 0>;		(!cast<Instruction>(NAME) zprty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, 0), 0>;
def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn, $imm4]",		def : InstAlias<asm # "\t$Zt, $Pg/z, [$Rn, $imm4]",
(!cast<Instruction>(NAME) zprty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, simm4s16:$imm4), 0>;		(!cast<Instruction>(NAME) zprty:$Zt, PPR3bAny:$Pg, GPR64sp:$Rn, simm4s16:$imm4), 0>;

		def : Pat<(vt1 (AArch64duplane128 (vt1 (vector_insert_subvec (vt1 undef), (vt2 (load GPR64sp:$Xn)), (i64 0))), (i64 0))),
		(!cast<Instruction>(NAME) (pred 31), GPR64sp:$Xn, 0)>;
}		}

class sve_mem_ldqr_ss<bits<2> sz, string asm, RegisterOperand VecList,		class sve_mem_ldqr_ss<bits<2> sz, string asm, RegisterOperand VecList,
RegisterOperand gprty>		RegisterOperand gprty>
: I<(outs VecList:$Zt), (ins PPR3bAny:$Pg, GPR64sp:$Rn, gprty:$Rm),		: I<(outs VecList:$Zt), (ins PPR3bAny:$Pg, GPR64sp:$Rn, gprty:$Rm),
asm, "\t$Zt, $Pg/z, [$Rn, $Rm]", "", []>, Sched<[]> {		asm, "\t$Zt, $Pg/z, [$Rn, $Rm]", "", []>, Sched<[]> {
bits<5> Zt;		bits<5> Zt;
bits<3> Pg;		bits<3> Pg;

llvm/test/CodeGen/AArch64/sve-intrinsics-perm-select.ll

; CHECK-NEXT: ret		; CHECK-NEXT: ret
%out = call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %a, i64 4)		%out = call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %a, i64 4)
ret <vscale x 2 x i64> %out		ret <vscale x 2 x i64> %out
}		}

define dso_local <vscale x 2 x double> @dupq_ld1rqd_f64() {		define dso_local <vscale x 2 x double> @dupq_ld1rqd_f64() {
; CHECK-LABEL: dupq_ld1rqd_f64:		; CHECK-LABEL: dupq_ld1rqd_f64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI49_0		; CHECK-NEXT: adrp x8, .LCPI49_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI49_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI49_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.d
		; CHECK-NEXT: ld1rqd { z0.d }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call fast <vscale x 2 x double> @llvm.vector.insert.nxv2f64.v2f64(<vscale x 2 x double> undef, <2 x double> <double 1.000000e+00, double 2.000000e+00>, i64 0)		%1 = tail call fast <vscale x 2 x double> @llvm.vector.insert.nxv2f64.v2f64(<vscale x 2 x double> undef, <2 x double> <double 1.000000e+00, double 2.000000e+00>, i64 0)
%2 = tail call fast <vscale x 2 x double> @llvm.aarch64.sve.dupq.lane.nxv2f64(<vscale x 2 x double> %1, i64 0)		%2 = tail call fast <vscale x 2 x double> @llvm.aarch64.sve.dupq.lane.nxv2f64(<vscale x 2 x double> %1, i64 0)
ret <vscale x 2 x double> %2		ret <vscale x 2 x double> %2
}		}

define dso_local <vscale x 4 x float> @dupq_ld1rqw_f32() {		define dso_local <vscale x 4 x float> @dupq_ld1rqw_f32() {
; CHECK-LABEL: dupq_ld1rqw_f32:		; CHECK-LABEL: dupq_ld1rqw_f32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI50_0		; CHECK-NEXT: adrp x8, .LCPI50_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI50_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI50_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.s
		; CHECK-NEXT: ld1rqw { z0.s }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call fast <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, i64 0)		%1 = tail call fast <vscale x 4 x float> @llvm.vector.insert.nxv4f32.v4f32(<vscale x 4 x float> undef, <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, i64 0)
%2 = tail call fast <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %1, i64 0)		%2 = tail call fast <vscale x 4 x float> @llvm.aarch64.sve.dupq.lane.nxv4f32(<vscale x 4 x float> %1, i64 0)
ret <vscale x 4 x float> %2		ret <vscale x 4 x float> %2
}		}

define dso_local <vscale x 8 x half> @dupq_ld1rqh_f16() {		define dso_local <vscale x 8 x half> @dupq_ld1rqh_f16() {
; CHECK-LABEL: dupq_ld1rqh_f16:		; CHECK-LABEL: dupq_ld1rqh_f16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI51_0		; CHECK-NEXT: adrp x8, .LCPI51_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI51_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI51_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.h
		; CHECK-NEXT: ld1rqh { z0.h }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call fast <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> <half 0xH3C00, half 0xH4000, half 0xH4200, half 0xH4400, half 0xH4500, half 0xH4600, half 0xH4700, half 0xH4800>, i64 0)		%1 = tail call fast <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> undef, <8 x half> <half 0xH3C00, half 0xH4000, half 0xH4200, half 0xH4400, half 0xH4500, half 0xH4600, half 0xH4700, half 0xH4800>, i64 0)
%2 = tail call fast <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %1, i64 0)		%2 = tail call fast <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> %1, i64 0)
ret <vscale x 8 x half> %2		ret <vscale x 8 x half> %2
}		}

define dso_local <vscale x 8 x bfloat> @dupq_ld1rqh_bf16() #0 {		define dso_local <vscale x 8 x bfloat> @dupq_ld1rqh_bf16() #0 {
; CHECK-LABEL: dupq_ld1rqh_bf16:		; CHECK-LABEL: dupq_ld1rqh_bf16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI52_0		; CHECK-NEXT: adrp x8, .LCPI52_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI52_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI52_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.h
		; CHECK-NEXT: ld1rqh { z0.h }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = call <vscale x 8 x bfloat> @llvm.vector.insert.nxv8bf16.v8bf16(<vscale x 8 x bfloat> undef, <8 x bfloat> <bfloat 1.000e+00, bfloat 2.000e+00, bfloat 3.000e+00, bfloat 4.000e+00, bfloat 5.000e+00, bfloat 6.000e+00, bfloat 7.000e+00, bfloat 8.000e+00>, i64 0)		%1 = call <vscale x 8 x bfloat> @llvm.vector.insert.nxv8bf16.v8bf16(<vscale x 8 x bfloat> undef, <8 x bfloat> <bfloat 1.000e+00, bfloat 2.000e+00, bfloat 3.000e+00, bfloat 4.000e+00, bfloat 5.000e+00, bfloat 6.000e+00, bfloat 7.000e+00, bfloat 8.000e+00>, i64 0)
%2 = call <vscale x 8 x bfloat> @llvm.aarch64.sve.dupq.lane.nxv8bf16(<vscale x 8 x bfloat> %1, i64 0)		%2 = call <vscale x 8 x bfloat> @llvm.aarch64.sve.dupq.lane.nxv8bf16(<vscale x 8 x bfloat> %1, i64 0)
ret <vscale x 8 x bfloat> %2		ret <vscale x 8 x bfloat> %2
}		}

define dso_local <vscale x 2 x i64> @dupq_ld1rqd_i64() {		define dso_local <vscale x 2 x i64> @dupq_ld1rqd_i64() {
; CHECK-LABEL: dupq_ld1rqd_i64:		; CHECK-LABEL: dupq_ld1rqd_i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI53_0		; CHECK-NEXT: adrp x8, .LCPI53_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI53_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI53_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.d
		; CHECK-NEXT: ld1rqd { z0.d }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call <vscale x 2 x i64> @llvm.vector.insert.nxv2i64.v2i64(<vscale x 2 x i64> undef, <2 x i64> <i64 1, i64 2>, i64 0)		%1 = tail call <vscale x 2 x i64> @llvm.vector.insert.nxv2i64.v2i64(<vscale x 2 x i64> undef, <2 x i64> <i64 1, i64 2>, i64 0)
%2 = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %1, i64 0)		%2 = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dupq.lane.nxv2i64(<vscale x 2 x i64> %1, i64 0)
ret <vscale x 2 x i64> %2		ret <vscale x 2 x i64> %2
}		}

define dso_local <vscale x 4 x i32> @dupq_ld1rqd_i32() {		define dso_local <vscale x 4 x i32> @dupq_ld1rqw_i32() {
; CHECK-LABEL: dupq_ld1rqd_i32:		; CHECK-LABEL: dupq_ld1rqw_i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI54_0		; CHECK-NEXT: adrp x8, .LCPI54_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI54_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI54_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.s
		; CHECK-NEXT: ld1rqw { z0.s }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i64 0)		%1 = tail call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v4i32(<vscale x 4 x i32> undef, <4 x i32> <i32 1, i32 2, i32 3, i32 4>, i64 0)
%2 = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dupq.lane.nxv4i32(<vscale x 4 x i32> %1, i64 0)		%2 = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dupq.lane.nxv4i32(<vscale x 4 x i32> %1, i64 0)
ret <vscale x 4 x i32> %2		ret <vscale x 4 x i32> %2
}		}

define dso_local <vscale x 8 x i16> @dupq_ld1rqd_i16() {		define dso_local <vscale x 8 x i16> @dupq_ld1rqh_i16() {
; CHECK-LABEL: dupq_ld1rqd_i16:		; CHECK-LABEL: dupq_ld1rqh_i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI55_0		; CHECK-NEXT: adrp x8, .LCPI55_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI55_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI55_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.h
		; CHECK-NEXT: ld1rqh { z0.h }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call <vscale x 8 x i16> @llvm.vector.insert.nxv8i16.v8i16(<vscale x 8 x i16> undef, <8 x i16> <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>, i64 0)		%1 = tail call <vscale x 8 x i16> @llvm.vector.insert.nxv8i16.v8i16(<vscale x 8 x i16> undef, <8 x i16> <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>, i64 0)
%2 = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dupq.lane.nxv8i16(<vscale x 8 x i16> %1, i64 0)		%2 = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dupq.lane.nxv8i16(<vscale x 8 x i16> %1, i64 0)
ret <vscale x 8 x i16> %2		ret <vscale x 8 x i16> %2
}		}

define dso_local <vscale x 16 x i8> @dupq_ld1rqd_i8() {		define dso_local <vscale x 16 x i8> @dupq_ld1rqb_i8() {
; CHECK-LABEL: dupq_ld1rqd_i8:		; CHECK-LABEL: dupq_ld1rqb_i8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: adrp x8, .LCPI56_0		; CHECK-NEXT: adrp x8, .LCPI56_0
; CHECK-NEXT: ldr q0, [x8, :lo12:.LCPI56_0]		; CHECK-NEXT: add x8, x8, :lo12:.LCPI56_0
; CHECK-NEXT: mov z0.q, q0		; CHECK-NEXT: ptrue p0.b
		; CHECK-NEXT: ld1rqb { z0.b }, p0/z, [x8]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call <vscale x 16 x i8> @llvm.vector.insert.nxv16i8.v16i8(<vscale x 16 x i8> undef, <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, i64 0)		%1 = tail call <vscale x 16 x i8> @llvm.vector.insert.nxv16i8.v16i8(<vscale x 16 x i8> undef, <16 x i8> <i8 1, i8 2, i8 3, i8 4, i8 5, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 12, i8 13, i8 14, i8 15, i8 16>, i64 0)
%2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dupq.lane.nxv16i8(<vscale x 16 x i8> %1, i64 0)		%2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dupq.lane.nxv16i8(<vscale x 16 x i8> %1, i64 0)
ret <vscale x 16 x i8> %2		ret <vscale x 16 x i8> %2
}		}

;		;
; EXT		; EXT