This is an archive of the discontinued LLVM Phabricator instance.

Differential D99324

[AArch64][SVE] Codegen dup_lane for dup(vector_extract)
Closed · Public

Authored by junparser on Mar 25 2021, 2:13 AM.

Details

Reviewers

david-arm
sdesmalen
peterwaller-arm
paulwalker-arm
efriedma

Commits

rG1af373c67369: [AArch64][SVE] Codegen dup_lane for dup(vector_extract)

Summary

As the title says, this patch adds a pattern for dup(vector_extract(vec, idx)) that matches the dup_lane instruction: DUP <Zd>.<T>, <Zn>.<T>[<imm>]

Test Plan: check-llvm

Diff Detail

Unit Tests: Failed

	Time	Test
	70 ms	x64 debian > LLVM.Transforms/SimpleLoopUnswitch::partial-unswitch.ll
	90 ms	x64 windows > LLVM.Transforms/SimpleLoopUnswitch::partial-unswitch.ll

Event Timeline

junparser created this revision. · Mar 25 2021, 2:13 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett. · Mar 25 2021, 2:13 AM

junparser requested review of this revision. · Mar 25 2021, 2:13 AM

Herald added a project: Restricted Project. · Mar 25 2021, 2:13 AM

Herald added a subscriber: llvm-commits.

junparser edited the summary of this revision. · Mar 25 2021, 2:14 AM

junparser edited the summary of this revision.

ChuanqiXu added a subscriber: ChuanqiXu. · Mar 25 2021, 2:17 AM

Harbormaster completed remote builds in B95649: Diff 333235. · Mar 25 2021, 2:43 AM

paulwalker-arm added a comment.

I'm not saying all the pieces will come for free, but this feels like an intrinsic optimisation problem rather than an instruction selection one. What about extending SVEIntrinsicOpts.cpp to convert the pattern to a stock splat_vector(extract_vector_elt(vec, idx)) and then letting the code generator decide how best to lower the LLVM way of doing things? This'll mean we solve the problem once for ACLE and auto-vectorisation.

junparser added a comment.

In reply to @paulwalker-arm (D99324#2650064):

Actually, it is an isel issue; the svdup_lane in the title is just where I found it.
1) There is no intrinsic that maps directly to the dup (index) instruction. vector_extract may be lowered with dup (index), but that is not enough.
2) The svdup_lane ACLE intrinsic is emitted as sve.dup.x + sve.tbl in LLVM IR, converted to AArch64tbl(..., splat_vector(..., constant)), and then lowered to AArch64tbl(..., DUP(..., imm)). This is the pattern this patch tries to match.
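
The equivalence behind this pattern can be modelled in a few lines of Python. This is only a sketch over plain lists, not LLVM code: `splat`, `tbl` and `dup_lane` are hypothetical helper names, the vector is fixed at 8 lanes for illustration, and the index is assumed to be a constant that is in range.

```python
from typing import List


def splat(value: int, n: int) -> List[int]:
    """Every lane holds the same value (models sve.dup.x / splat_vector)."""
    return [value] * n


def tbl(table: List[int], indices: List[int]) -> List[int]:
    """Table lookup: lane i reads table[indices[i]].

    An out-of-range index produces 0, as the SVE TBL instruction does.
    """
    n = len(table)
    return [table[i] if 0 <= i < n else 0 for i in indices]


def dup_lane(vec: List[int], idx: int) -> List[int]:
    """Models splat(extract(vec, idx)); only defined here for in-range idx."""
    if not 0 <= idx < len(vec):
        raise ValueError("index out of range for this model")
    return splat(vec[idx], len(vec))


vec = [10, 11, 12, 13, 14, 15, 16, 17]
via_tbl = tbl(vec, splat(3, len(vec)))  # what svdup_lane expands to
via_dup = dup_lane(vec, 3)              # what a single indexed DUP computes
print(via_tbl == via_dup)
```

When the index vector is a splat of one in-range constant, every lane of the lookup reads the same source lane, so the table lookup and the index splat can collapse into one lane duplicate.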

paulwalker-arm added a comment.

In reply to @junparser (D99324#2650100):

Sure, I understand that. But the problem of good code generation to duplicate a vector lane seems like a generic one, and thus we can solve that first. Then we can canonicalise ACLE-related intrinsic patterns to stock LLVM IR and thus not require multiple solutions to the same problem. In the future this will also have the benefit of allowing other stock LLVM transforms to kick in that would otherwise not understand the SVE-specific intrinsics.

junparser added a comment.

In reply to @paulwalker-arm (D99324#2650103):

OK, I understand your point. splat_vector(extract_vector_elt(vec, idx)) looks OK to me. Why do you prefer to do it in SVEIntrinsicOpts.cpp? What about doing this in PerformDAGCombine on the AArch64TBL node?

junparser added a comment.

In reply to @junparser (D99324#2650130):

The reason I prefer to handle this in PerformDAGCombine is that what we want to match is AArch64tbl(..., splat_vector(..., constant)) rather than sve.tbl + sve.dup.x, since shufflevector can also be converted to splat_vector.

paulwalker-arm added a comment.

In reply to @junparser's two comments above:

I feel the higher up the chain (the earlier) we do this, the better. Outside of the ACLE intrinsics I wouldn't expect scalable AArch64ISD::TBL to be created unless that's exactly what the code generator wants. It's worth highlighting that this sort of SVE ACLE intrinsics -> LLVM IR transform will not be an isolated case. We deliberately created intrinsics even for common transforms so that we could minimise use of stock LLVM IR and thus limit failures due to missing scalable vector support. As LLVM matures I would expect us to utilise stock LLVM IR more and more, for example converting dups to shufflevector, ptrue-all predicated operations to normal LLVM bin ops, etc.

That said, if you think PerformDAGCombine (presumably performIntrinsicCombine) is the best place today then fine. It can easily be moved up the chain when we're more comfortable.

junparser added a comment.

In reply to @paulwalker-arm (D99324#2650216):

I agree; we also want to use LLVM scalable vector IR. I'll extend the tbl pattern in SVEIntrinsicOpts.cpp as a standalone patch, and then update this one.

bin.cheng-ali added a comment.

In reply to @paulwalker-arm (D99324#2650216):

Hi Paul,
Excuse me, I am new to the LLVM backend. One question: what does "stock LLVM IR" refer to in the comment above?

As for the patch, I am trying to understand the issue. Do you suggest we should first introduce a DUP_LANE pattern similar to SVDOT_LANE_S, so that clang CodeGen doesn't generate dup.x when possible?

Thanks

junparser added a comment.

Hi @paulwalker-arm, I'll update this patch based on D99412 later.

paulwalker-arm added a comment.

In D99324#2652288, @bin.cheng-ali wrote:

> Excuse me, I am new to the LLVM backend. One question: what does "stock LLVM IR" refer to in the comment above?

By stock LLVM IR I'm referring to the LLVM instructions as defined by the LangRef plus non-target-specific intrinsics.

> As for the patch, I am trying to understand the issue. Do you suggest we should first introduce a DUP_LANE pattern similar to SVDOT_LANE_S, so that clang CodeGen doesn't generate dup.x when possible?

I'm not sure I fully understand your question, but in general when it comes to code generation I'm trying to ensure, where possible, that we have a canonicalised representation so that we minimise the number of patterns (IR or DAG) that end up resolving to the same instruction.

bin.cheng-ali added a comment.

In reply to @paulwalker-arm (D99324#2652779):

I was thinking of introducing an SVE intrinsic pattern like the one below, so that clang CodeGen generates dup_index directly for a constant immediate index:
def SVDUP_LANE_IMM : SInst<"svdup_lane[_{d}]", "ddi", "csilUcUsUiUlhfd", MergeNone, "aarch64_sve_dup_index", [], [ImmCheck<1, ImmCheck0_31, 1>]>;
However, this seems impossible because this builtin is the same as SVDUP_LANE, which results in a compilation error.

Address the comment.

junparser retitled this revision from [AArch64][SVE] Simplify codegen of svdup_lane intrinsic to [AArch64][SVE] Codegen dup_lane for dup(vector_extract). · Mar 28 2021, 11:21 PM

Harbormaster completed remote builds in B96045: Diff 333771. · Mar 28 2021, 11:55 PM

sdesmalen added inline comments. · Mar 29 2021, 4:06 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 624 (on Diff #333771): This isn't entirely correct, because an nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, whereas an nxv4f16 has vscale x 4 elements with vscale x 4 gaps in between, e.g. `<elt0, _, elt1, _, ...>`. That means the element index must be multiplied by 2 in this case (and in the nxv2f32 case), and by 4 in the nxv2f16 case.

paulwalker-arm added inline comments. · Mar 29 2021, 4:43 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 624 (on Diff #333771): While logically true, I think in practice you'd rewrite the pattern so the instruction's element type matches that of the "packed" vector associated with the DAG result's element count (i.e. D for nxv2, S for nxv4). So in this instance something like:
def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
(DUP_ZZI_S ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
So in essence all `nxv4` results are considered to be duplicating floats, and all `nxv2` results to be duplicating doubles. Is it possible to move the patterns into the multiclass for sve_int_perm_dup_i?
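
The two inline comments above can be checked against a small Python model of the unpacked layout. It is a sketch under stated assumptions, not backend code: vscale is taken to be 1, the register is viewed as eight 16-bit slots, `None` stands for an unused slot, and `unpack_nxv4f16`, `dup_h` and `dup_s` are hypothetical names.

```python
from typing import List, Optional

Lane = Optional[int]  # None marks an unused 16-bit slot (a "gap")


def unpack_nxv4f16(elts: List[int]) -> List[Lane]:
    """Lay out four f16 values as <elt0, _, elt1, _, ...>."""
    reg: List[Lane] = []
    for e in elts:
        reg.extend([e, None])
    return reg


def dup_h(reg: List[Lane], index: int) -> List[Lane]:
    """Model of an indexed DUP on 16-bit lanes: copy one slot everywhere."""
    return [reg[index]] * len(reg)


def dup_s(reg: List[Lane], index: int) -> List[Lane]:
    """Model of an indexed DUP on 32-bit lanes: copy one two-slot container."""
    container = reg[2 * index : 2 * index + 2]
    return container * (len(reg) // 2)


reg = unpack_nxv4f16([100, 101, 102, 103])

wrong = dup_h(reg, 1)    # unscaled index on .h lanes reads a gap
scaled = dup_h(reg, 2)   # index multiplied by 2 reads element 1
packed = dup_s(reg, 1)   # .s lanes with the original index read element 1
print(wrong, scaled, packed, sep="\n")
```

In this model the unscaled 16-bit duplicate reads a gap, the scaled one reads the intended element, and the 32-bit duplicate with the unmodified index returns a result that is itself in unpacked form, which is the variant the review settles on.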

junparser added inline comments. · Mar 29 2021, 4:54 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 624 (on Diff #333771), in reply to sdesmalen: This is quite different from what I thought. For nxv4f16, I thought the upper 64 bits would be empty. Where can I find these rules? I haven't seen them anywhere.
Line 624 (on Diff #333771), in reply to paulwalker-arm: OK, I'll move them to sve_int_perm_dup_i.

Address comments.

sdesmalen added inline comments. · Mar 29 2021, 5:27 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 624 (on Diff #333771): I don't believe we have explicitly described these rules anywhere. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 that looks like `<11110000 | ... | 11110000>`, where the bit pattern repeats for each 128-bit chunk. We can, however, always use the unpacked format, because an operation on `nxv4f16` can use the predicate that would be used for an `nxv4f32`, which disables every other lane.
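
The predicate argument can be illustrated with a short Python model. It assumes the usual SVE predicate layout (one predicate bit per vector byte, with an element governed by the bit of its lowest byte) and a single 128-bit chunk; `ptrue` and `active_lanes` are hypothetical helper names.

```python
from typing import List


def ptrue(elt_bytes: int, vl_bytes: int = 16) -> List[int]:
    """All-true predicate for elements of elt_bytes: one bit per vector
    byte, with only the lowest bit of each element set."""
    return [1 if b % elt_bytes == 0 else 0 for b in range(vl_bytes)]


def active_lanes(pred: List[int], elt_bytes: int) -> List[bool]:
    """Lanes of size elt_bytes enabled by pred (bit of the lowest byte)."""
    return [pred[b] == 1 for b in range(0, len(pred), elt_bytes)]


p_s = ptrue(4)                    # predicate built for 32-bit elements
h_lanes = active_lanes(p_s, 2)    # reused by a 16-bit operation
print(h_lanes)
```

Under these assumptions the 32-bit all-true predicate enables 16-bit lanes 0, 2, 4 and 6 and disables the rest, which matches the slots that hold data in the unpacked `nxv4f16` layout.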

junparser added inline comments. · Mar 29 2021, 5:33 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 624 (on Diff #333771), in reply to sdesmalen: Thanks for explaining this!

Harbormaster completed remote builds in B96088: Diff 333826. · Mar 29 2021, 6:01 AM

paulwalker-arm accepted this revision. · Mar 29 2021, 7:45 AM

This revision is now accepted and ready to land. · Mar 29 2021, 7:45 AM

Just wanted to add that the patch summary no longer matches the intent of the patch.

junparser edited the summary of this revision. · Mar 29 2021, 6:51 PM

This revision was landed with ongoing or failed builds. · Mar 29 2021, 7:35 PM

Closed by commit rG1af373c67369: [AArch64][SVE] Codegen dup_lane for dup(vector_extract) (authored by junparser).

This revision was automatically updated to reflect the committed changes.

junparser added a commit: rG1af373c67369: [AArch64][SVE] Codegen dup_lane for dup(vector_extract).

Revision Contents

Path	Size
llvm/lib/Target/AArch64/SVEInstrFormats.td	24 lines
llvm/test/CodeGen/AArch64/aarch64-dup-extract-scalable.ll	126 lines
llvm/test/CodeGen/AArch64/sve-ld-post-inc.ll	2 lines

Diff 333826

llvm/lib/Target/AArch64/SVEInstrFormats.td

@@ 1,013 lines not shown @@ multiclass sve_int_perm_dup_i<string asm> {
   def : InstAlias<"mov $Zd, $Hn",
                   (!cast<Instruction>(NAME # _H) ZPR16:$Zd, FPR16asZPR:$Hn, 0), 2>;
   def : InstAlias<"mov $Zd, $Sn",
                   (!cast<Instruction>(NAME # _S) ZPR32:$Zd, FPR32asZPR:$Sn, 0), 2>;
   def : InstAlias<"mov $Zd, $Dn",
                   (!cast<Instruction>(NAME # _D) ZPR64:$Zd, FPR64asZPR:$Dn, 0), 2>;
   def : InstAlias<"mov $Zd, $Qn",
                   (!cast<Instruction>(NAME # _Q) ZPR128:$Zd, FPR128asZPR:$Qn, 0), 2>;
+
+  // Duplicate extracted element of vector into all vector elements
+  def : Pat<(nxv16i8 (AArch64dup (i32 (vector_extract (nxv16i8 ZPR:$vec), sve_elm_idx_extdup_b:$index)))),
+            (!cast<Instruction>(NAME # _B) ZPR:$vec, sve_elm_idx_extdup_b:$index)>;
+  def : Pat<(nxv8i16 (AArch64dup (i32 (vector_extract (nxv8i16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+            (!cast<Instruction>(NAME # _H) ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+  def : Pat<(nxv4i32 (AArch64dup (i32 (vector_extract (nxv4i32 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
+            (!cast<Instruction>(NAME # _S) ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
+  def : Pat<(nxv2i64 (AArch64dup (i64 (vector_extract (nxv2i64 ZPR:$vec), sve_elm_idx_extdup_d:$index)))),
+            (!cast<Instruction>(NAME # _D) ZPR:$vec, sve_elm_idx_extdup_d:$index)>;
+  def : Pat<(nxv8f16 (AArch64dup (f16 (vector_extract (nxv8f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+            (!cast<Instruction>(NAME # _H) ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+  def : Pat<(nxv8bf16 (AArch64dup (bf16 (vector_extract (nxv8bf16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+            (!cast<Instruction>(NAME # _H) ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+  def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
+            (!cast<Instruction>(NAME # _S) ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
+  def : Pat<(nxv2f16 (AArch64dup (f16 (vector_extract (nxv2f16 ZPR:$vec), sve_elm_idx_extdup_d:$index)))),
+            (!cast<Instruction>(NAME # _D) ZPR:$vec, sve_elm_idx_extdup_d:$index)>;
+  def : Pat<(nxv4f32 (AArch64dup (f32 (vector_extract (nxv4f32 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
+            (!cast<Instruction>(NAME # _S) ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
+  def : Pat<(nxv2f32 (AArch64dup (f32 (vector_extract (nxv2f32 ZPR:$vec), sve_elm_idx_extdup_d:$index)))),
+            (!cast<Instruction>(NAME # _D) ZPR:$vec, sve_elm_idx_extdup_d:$index)>;
+  def : Pat<(nxv2f64 (AArch64dup (f64 (vector_extract (nxv2f64 ZPR:$vec), sve_elm_idx_extdup_d:$index)))),
+            (!cast<Instruction>(NAME # _D) ZPR:$vec, sve_elm_idx_extdup_d:$index)>;
 }

 class sve_int_perm_tbl<bits<2> sz8_64, bits<2> opc, string asm, ZPRRegOp zprty,
                        RegisterOperand VecList>
 : I<(outs zprty:$Zd), (ins VecList:$Zn, zprty:$Zm),
   asm, "\t$Zd, $Zn, $Zm",
   "",
   []>, Sched<[]> {
@@ 6,994 lines not shown @@
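
The added patterns pick the instruction variant from the result's element count rather than from its element type. A small Python sketch of that selection rule follows; `container_bits` and `dup_lane_asm` are hypothetical names, and the index limit is written from my understanding that the indexed DUP immediate addresses only the first 512 bits of the register, which is an assumption about the architecture rather than something stated in this review.

```python
SUFFIX_BY_BITS = {8: "b", 16: "h", 32: "s", 64: "d"}


def container_bits(min_elts: int) -> int:
    """Bits per lane slot when a 128-bit granule holds min_elts lanes."""
    return 128 // min_elts


def dup_lane_asm(min_elts: int, index: int, reg: str = "z0") -> str:
    """Assembly expected for dup(vector_extract(vec, index)) on an
    <vscale x min_elts x T> vector, whatever the width of T."""
    bits = container_bits(min_elts)
    max_index = 512 // bits - 1  # assumed limit of the immediate
    if not 0 <= index <= max_index:
        raise ValueError(f"index {index} outside 0..{max_index}")
    sfx = SUFFIX_BY_BITS[bits]
    return f"mov {reg}.{sfx}, {reg}.{sfx}[{index}]"


print(dup_lane_asm(8, 1))  # nxv8f16, nxv8bf16, nxv8i16
print(dup_lane_asm(4, 1))  # nxv4f16, nxv4f32, nxv4i32
print(dup_lane_asm(2, 1))  # nxv2f16, nxv2f32, nxv2f64, nxv2i64
```

The strings produced for index 1 agree with the CHECK lines in aarch64-dup-extract-scalable.ll, for example `mov z0.s, z0.s[1]` for `<vscale x 4 x half>`.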

llvm/test/CodeGen/AArch64/aarch64-dup-extract-scalable.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple aarch64-none-linux-gnu -mattr=+sve | FileCheck %s

				define <vscale x 16 x i8> @dup_extract_i8(<vscale x 16 x i8> %data) {
				; CHECK-LABEL: dup_extract_i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.b, z0.b[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 16 x i8> %data, i8 1
				%.splatinsert = insertelement <vscale x 16 x i8> poison, i8 %1, i32 0
				%.splat = shufflevector <vscale x 16 x i8> %.splatinsert, <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer
				ret <vscale x 16 x i8> %.splat
				}

				define <vscale x 8 x i16> @dup_extract_i16(<vscale x 8 x i16> %data) {
				; CHECK-LABEL: dup_extract_i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.h, z0.h[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 8 x i16> %data, i16 1
				%.splatinsert = insertelement <vscale x 8 x i16> poison, i16 %1, i32 0
				%.splat = shufflevector <vscale x 8 x i16> %.splatinsert, <vscale x 8 x i16> poison, <vscale x 8 x i32> zeroinitializer
				ret <vscale x 8 x i16> %.splat
				}

				define <vscale x 4 x i32> @dup_extract_i32(<vscale x 4 x i32> %data) {
				; CHECK-LABEL: dup_extract_i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 4 x i32> %data, i32 1
				%.splatinsert = insertelement <vscale x 4 x i32> poison, i32 %1, i32 0
				%.splat = shufflevector <vscale x 4 x i32> %.splatinsert, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				ret <vscale x 4 x i32> %.splat
				}

				define <vscale x 2 x i64> @dup_extract_i64(<vscale x 2 x i64> %data) {
				; CHECK-LABEL: dup_extract_i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 2 x i64> %data, i64 1
				%.splatinsert = insertelement <vscale x 2 x i64> poison, i64 %1, i32 0
				%.splat = shufflevector <vscale x 2 x i64> %.splatinsert, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				ret <vscale x 2 x i64> %.splat
				}

				define <vscale x 8 x half> @dup_extract_f16(<vscale x 8 x half> %data) {
				; CHECK-LABEL: dup_extract_f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.h, z0.h[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 8 x half> %data, i16 1
				%.splatinsert = insertelement <vscale x 8 x half> poison, half %1, i32 0
				%.splat = shufflevector <vscale x 8 x half> %.splatinsert, <vscale x 8 x half> poison, <vscale x 8 x i32> zeroinitializer
				ret <vscale x 8 x half> %.splat
				}

				define <vscale x 4 x half> @dup_extract_f16_4(<vscale x 4 x half> %data) {
				; CHECK-LABEL: dup_extract_f16_4:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 4 x half> %data, i16 1
				%.splatinsert = insertelement <vscale x 4 x half> poison, half %1, i32 0
				%.splat = shufflevector <vscale x 4 x half> %.splatinsert, <vscale x 4 x half> poison, <vscale x 4 x i32> zeroinitializer
				ret <vscale x 4 x half> %.splat
				}

				define <vscale x 2 x half> @dup_extract_f16_2(<vscale x 2 x half> %data) {
				; CHECK-LABEL: dup_extract_f16_2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 2 x half> %data, i16 1
				%.splatinsert = insertelement <vscale x 2 x half> poison, half %1, i32 0
				%.splat = shufflevector <vscale x 2 x half> %.splatinsert, <vscale x 2 x half> poison, <vscale x 2 x i32> zeroinitializer
				ret <vscale x 2 x half> %.splat
				}

				define <vscale x 8 x bfloat> @dup_extract_bf16(<vscale x 8 x bfloat> %data) #0 {
				; CHECK-LABEL: dup_extract_bf16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.h, z0.h[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 8 x bfloat> %data, i16 1
				%.splatinsert = insertelement <vscale x 8 x bfloat> poison, bfloat %1, i32 0
				%.splat = shufflevector <vscale x 8 x bfloat> %.splatinsert, <vscale x 8 x bfloat> poison, <vscale x 8 x i32> zeroinitializer
				ret <vscale x 8 x bfloat> %.splat
				}

				define <vscale x 4 x float> @dup_extract_f32(<vscale x 4 x float> %data) {
				; CHECK-LABEL: dup_extract_f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 4 x float> %data, i32 1
				%.splatinsert = insertelement <vscale x 4 x float> poison, float %1, i32 0
				%.splat = shufflevector <vscale x 4 x float> %.splatinsert, <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
				ret <vscale x 4 x float> %.splat
				}

				define <vscale x 2 x float> @dup_extract_f32_2(<vscale x 2 x float> %data) {
				; CHECK-LABEL: dup_extract_f32_2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 2 x float> %data, i32 1
				%.splatinsert = insertelement <vscale x 2 x float> poison, float %1, i32 0
				%.splat = shufflevector <vscale x 2 x float> %.splatinsert, <vscale x 2 x float> poison, <vscale x 2 x i32> zeroinitializer
				ret <vscale x 2 x float> %.splat
				}

				define <vscale x 2 x double> @dup_extract_f64(<vscale x 2 x double> %data) {
				; CHECK-LABEL: dup_extract_f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: ret
				%1 = extractelement <vscale x 2 x double> %data, i64 1
				%.splatinsert = insertelement <vscale x 2 x double> poison, double %1, i32 0
				%.splat = shufflevector <vscale x 2 x double> %.splatinsert, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
				ret <vscale x 2 x double> %.splat
				}

				; +bf16 is required for the bfloat version.
				attributes #0 = { "target-features"="+sve,+bf16" }

llvm/test/CodeGen/AArch64/sve-ld-post-inc.ll

@@ 23 lines not shown @@ ; CHECK-NEXT: ret
   ret <vscale x 4 x i32> %ins
 }

 define <vscale x 2 x double> @test_post_ld1_dup(double* %a, double** %ptr, i64 %inc) {
 ; CHECK-LABEL: test_post_ld1_dup:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: ldr d0, [x0]
 ; CHECK-NEXT: add x8, x0, x2, lsl #3
-; CHECK-NEXT: mov z0.d, d0
 ; CHECK-NEXT: str x8, [x1]
+; CHECK-NEXT: mov z0.d, d0
 ; CHECK-NEXT: ret
   %load = load double, double* %a
   %dup = call <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double %load)
   %gep = getelementptr double, double* %a, i64 %inc
   store double* %gep, double** %ptr
   ret <vscale x 2 x double> %dup
 }

 declare <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double)