Details

Reviewers

t.p.northover
dmgreen
SjoerdMeijer
peter.smith

Commits

rG232953f9962d: [AArch64] Add pattern for SQDML*Lv1i32_indexed

Summary

There was no pattern to fold into these instructions. This patch adds
the pattern obtained from the following ACLE intrinsics so that they
generate sqdmlal/sqdmlsl instructions instead of separate sqdmull and
sqadd/sqsub instructions:

vqdmlalh_s16, vqdmlslh_s16
vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, vqdmlslh_laneq_s16 (when the lane index is 0)

It also modifies the result of the existing pattern for the latter, when
the lane index is not 0, to use the v1i32_indexed instructions instead
of the v4i16_indexed ones.
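
For readers unfamiliar with these intrinsics, the following is a minimal scalar sketch in portable C of what the folded pattern computes: a saturating doubling multiply (lane 0 of sqdmull) followed by a saturating add or subtract. The `model_*` names and the `sat32` helper are hypothetical and exist only for illustration; they are not part of LLVM or `arm_neon.h`.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value into the int32_t range (signed saturation). */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Lane 0 of sqdmull: saturating doubling multiply of two 16-bit values. */
static int32_t model_sqdmull_h(int16_t b, int16_t c) {
    return sat32(2 * (int64_t)b * (int64_t)c);
}

/* Scalar sqadd / sqsub on 32-bit values. */
static int32_t model_sqadd_s(int32_t a, int32_t p) { return sat32((int64_t)a + p); }
static int32_t model_sqsub_s(int32_t a, int32_t p) { return sat32((int64_t)a - p); }

/* Hypothetical model of vqdmlalh_lane_s16(a, b, v, lane). */
static int32_t model_vqdmlalh_lane_s16(int32_t a, int16_t b,
                                       const int16_t v[4], int lane) {
    return model_sqadd_s(a, model_sqdmull_h(b, v[lane]));
}

/* Hypothetical model of vqdmlslh_lane_s16(a, b, v, lane). */
static int32_t model_vqdmlslh_lane_s16(int32_t a, int16_t b,
                                       const int16_t v[4], int lane) {
    return model_sqsub_s(a, model_sqdmull_h(b, v[lane]));
}
```

With `v = {1, 2, 3, 4}`, calling `model_vqdmlalh_lane_s16(10, 3, v, 1)` accumulates `2 * 3 * 2` onto `10`. The non-lane forms (vqdmlalh_s16, vqdmlslh_s16) correspond to passing a single 16-bit value instead of a vector lane.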

Fixes #49997.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

overmighty created this revision. (Aug 11 2022, 10:46 AM)

Herald added a project: Restricted Project. (Aug 11 2022, 10:46 AM)

Herald added subscribers: arphaman, hiraditya, kristof.beyls.

overmighty requested review of this revision. (Aug 11 2022, 10:46 AM)

overmighty added inline comments. (Aug 11 2022, 11:32 AM)

llvm/lib/Target/AArch64/AArch64InstrFormats.td
8849–8850	This matches the vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, and vqdmlslh_laneq_s16 ACLE functions when the lane is not 0. I could simply remove this definition, and the SQDML*Lv1i32_indexed instructions would be used instead; however, that would result in longer AArch64 code (an extra dup instruction). Is this FIXME really an issue?

Harbormaster completed remote builds in B180719: Diff 451908. (Aug 11 2022, 12:24 PM)

overmighty added inline comments. (Aug 11 2022, 2:02 PM)

llvm/lib/Target/AArch64/AArch64InstrFormats.td
8896	It looks like the second `FPR32Op` here is the cause of the failing tests, but it is required for the ISel DAG pattern below, which is the pattern of the ACLE functions mentioned in the diff summary. I doubt that we want to change the tests to fit this diff either, as instructions such as `sqdmlal.h s0, h0, v0[7]` are valid and should not result in `llvm-mc` errors.

Sounds OK to me if you can do it without changing the register types.

(I was originally unsure; for some of these instructions it is not correct to fold sqadd+sqmul into sqmla. But in this case it seems valid: sqdmlsl does perform two saturation steps.)
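
To illustrate why the number of saturation steps matters, here is a small sketch in portable C. It compares the two-step form (saturate the doubled product, then saturate the sum), which is what the sqdmull + sqadd sequence computes, against a hypothetical single-saturation variant. The function names are made up for this example, and `one_step_mla` does not claim to model any particular instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value into the int32_t range (signed saturation). */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Two saturation steps: sat(a + sat(2*b*c)). */
static int32_t two_step_mla(int32_t a, int16_t b, int16_t c) {
    int32_t prod = sat32(2 * (int64_t)b * (int64_t)c);
    return sat32((int64_t)a + prod);
}

/* Hypothetical single saturation, for contrast only: sat(a + 2*b*c). */
static int32_t one_step_mla(int32_t a, int16_t b, int16_t c) {
    return sat32((int64_t)a + 2 * (int64_t)b * (int64_t)c);
}
```

The two agree on ordinary inputs but diverge when the doubled product itself overflows: with `a = -1` and `b = c = INT16_MIN`, the two-step form clamps the product to `INT32_MAX` before adding, so the results differ by one. A fold is only safe when the fused instruction saturates at the same points as the sequence it replaces.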

It may need a separate Pat so that it can generate an EXTRACT_SUBREG, I'm not sure. What happens if the lane index is not 0?

A separate Pat is now used instead of setting the pattern in the v1i32_indexed instructions' definition, so that changing the instructions' register types is no longer required.

The tests have been updated.

@dmgreen When the lane index is not 0, the Pat from lines 8852 to 8862 in AArch64InstrFormats.td is used. See my comment on line 8862 where I also ask about the FIXME comment before the said Pat.

Harbormaster completed remote builds in B181099: Diff 452428. (Aug 13 2022, 9:47 AM)

mingmingl added a subscriber: mingmingl. (Aug 14 2022, 12:07 AM)

The (v4i16 (scalar_to_vector (i32 FPR32Op:$Rn))) part of the pattern was replaced with what it already folds into: (v4i16 V64:$Rn), as in the existing pattern for the vqdml*lh_lane*_s16 functions when given a lane index other than 0. The (i32 FPR32Op:$Rd) part of the result has been simplified into FPR32Op:$Rd too.

The existing Pat for vqdml*lh_lane*_s16 with lane indexes other than 0 has been modified to use the v1i32_indexed instructions instead of the v4i16_indexed ones. I figured this was a better thing to do than ask if the FIXME comment preceding it really is an issue or not. I am not sure what that comment meant by "but an intermediate EXTRACT_SUBREG would be untyped." Relevant tests have been updated.
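
As a rough illustration of why switching the output from the v4i16_indexed form to the v1i32_indexed form preserves the result, the sketch below models both in portable C. It assumes the old output placed the accumulator in lane 0 of a vector, ran the 4-lane by-element operation, and read lane 0 back, while the new output computes only that lane. The function names are hypothetical, and the other accumulator lanes are set to zero here purely for the model.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value into the int32_t range (signed saturation). */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Old output shape: 4-lane sqdmlal by element, result taken from lane 0. */
static int32_t via_v4i16_indexed(int32_t rd, const int16_t rn[4],
                                 const int16_t rm[8], int idx) {
    int32_t acc[4] = { rd, 0, 0, 0 };
    for (int i = 0; i < 4; i++) {
        int32_t prod = sat32(2 * (int64_t)rn[i] * (int64_t)rm[idx]);
        acc[i] = sat32((int64_t)acc[i] + prod);
    }
    return acc[0];
}

/* New output shape: scalar sqdmlal by element, using lane 0 of rn only. */
static int32_t via_v1i32_indexed(int32_t rd, const int16_t rn[4],
                                 const int16_t rm[8], int idx) {
    int32_t prod = sat32(2 * (int64_t)rn[0] * (int64_t)rm[idx]);
    return sat32((int64_t)rd + prod);
}
```

Because each lane is computed independently, the values in lanes 1 to 3 of `rn` never influence lane 0, so both forms return the same value for any `idx` in 0..7.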

I tried to follow the indentation style that the other DAGs in the SIMDIndexedLongSQDMLXSDTied multiclass seemed to have.

Harbormaster completed remote builds in B181263: Diff 452644. (Aug 15 2022, 7:25 AM)

Can you add a testcase where the extractelement is not the 0 lane? It would be good to have tests to make sure that the new pattern with non-zero offset works OK.

The moved pattern looks OK from what I can see. The input hasn't changed, just been reformatted? Just the output of it has changed? I agree that I'm not sure what the FIXME meant. Tablegen can be finicky at times.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
8908	Does this need to be 0, if we use the lowest lane in the output pattern `(EXTRACT_SUBREG V64:$Rn, hsub)`?

The VectorIndexH:$idx part of the new pattern has been replaced with (i64 0) to avoid unwanted matches that would result in incorrect code generation.
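
The following portable C sketch shows the kind of mismatch this guards against. It assumes the output pattern is the final one from this revision, which reads lane 0 of both vector operands. `dag_extract_lane` models the input DAG (vector sqdmull, extract at an arbitrary lane, then sqadd), and `selected_scalar` models the selected instruction; both names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value into the int32_t range (signed saturation). */
static int32_t sat32(int64_t x) {
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Input DAG: sqdmull on <4 x i16>, extractelement at `lane`, then sqadd. */
static int32_t dag_extract_lane(int32_t a, const int16_t b[4],
                                const int16_t c[4], int lane) {
    int32_t prod = sat32(2 * (int64_t)b[lane] * (int64_t)c[lane]);
    return sat32((int64_t)a + prod);
}

/* Selected instruction: lane 0 of b (hsub) times lane 0 of c (index 0). */
static int32_t selected_scalar(int32_t a, const int16_t b[4],
                               const int16_t c[4]) {
    int32_t prod = sat32(2 * (int64_t)b[0] * (int64_t)c[0]);
    return sat32((int64_t)a + prod);
}
```

With `b = {1, 10, 0, 0}`, `c = {2, 20, 0, 0}` and `a = 100`, extracting lane 0 matches the scalar result, but extracting lane 1 does not. Constraining the extract index to `(i64 0)` keeps the pattern from matching the second case, which is what the sqadd_lane1_sqdmull4s and sqsub_lane1_sqdmull4s tests check.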

Oops, I was about to reply that such test cases already exist, but I think that I misunderstood what you meant. I added the test cases that you wanted if I understand you correctly now.

The input of the moved pattern has indeed just been reformatted, to follow the indentation style that the other DAGs in the multiclass seemed to follow, although different indentation styles are used throughout the entire file. Only the output has changed.

Harbormaster completed remote builds in B181358: Diff 452780. (Aug 15 2022, 1:29 PM)

Sounds good thanks. Those new tests look good.

LGTM

This revision is now accepted and ready to land. (Aug 16 2022, 2:53 AM)

Do you have commit access, or do you want us to commit this for you? If so, can you provide a "name <email@etc.com>" to attribute the patch to?

I do not have commit access. Please commit this as "OverMighty <its.overmighty@gmail.com>". Thank you. :)

overmighty edited the summary of this revision. (Aug 16 2022, 6:04 AM)

Closed by commit rG232953f9962d: [AArch64] Add pattern for SQDML*Lv1i32_indexed (authored by overmighty, committed by dmgreen). (Aug 17 2022, 4:00 AM)

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG232953f9962d: [AArch64] Add pattern for SQDML*Lv1i32_indexed.

Diff 453253

llvm/lib/Target/AArch64/AArch64InstrFormats.td

[8,840 lines not shown; enclosing context: [(set (v4i32 V128:$dst), ]

                          (v4i16 V64:$Rn),
                          (v4i16 (AArch64duplane16 (v8i16 V128_lo:$Rm),
                                                   VectorIndexH:$idx))))))]> {
     bits<3> idx;
     let Inst{11} = idx{2};
     let Inst{21} = idx{1};
     let Inst{20} = idx{0};
   }

-  // FIXME: it would be nice to use the scalar (v1i32) instruction here, but an
-  // intermediate EXTRACT_SUBREG would be untyped.
-  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
-                 (i32 (vector_extract (v4i32
-                     (int_aarch64_neon_sqdmull (v4i16 V64:$Rn),
-                         (v4i16 (AArch64duplane16 (v8i16 V128_lo:$Rm),
-                             VectorIndexH:$idx)))),
-                     (i64 0))))),
-            (EXTRACT_SUBREG
-                (!cast<Instruction>(NAME # v4i16_indexed)
-                    (SUBREG_TO_REG (i32 0), FPR32Op:$Rd, ssub), V64:$Rn,
-                    V128_lo:$Rm, VectorIndexH:$idx),
-                ssub)>;
-
   def v8i16_indexed : BaseSIMDIndexedTied<1, U, 0, 0b01, opc,
                                           V128, V128,
                                           V128_lo, VectorIndexH,
                                           asm#"2", ".4s", ".4s", ".8h", ".h",
     [(set (v4i32 V128:$dst),
           (Accum (v4i32 V128:$Rd),
                  (v4i32 (int_aarch64_neon_sqdmull
                             (extract_high_v8i16 (v8i16 V128:$Rn)),
                             (extract_high_dup_v8i16 (v8i16 V128_lo:$Rm), VectorIndexH:$idx)))))]> {

[29 lines not shown; enclosing context: [(set (v2i64 V128:$dst), ]

                             (extract_high_dup_v4i32 (v4i32 V128:$Rm), VectorIndexS:$idx)))))]> {
     bits<2> idx;
     let Inst{11} = idx{1};
     let Inst{21} = idx{0};
   }

   def v1i32_indexed : BaseSIMDIndexedTied<1, U, 1, 0b01, opc,
                                           FPR32Op, FPR16Op, V128_lo, VectorIndexH,
                                           asm, ".h", "", "", ".h", []> {
     bits<3> idx;
     let Inst{11} = idx{2};
     let Inst{21} = idx{1};
     let Inst{20} = idx{0};
   }

+  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
+                        (i32 (vector_extract
+                                 (v4i32 (int_aarch64_neon_sqdmull
+                                            (v4i16 V64:$Rn),
+                                            (v4i16 V64:$Rm))),
+                                 (i64 0))))),
+            (!cast<Instruction>(NAME # v1i32_indexed)
+                FPR32Op:$Rd,
+                (EXTRACT_SUBREG V64:$Rn, hsub),
+                (INSERT_SUBREG (IMPLICIT_DEF), V64:$Rm, dsub),
+                (i64 0))>;
+
+  def : Pat<(i32 (Accum (i32 FPR32Op:$Rd),
+                        (i32 (vector_extract
+                                 (v4i32 (int_aarch64_neon_sqdmull
+                                            (v4i16 V64:$Rn),
+                                            (v4i16 (AArch64duplane16
+                                                       (v8i16 V128_lo:$Rm),
+                                                       VectorIndexH:$idx)))),
+                                 (i64 0))))),
+            (!cast<Instruction>(NAME # v1i32_indexed)
+                FPR32Op:$Rd,
+                (EXTRACT_SUBREG V64:$Rn, hsub),
+                V128_lo:$Rm,
+                VectorIndexH:$idx)>;
+
   def v1i64_indexed : BaseSIMDIndexedTied<1, U, 1, 0b10, opc,
                                           FPR64Op, FPR32Op, V128, VectorIndexS,
                                           asm, ".s", "", "", ".s",
     [(set (i64 FPR64Op:$dst),
           (Accum (i64 FPR64Op:$Rd),
                  (i64 (int_aarch64_neon_sqdmulls_scalar
                           (i32 FPR32Op:$Rn),

[2,647 lines not shown]

llvm/test/CodeGen/AArch64/arm64-vmul.ll

[1,613 lines not shown]

 }

 define i32 @sqdmlal_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
 ; CHECK-LABEL: sqdmlal_lane_1s:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fmov s1, w1
 ; CHECK-NEXT: fmov s2, w0
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT: sqdmlal.4s v2, v1, v0[1]
+; CHECK-NEXT: sqdmlal.h s2, h1, v0[1]
 ; CHECK-NEXT: fmov w0, s2
 ; CHECK-NEXT: ret
   %lhs = insertelement <4 x i16> undef, i16 %B, i32 0
   %rhs = shufflevector <4 x i16> %C, <4 x i16> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %lhs, <4 x i16> %rhs)
   %prod = extractelement <4 x i32> %prod.vec, i32 0
   %res = call i32 @llvm.aarch64.neon.sqadd.i32(i32 %A, i32 %prod)
   ret i32 %res
 }
 declare i32 @llvm.aarch64.neon.sqadd.i32(i32, i32)

 define i32 @sqdmlsl_lane_1s(i32 %A, i16 %B, <4 x i16> %C) nounwind {
 ; CHECK-LABEL: sqdmlsl_lane_1s:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fmov s1, w1
 ; CHECK-NEXT: fmov s2, w0
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
-; CHECK-NEXT: sqdmlsl.4s v2, v1, v0[1]
+; CHECK-NEXT: sqdmlsl.h s2, h1, v0[1]
 ; CHECK-NEXT: fmov w0, s2
 ; CHECK-NEXT: ret
   %lhs = insertelement <4 x i16> undef, i16 %B, i32 0
   %rhs = shufflevector <4 x i16> %C, <4 x i16> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
   %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %lhs, <4 x i16> %rhs)
   %prod = extractelement <4 x i32> %prod.vec, i32 0
   %res = call i32 @llvm.aarch64.neon.sqsub.i32(i32 %A, i32 %prod)
   ret i32 %res
 }
 declare i32 @llvm.aarch64.neon.sqsub.i32(i32, i32)

+define i32 @sqadd_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
+; CHECK-LABEL: sqadd_lane1_sqdmull4s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: sqdmull.4s v0, v0, v1
+; CHECK-NEXT: fmov s1, w0
+; CHECK-NEXT: mov.s w8, v0[1]
+; CHECK-NEXT: fmov s0, w8
+; CHECK-NEXT: sqadd s0, s1, s0
+; CHECK-NEXT: fmov w0, s0
+; CHECK-NEXT: ret
+  %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %B, <4 x i16> %C)
+  %prod = extractelement <4 x i32> %prod.vec, i32 1
+  %res = call i32 @llvm.aarch64.neon.sqadd.i32(i32 %A, i32 %prod)
+  ret i32 %res
+}
+
+define i32 @sqsub_lane1_sqdmull4s(i32 %A, <4 x i16> %B, <4 x i16> %C) nounwind {
+; CHECK-LABEL: sqsub_lane1_sqdmull4s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: sqdmull.4s v0, v0, v1
+; CHECK-NEXT: fmov s1, w0
+; CHECK-NEXT: mov.s w8, v0[1]
+; CHECK-NEXT: fmov s0, w8
+; CHECK-NEXT: sqsub s0, s1, s0
+; CHECK-NEXT: fmov w0, s0
+; CHECK-NEXT: ret
+  %prod.vec = call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %B, <4 x i16> %C)
+  %prod = extractelement <4 x i32> %prod.vec, i32 1
+  %res = call i32 @llvm.aarch64.neon.sqsub.i32(i32 %A, i32 %prod)
+  ret i32 %res
+}
+
 define i64 @sqdmlal_lane_1d(i64 %A, i32 %B, <2 x i32> %C) nounwind {
 ; CHECK-LABEL: sqdmlal_lane_1d:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fmov s1, w1
 ; CHECK-NEXT: fmov d2, x0
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
 ; CHECK-NEXT: sqdmlal.s d2, s1, v0[1]
 ; CHECK-NEXT: fmov x0, d2

[1,229 lines not shown]

 ; CHECK-LABEL: test_fdiv_v1f64:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fdiv d0, d0, d1
 ; CHECK-NEXT: ret
   %prod = fdiv <1 x double> %L, %R
   ret <1 x double> %prod
 }

+define i32 @sqdmlal_s(i16 %A, i16 %B, i32 %C) nounwind {
+; CHECK-LABEL: sqdmlal_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: fmov s0, w1
+; CHECK-NEXT: fmov s1, w0
+; CHECK-NEXT: fmov s2, w2
+; CHECK-NEXT: sqdmlal.h s2, h1, v0[0]
+; CHECK-NEXT: fmov w0, s2
+; CHECK-NEXT: ret
+  %tmp1 = insertelement <4 x i16> undef, i16 %A, i64 0
+  %tmp2 = insertelement <4 x i16> undef, i16 %B, i64 0
+  %tmp3 = tail call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %tmp1, <4 x i16> %tmp2)
+  %tmp4 = extractelement <4 x i32> %tmp3, i64 0
+  %tmp5 = tail call i32 @llvm.aarch64.neon.sqadd.i32(i32 %C, i32 %tmp4)
+  ret i32 %tmp5
+}
+
 define i64 @sqdmlal_d(i32 %A, i32 %B, i64 %C) nounwind {
 ; CHECK-LABEL: sqdmlal_d:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fmov d0, x2
 ; CHECK-NEXT: fmov s1, w0
 ; CHECK-NEXT: fmov s2, w1
 ; CHECK-NEXT: sqdmlal d0, s1, s2
 ; CHECK-NEXT: fmov x0, d0
 ; CHECK-NEXT: ret
   %tmp4 = call i64 @llvm.aarch64.neon.sqdmulls.scalar(i32 %A, i32 %B)
   %tmp5 = call i64 @llvm.aarch64.neon.sqadd.i64(i64 %C, i64 %tmp4)
   ret i64 %tmp5
 }

+define i32 @sqdmlsl_s(i16 %A, i16 %B, i32 %C) nounwind {
+; CHECK-LABEL: sqdmlsl_s:
+; CHECK: // %bb.0:
+; CHECK-NEXT: fmov s0, w1
+; CHECK-NEXT: fmov s1, w0
+; CHECK-NEXT: fmov s2, w2
+; CHECK-NEXT: sqdmlsl.h s2, h1, v0[0]
+; CHECK-NEXT: fmov w0, s2
+; CHECK-NEXT: ret
+  %tmp1 = insertelement <4 x i16> undef, i16 %A, i64 0
+  %tmp2 = insertelement <4 x i16> undef, i16 %B, i64 0
+  %tmp3 = tail call <4 x i32> @llvm.aarch64.neon.sqdmull.v4i32(<4 x i16> %tmp1, <4 x i16> %tmp2)
+  %tmp4 = extractelement <4 x i32> %tmp3, i64 0
+  %tmp5 = tail call i32 @llvm.aarch64.neon.sqsub.i32(i32 %C, i32 %tmp4)
+  ret i32 %tmp5
+}
+
 define i64 @sqdmlsl_d(i32 %A, i32 %B, i64 %C) nounwind {
 ; CHECK-LABEL: sqdmlsl_d:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: fmov d0, x2
 ; CHECK-NEXT: fmov s1, w0
 ; CHECK-NEXT: fmov s2, w1
 ; CHECK-NEXT: sqdmlsl d0, s1, s2
 ; CHECK-NEXT: fmov x0, d0

[43 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add pattern for SQDML*Lv1i32_indexed
ClosedPublic
