If the shift amount is a constant splat vector (all lanes equal),
we can simplify aarch64_neon_sshl to VSHL.
This pattern can be generated by code that uses vshlq_s32(a, vdupq_n_s32(n))
instead of vshlq_n_s32(a, n) in contexts where, before inlining, n is
not guaranteed to be constant.
We can do a similar combine for aarch64_neon_ushl, but we have to be
a bit more careful: we can only match ushll/ushll2 for vector
shifts whose first operand is zero-extended.
Also adds 2 tests marked with FIXME, where codegen can be improved
further.
I did not see any negative tests for the case where we zero-extend to a type other than the next wider one.