Details

Reviewers

majnemer
t.p.northover
javed.absar
spatel
efriedma

Commits

rG52457d33b23c: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector
rL333550: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector

Summary

Turning a table lookup intrinsic into a shuffle vector instruction can be beneficial. If the mask used for the lookup is the constant vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64 instructions instead.
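The equivalence the summary relies on can be checked with a small scalar model. This is only a sketch, not LLVM code: `Vec8`, `lookup8` and `reverseBytes8` are hypothetical names, and the lookup is modelled on an 8-byte table in which an out-of-range index yields 0.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 8-lane byte vector used by this model.
using Vec8 = std::array<std::uint8_t, 8>;

// Scalar model of a single-register table lookup over an 8-byte table:
// result[i] = table[mask[i]] for in-range indices, 0 otherwise.
Vec8 lookup8(const Vec8 &table, const Vec8 &mask) {
  Vec8 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = mask[i] < table.size() ? table[mask[i]] : 0;
  return result;
}

// Plain byte reversal of a 64-bit vector, the effect attributed to rev64.
Vec8 reverseBytes8(const Vec8 &v) {
  Vec8 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = v[v.size() - 1 - i];
  return result;
}
```

With the mask {7,6,5,4,3,2,1,0}, `lookup8` and `reverseBytes8` return the same vector for any table, which is why a dedicated byte-reverse instruction can stand in for the lookup.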

Diff Detail

Repository: rL LLVM

Event Timeline

labrinea created this revision. Apr 26 2018, 9:51 AM

Herald added a reviewer: javed.absar. Apr 26 2018, 9:51 AM

Herald added subscribers: chrib, kristof.beyls, rengolin.

The constant folding of the vld1 intrinsic might be worth moving to lib/Analysis/ConstantFolding.cpp as a separate patch, but I wasn't sure whether we always want to fold vld1, even when it's not used as a table lookup mask. I've asked about this on llvm-dev.

lebedev.ri edited reviewers, added: spatel; removed: llvm-commits. Apr 26 2018, 9:58 AM

efriedma added a subscriber: efriedma. Apr 26 2018, 12:50 PM

efriedma added inline comments.

lib/Transforms/InstCombine/InstCombineCalls.cpp
951 ↗	(On Diff #144141)	Please use llvm::ConstantFoldLoadFromConstPtr.
974 ↗	(On Diff #144141)	Why does the mask pattern matter?

javed.absar added inline comments. Apr 26 2018, 1:31 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
938 ↗	(On Diff #144141)	Should we assert here or simply return nullptr?
test/Transforms/InstCombine/AArch64/table-lookup.ll
2 ↗	(On Diff #144141)	It would be nice to add a comment here to explain the purpose of this test.

labrinea added inline comments. Apr 27 2018, 1:55 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
938 ↗	(On Diff #144141)	I followed the same practice as other conversion functions of this file. We call simplifyTableLookup only when handling the Arm/AArch64 tbl1 intrinsics, so we certainly know that NumElts is 8. The assertion makes sure we don't call this routine from another context.
951 ↗	(On Diff #144141)	This won't work. ConstantFoldLoadFromConstPtr first tries ConstantFoldLoadThroughGEPConstantExpr. This returns just the first element of the [8xi8] array, since vld1 expects i8, which is obtained by i8 getelementptr inbounds ([8 x i8], [8 x i8]* @big_endian_mask, i32 0, i32 0) in my reproducer.
974 ↗	(On Diff #144141)	Turning a table lookup with constant mask into a shuffle vector is not always beneficial. This particular pattern allows the back-end to generate byte reverse instructions instead of a table lookup, which is better. Generalising the transformation for every pattern probably doesn't hurt but it's not beneficial either. However, applying the transformation for larger table lookups (tbl2,tbl3,tbl4) results in worse codegen from the back-end.
test/Transforms/InstCombine/AArch64/table-lookup.ll
2 ↗	(On Diff #144141)	Sure, thanks.

efriedma added inline comments. Apr 27 2018, 12:30 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
951 ↗	(On Diff #144141)	`ConstantFoldLoadThroughGEPConstantExpr(ConstantExpr::getBitCast(C, VecTy.getPointerTo())` or something like that should work.
974 ↗	(On Diff #144141)	Sure, it's not always beneficial, but it's likely we can perform some useful transform on the resulting shuffle, and worst-case the backend just turns the shuffle back into a vtbl. And I don't really want to have code here to try to figure out whether a given shuffle is legal; that gets really complicated in general.

@efriedma I am going to create a new revision for converting a vld1 into an llvm load. I'll then update this revision to keep only the tbl1~>shufflevector conversion. I'll also make the patch accept any constant mask pattern.

I've moved the constant folding to a new revision: https://reviews.llvm.org/D46273. I've also added comments to the tests explaining the reason for this transformation.

Please add a testcase for llvm.aarch64.neon.tbl1.v16i8.

Good catch. The current patch hits the assertion when handling the llvm.aarch64.neon.tbl1.v16i8 intrinsic, because NumElts is 16. Does it make sense to perform the transformation in this case? I could get rid of the assert and bail out of the optimization if NumElts is neither 8 nor 16 (or simply not 8).

labrinea updated this revision to Diff 144867. May 2 2018, 6:13 AM

@efriedma, ping.

Ping?

efriedma added inline comments. May 18 2018, 1:40 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1463 ↗	(On Diff #144867)	I guess you could also check that the vector element type is i8? It should be for code generated by clang, but the verifier doesn't actually enforce it, and this code can crash if it isn't.
1479 ↗	(On Diff #144867)	getLimitedValue() instead of getZExtValue()? Should be the same thing in valid cases, and avoids a potential crash on invalid input.
1480 ↗	(On Diff #144867)	You need special handling if Index is greater than or equal to NumElts*2. (vtbl produces 0, shufflevector produces undef.)
1483 ↗	(On Diff #144867)	You could simplify this code a little by using ConstantDataVector::get instead of ConstantVector::get.
test/Transforms/InstCombine/AArch64/tbl1.ll
23 ↗	(On Diff #144867)	Please add some testcases where the transform bails out.
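One way to read the comment on line 1480 is through a scalar model of both operations. The sketch below is an assumption-laden illustration, not LLVM code: `vtbl1Model` and `shuffle8Model` are hypothetical names, the table is taken to be 8 bytes wide, and the shuffle's second input is taken to be all zeros.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec8 = std::array<std::uint8_t, 8>;

// Table lookup model: any index outside the 8-byte table yields 0.
Vec8 vtbl1Model(const Vec8 &table, const Vec8 &mask) {
  Vec8 result{};
  for (std::size_t i = 0; i < result.size(); ++i)
    result[i] = mask[i] < table.size() ? table[mask[i]] : 0;
  return result;
}

// Two-input shuffle model: indices 0..7 select from v1, 8..15 from v2.
// Returns false when some index is >= 16, because the model has no
// defined lane for it; `out` should be ignored in that case.
bool shuffle8Model(const Vec8 &v1, const Vec8 &v2, const Vec8 &mask,
                   Vec8 &out) {
  for (std::size_t i = 0; i < out.size(); ++i) {
    const std::size_t idx = mask[i];
    if (idx >= 2 * v1.size())
      return false;
    out[i] = idx < v1.size() ? v1[idx] : v2[idx - v1.size()];
  }
  return true;
}
```

In this model, an index in [8, 16) reads a zero from the second input and so agrees with the lookup, while an index of 16 or more has a defined result (0) only on the lookup side.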

Restrict the optimization to table lookups returning a <8 x i8> vector type. It's only beneficial for the mask {7,6,5,4,3,2,1,0} anyway.
Bail out if the constant mask contains an index out of range. The second argument of the new shufflevector is always going to be <undef> so the range is [0 ~ NumElts-1], not [0 ~ 2*NumElts-1], like @efriedma noticed.
Replaced getZExtValue() with getLimitedValue() for getting the value of a ConstantInt.
Simplified the retrieval of mask indices by using ConstantDataVector::get instead of ConstantVector::get.
Added testcases where the transform bails out.
Autogenerated the FileCheck patterns using the script utils/update_test_checks.py.
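The range check described in this update can be sketched without the LLVM API. Everything here is hypothetical except the lane count of 8 and the rule that an index outside [0, NumElts-1] causes a bail-out; `kNumElts` and `buildShuffleMask` are names invented for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Lane count assumed by this model, matching the <8 x i8> restriction.
constexpr std::size_t kNumElts = 8;

// Copies the lookup mask into shuffleMask and returns true when every
// index is in [0, kNumElts). Returns false (the bail-out case) otherwise;
// shuffleMask may then be partially written and should be ignored.
bool buildShuffleMask(const std::array<std::uint64_t, kNumElts> &lookupMask,
                      std::array<std::uint32_t, kNumElts> &shuffleMask) {
  for (std::size_t i = 0; i < kNumElts; ++i) {
    if (lookupMask[i] >= kNumElts)
      return false;
    shuffleMask[i] = static_cast<std::uint32_t>(lookupMask[i]);
  }
  return true;
}
```

The two masks used in the tests of this revision behave as expected under this model: {7,6,5,4,3,2,1,0} is accepted, and {8,6,5,4,3,2,1,0} is rejected because of its first index.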

LGTM

This revision is now accepted and ready to land. May 23 2018, 11:51 AM

Closed by commit rL333550: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector (authored by alelab01). May 30 2018, 7:42 AM

This revision was automatically updated to reflect the committed changes.

Diff 149118

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

[lines not shown]
static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
[lines not shown]
  APFloat::cmpResult Cmp1 = Max3.compare(Src1);
  assert(Cmp1 != APFloat::cmpUnordered && "nans handled separately");
  if (Cmp1 == APFloat::cmpEqual)
    return maxnum(Src0, Src2);

  return maxnum(Src0, Src1);
}

		/// Convert a table lookup to shufflevector if the mask is constant.
		/// This could benefit tbl1 if the mask is { 7,6,5,4,3,2,1,0 }, in
		/// which case we could lower the shufflevector with rev64 instructions
		/// as it's actually a byte reverse.
		static Value *simplifyNeonTbl1(const IntrinsicInst &II,
		                               InstCombiner::BuilderTy &Builder) {
		  // Bail out if the mask is not a constant.
		  auto *C = dyn_cast<Constant>(II.getArgOperand(1));
		  if (!C)
		    return nullptr;

		  auto *VecTy = cast<VectorType>(II.getType());
		  unsigned NumElts = VecTy->getNumElements();

		  // Only perform this transformation for <8 x i8> vector types.
		  if (!VecTy->getElementType()->isIntegerTy(8) || NumElts != 8)
		    return nullptr;

		  uint32_t Indexes[8];

		  for (unsigned I = 0; I < NumElts; ++I) {
		    Constant *COp = C->getAggregateElement(I);

		    if (!COp || !isa<ConstantInt>(COp))
		      return nullptr;

		    Indexes[I] = cast<ConstantInt>(COp)->getLimitedValue();

		    // Make sure the mask indices are in range.
		    if (Indexes[I] >= NumElts)
		      return nullptr;
		  }

		  auto *ShuffleMask = ConstantDataVector::get(II.getContext(),
		                                              makeArrayRef(Indexes));
		  auto *V1 = II.getArgOperand(0);
		  auto *V2 = Constant::getNullValue(V1->getType());
		  return Builder.CreateShuffleVector(V1, V2, ShuffleMask);
		}

// Returns true iff the 2 intrinsics have the same operands, limiting the
// comparison to the first NumOperands.
static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
                             unsigned NumOperands) {
  assert(I.getNumArgOperands() >= NumOperands && "Not enough operands");
  assert(E.getNumArgOperands() >= NumOperands && "Not enough operands");
  for (unsigned i = 0; i < NumOperands; i++)
    if (I.getArgOperand(i) != E.getArgOperand(i))
[lines not shown]
    if (IntrAlign && IntrAlign->getZExtValue() < MemAlign) {
      II->setArgOperand(AlignArg,
                        ConstantInt::get(Type::getInt32Ty(II->getContext()),
                                         MemAlign, false));
      return II;
    }
    break;
  }

		  case Intrinsic::arm_neon_vtbl1:
		  case Intrinsic::aarch64_neon_tbl1:
		    if (Value *V = simplifyNeonTbl1(*II, Builder))
		      return replaceInstUsesWith(*II, V);
		    break;

  case Intrinsic::arm_neon_vmulls:
  case Intrinsic::arm_neon_vmullu:
  case Intrinsic::aarch64_neon_smull:
  case Intrinsic::aarch64_neon_umull: {
    Value *Arg0 = II->getArgOperand(0);
    Value *Arg1 = II->getArgOperand(1);

    // Handle mul by zero first:
[lines not shown]

llvm/trunk/test/Transforms/InstCombine/AArch64/tbl1.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -instcombine -S | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-arm-none-eabi"

				; Turning a table lookup intrinsic into a shuffle vector instruction
				; can be beneficial. If the mask used for the lookup is the constant
				; vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64
				; instructions instead.

				define <8 x i8> @tbl1_8x8(<16 x i8> %vec) {
				; CHECK-LABEL: @tbl1_8x8(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <16 x i8> [[VEC:%.*]], <16 x i8> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: ret <8 x i8> [[TMP0]]
				;
				entry:
				%tbl1 = call <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8> %vec, <8 x i8> <i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				ret <8 x i8> %tbl1
				}

				; Bail the optimization if a mask index is out of range.
				define <8 x i8> @tbl1_8x8_out_of_range(<16 x i8> %vec) {
				; CHECK-LABEL: @tbl1_8x8_out_of_range(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TBL1:%.*]] = call <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8> [[VEC:%.*]], <8 x i8> <i8 8, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				; CHECK-NEXT: ret <8 x i8> [[TBL1]]
				;
				entry:
				%tbl1 = call <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8> %vec, <8 x i8> <i8 8, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				ret <8 x i8> %tbl1
				}

				; Bail the optimization if the size of the return vector is not 8 elements.
				define <16 x i8> @tbl1_16x8(<16 x i8> %vec) {
				; CHECK-LABEL: @tbl1_16x8(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TBL1:%.*]] = call <16 x i8> @llvm.aarch64.neon.tbl1.v16i8(<16 x i8> [[VEC:%.*]], <16 x i8> <i8 15, i8 14, i8 13, i8 12, i8 11, i8 10, i8 9, i8 8, i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				; CHECK-NEXT: ret <16 x i8> [[TBL1]]
				;
				entry:
				%tbl1 = call <16 x i8> @llvm.aarch64.neon.tbl1.v16i8(<16 x i8> %vec, <16 x i8> <i8 15, i8 14, i8 13, i8 12, i8 11, i8 10, i8 9, i8 8, i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				ret <16 x i8> %tbl1
				}

				; Bail the optimization if the elements of the return vector are not of type i8.
				define <8 x i16> @tbl1_8x16(<16 x i8> %vec) {
				; CHECK-LABEL: @tbl1_8x16(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TBL1:%.*]] = call <8 x i16> @llvm.aarch64.neon.tbl1.v8i16(<16 x i8> [[VEC:%.*]], <8 x i16> <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7>)
				; CHECK-NEXT: ret <8 x i16> [[TBL1]]
				;
				entry:
				%tbl1 = call <8 x i16> @llvm.aarch64.neon.tbl1.v8i16(<16 x i8> %vec, <8 x i16> <i16 0, i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7>)
				ret <8 x i16> %tbl1
				}

				; The type <8 x i16> is not a valid return type for this intrinsic,
				; but we want to test that the optimization won't trigger for vector
				; elements of type different than i8.
				declare <8 x i16> @llvm.aarch64.neon.tbl1.v8i16(<16 x i8>, <8 x i16>)

				declare <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8>, <8 x i8>)
				declare <16 x i8> @llvm.aarch64.neon.tbl1.v16i8(<16 x i8>, <16 x i8>)

llvm/trunk/test/Transforms/InstCombine/ARM/tbl1.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -instcombine -S | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "armv8-arm-none-eabi"

				; Turning a table lookup intrinsic into a shuffle vector instruction
				; can be beneficial. If the mask used for the lookup is the constant
				; vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64
				; instructions instead.

				define <8 x i8> @tbl1_8x8(<8 x i8> %vec) {
				; CHECK-LABEL: @tbl1_8x8(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[TMP0:%.*]] = shufflevector <8 x i8> [[VEC:%.*]], <8 x i8> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
				; CHECK-NEXT: ret <8 x i8> [[TMP0]]
				;
				entry:
				%vtbl1 = call <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8> %vec, <8 x i8> <i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				ret <8 x i8> %vtbl1
				}

				; Bail the optimization if a mask index is out of range.
				define <8 x i8> @tbl1_8x8_out_of_range(<8 x i8> %vec) {
				; CHECK-LABEL: @tbl1_8x8_out_of_range(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[VTBL1:%.*]] = call <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8> [[VEC:%.*]], <8 x i8> <i8 8, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				; CHECK-NEXT: ret <8 x i8> [[VTBL1]]
				;
				entry:
				%vtbl1 = call <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8> %vec, <8 x i8> <i8 8, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
				ret <8 x i8> %vtbl1
				}

				declare <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8>, <8 x i8>)

This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine, ARM, AArch64] Convert table lookup to shuffle vector
ClosedPublic
