Details

Reviewers

majnemer
t.p.northover
javed.absar
spatel
efriedma

Commits

rG52457d33b23c: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector
rL333550: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector

Summary

Turning a table lookup intrinsic into a shuffle vector instruction can be beneficial. If the mask used for the lookup is the constant vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64 instructions instead.
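To make the claim concrete, the lookup can be modelled in plain C++. This is only a sketch and not LLVM code: `tbl1` and `reverseLow8` are hypothetical helpers that assume the AArch64 form used in the test below (a 16-byte table, an 8-byte mask and result, and a zero result for any out-of-range index).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of tbl1 with a 16-byte table and an 8-byte result:
// each result byte is Table[Mask[I]], or 0 when the index is out of range.
std::array<uint8_t, 8> tbl1(const std::array<uint8_t, 16> &Table,
                            const std::array<uint8_t, 8> &Mask) {
  std::array<uint8_t, 8> Result{};
  for (std::size_t I = 0; I < Result.size(); ++I)
    Result[I] = Mask[I] < Table.size() ? Table[Mask[I]] : 0;
  return Result;
}

// Byte-reverse the low 8 bytes of the table. This is what the lookup
// computes when the mask is {7,6,5,4,3,2,1,0}.
std::array<uint8_t, 8> reverseLow8(const std::array<uint8_t, 16> &Table) {
  std::array<uint8_t, 8> Result{};
  for (std::size_t I = 0; I < Result.size(); ++I)
    Result[I] = Table[7 - I];
  return Result;
}
```

With the mask {7,6,5,4,3,2,1,0}, `tbl1` and `reverseLow8` return the same bytes for any table, which is why a plain byte reverse can stand in for the lookup.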

Diff Detail

Event Timeline

labrinea created this revision. Apr 26 2018, 9:51 AM

Herald added a reviewer: javed.absar. Apr 26 2018, 9:51 AM

Herald added subscribers: chrib, kristof.beyls, rengolin.

The constant folding of the vld1 intrinsic might be worth moving to lib/Analysis/ConstantFolding.cpp as a separate patch, but I wasn't sure whether we always want to fold vld1, even when it's not used as a table lookup mask. I've asked about this on llvm-dev.

lebedev.ri edited reviewers, added: spatel; removed: llvm-commits. Apr 26 2018, 9:58 AM

efriedma added a subscriber: efriedma. Apr 26 2018, 12:50 PM

efriedma added inline comments.

lib/Transforms/InstCombine/InstCombineCalls.cpp
951	Please use llvm::ConstantFoldLoadFromConstPtr.
974	Why does the mask pattern matter?

javed.absar added inline comments. Apr 26 2018, 1:31 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
938	Should we assert here or simply return nullptr?
test/Transforms/InstCombine/AArch64/table-lookup.ll
2	It would be nice to add a comment here to explain the purpose of this test.

labrinea added inline comments. Apr 27 2018, 1:55 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
938	I followed the same practice as other conversion functions of this file. We call simplifyTableLookup only when handling the Arm/AArch64 tbl1 intrinsics, so we certainly know that NumElts is 8. The assertion makes sure we don't call this routine from another context.
951	This won't work. ConstantFoldLoadFromConstPtr first tries ConstantFoldLoadThroughGEPConstantExpr. This returns just the first element of the [8xi8] array, since vld1 expects i8, which is obtained by i8 getelementptr inbounds ([8 x i8], [8 x i8]* @big_endian_mask, i32 0, i32 0) in my reproducer.
974	Turning a table lookup with constant mask into a shuffle vector is not always beneficial. This particular pattern allows the back-end to generate byte reverse instructions instead of a table lookup, which is better. Generalising the transformation for every pattern probably doesn't hurt but it's not beneficial either. However, applying the transformation for larger table lookups (tbl2,tbl3,tbl4) results in worse codegen from the back-end.
test/Transforms/InstCombine/AArch64/table-lookup.ll
2	Sure, thanks.

efriedma added inline comments. Apr 27 2018, 12:30 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
951	`ConstantFoldLoadThroughGEPConstantExpr(ConstantExpr::getBitCast(C, VecTy.getPointerTo())` or something like that should work.
974	Sure, it's not always beneficial, but it's likely we can perform some useful transform on the resulting shuffle, and worst-case the backend just turns the shuffle back into a vtbl. And I don't really want to have code here to try to figure out whether a given shuffle is legal; that gets really complicated in general.

@efriedma I am going to create a new revision for converting a vld1 into an llvm load. I'll then update this revision to keep only the tbl1~>shufflevector conversion. I'll also make the patch accept any constant mask pattern.

I've moved the constant folding to a new revision: https://reviews.llvm.org/D46273. I've also added comments to the tests explaining the reason for this transformation.

Please add a testcase for llvm.aarch64.neon.tbl1.v16i8.

Good catch. The current patch hits the assertion when handling the llvm.aarch64.neon.tbl1.v16i8 intrinsic, because NumElts is 16. Does it make sense to perform the transformation in this case? I could get rid of the assert and bail out of the optimization if NumElts is neither 8 nor 16 (or just not 8).

labrinea updated this revision to Diff 144867. May 2 2018, 6:13 AM

@efriedma, ping.

Ping?

efriedma added inline comments. May 18 2018, 1:40 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1527	I guess you could also check that the vector element type is i8? It should be for code generated by clang, but the verifier doesn't actually enforce it, and this code can crash if it isn't.
1543	getLimitedValue() instead of getZExtValue()? Should be the same thing in valid cases, and avoids a potential crash on invalid input.
1544	You need special handling if Index is greater than or equal to NumElts*2. (vtbl produces 0, shufflevector produces undef.)
1547	You could simplify this code a little by using ConstantDataVector::get instead of ConstantVector::get.
test/Transforms/InstCombine/AArch64/tbl1.ll
23	(On Diff #144867)	Please add some testcases where the transform bails out.
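The out-of-range point in the comment on line 1544 can be illustrated with a small model. The sketch below is plain C++, not LLVM code, and assumes the ARM vtbl1 form with an 8-byte table; `vtbl1Lane` and `shuffleLane` are hypothetical names, and `std::nullopt` stands in for a lane with no defined value.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr std::size_t NumElts = 8;
using Vec8 = std::array<uint8_t, NumElts>;

// Model of one vtbl1 result lane: an index outside the 8-byte table yields 0.
uint8_t vtbl1Lane(const Vec8 &Table, unsigned Index) {
  return Index < NumElts ? Table[Index] : 0;
}

// Model of one shufflevector result lane with operands V1 and V2.
// Indices [0, NumElts) select from V1 and [NumElts, 2*NumElts) from V2.
// Anything larger has no defined value, modelled here as std::nullopt.
std::optional<uint8_t> shuffleLane(const Vec8 &V1, const Vec8 &V2,
                                   unsigned Index) {
  if (Index < NumElts)
    return V1[Index];
  if (Index < 2 * NumElts)
    return V2[Index - NumElts];
  return std::nullopt;
}
```

With an all-zero second operand, as in Diff 144141, the two models agree for every index below 2*NumElts; from 2*NumElts upwards the lookup still returns 0 while the shuffle lane is undefined, so that case needs separate handling.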

Restrict the optimization to table lookups returning a <8 x i8> vector type. It's only beneficial for the mask {7,6,5,4,3,2,1,0} anyway.
Bail out if the constant mask contains an index out of range. The second argument of the new shufflevector is always going to be <undef> so the range is [0 ~ NumElts-1], not [0 ~ 2*NumElts-1], like @efriedma noticed.
Replaced getZExtValue() with getLimitedValue() for getting the value of a ConstantInt.
Simplified the retrieval of mask indices by using ConstantDataVector::get instead of ConstantVector::get.
Added testcases where the transform bails out.
Autogenerated the FileCheck patterns using the script utils/update_test_checks.py.
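The bail-out rule described in this update can be sketched as a standalone check. This is a plain C++ illustration under stated assumptions, not the committed code: `maskToShuffleIndices` is a hypothetical helper, the mask is given as already-extracted integer values, and returning `std::nullopt` plays the role of returning nullptr from the combine.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper: translate a constant 8-lane lookup mask into
// shufflevector indices, or return std::nullopt to leave the call untouched.
std::optional<std::array<uint32_t, 8>>
maskToShuffleIndices(const std::array<uint64_t, 8> &Mask, uint32_t NumElts) {
  std::array<uint32_t, 8> Indices{};
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    // With an undef second shuffle operand only [0, NumElts-1] is usable,
    // so any larger index makes the whole transform bail out.
    if (Mask[I] >= NumElts)
      return std::nullopt;
    Indices[I] = static_cast<uint32_t>(Mask[I]);
  }
  return Indices;
}
```

Rejecting the whole mask on the first bad index keeps the check simple: no attempt is made to reproduce the zeroing behaviour of the lookup for out-of-range lanes.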

LGTM

This revision is now accepted and ready to land. May 23 2018, 11:51 AM

Closed by commit rL333550: [InstCombine, ARM, AArch64] Convert table lookup to shuffle vector (authored by alelab01). May 30 2018, 7:42 AM

This revision was automatically updated to reflect the committed changes.

Diff 144141

lib/Transforms/InstCombine/InstCombineCalls.cpp

[920 lines hidden]

  if (II.getIntrinsicID() == Intrinsic::x86_sse4a_insertq) {
    Module *M = II.getModule();
    Value *F = Intrinsic::getDeclaration(M, Intrinsic::x86_sse4a_insertqi);
    return Builder.CreateCall(F, Args);
  }

  return nullptr;
}

/// Convert a table lookup to shufflevector if the mask is constant.
/// This can only benefit tbl1 if the mask is { 7,6,5,4,3,2,1,0 }, in
/// which case we could lower the shufflevector with rev64 instructions
/// as it's actually a byte reverse.
static Value *simplifyTableLookup(const IntrinsicInst &II,
                                  const DataLayout &DL,
                                  InstCombiner::BuilderTy &Builder) {
  auto VecTy = cast<VectorType>(II.getType());
  unsigned NumElts = VecTy->getNumElements();
  assert((NumElts == 8) && "Unexpected number of elements in shuffle mask!");

  auto Mask = dyn_cast<Constant>(II.getArgOperand(1));

  // If the mask is coming from a vector load try to fold it.
  if (!Mask) {
    auto Vld1 = dyn_cast<IntrinsicInst>(II.getArgOperand(1));

    if (Vld1 && Vld1->getIntrinsicID() == Intrinsic::arm_neon_vld1) {
      // Strip off any GEP address adjustments and pointer
      // casts from the original object being addressed.
      Value *V = GetUnderlyingObject(Vld1->getArgOperand(0), DL);

      if (auto GV = dyn_cast<GlobalVariable>(V)) {
        if (GV->isConstant() && GV->hasDefinitiveInitializer()) {
          Constant *C = GV->getInitializer();
          SmallVector<Constant *, 8> NewElements;

          for (unsigned I = 0; I < NumElts; ++I) {
            Constant *Elt = C->getAggregateElement(I);
            if (!Elt)
              return nullptr;
            NewElements.push_back(Elt);
          }
          Mask = ConstantVector::get(NewElements);
        }
      }
    }
  }

  if (!Mask)
    return nullptr;

  Constant *Indexes[8] = {nullptr};
  auto EltTy = Type::getInt32Ty(II.getContext());

  // Check whether the mask matches the pattern { 7,6,5,4,3,2,1,0 }.
  for (unsigned I = 0; I < NumElts; ++I) {
    Constant *COp = Mask->getAggregateElement(I);
    if (!COp || !isa<ConstantInt>(COp))
      return nullptr;

    uint8_t Index = cast<ConstantInt>(COp)->getValue().getZExtValue();
    if (Index != NumElts-1 - I)
      return nullptr;

    Indexes[I] = ConstantInt::get(EltTy, Index);
  }

  auto ShuffleMask = ConstantVector::get(makeArrayRef(Indexes, NumElts));
  auto V1 = II.getArgOperand(0);
  auto V2 = Constant::getNullValue(V1->getType());
  return Builder.CreateShuffleVector(V1, V2, ShuffleMask);
}

/// Attempt to convert pshufb* to shufflevector if the mask is constant.
static Value *simplifyX86pshufb(const IntrinsicInst &II,
                                InstCombiner::BuilderTy &Builder) {
  Constant *V = dyn_cast<Constant>(II.getArgOperand(1));
  if (!V)
    return nullptr;

  auto *VecTy = cast<VectorType>(II.getType());

[518 lines hidden]

// Returns true iff the 2 intrinsics have the same operands, limiting the
// comparison to the first NumOperands.
static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
                             unsigned NumOperands) {
  assert(I.getNumArgOperands() >= NumOperands && "Not enough operands");
  assert(E.getNumArgOperands() >= NumOperands && "Not enough operands");
  for (unsigned i = 0; i < NumOperands; i++)
    if (I.getArgOperand(i) != E.getArgOperand(i))
      return false;
  return true;
}

// Remove trivially empty start/end intrinsic ranges, i.e. a start
// immediately followed by an end (ignoring debuginfo or other
// start/end intrinsics in between). As this handles only the most trivial
// cases, tracking the nesting level is not needed:
//
//   call @llvm.foo.start(i1 0) ; &I
//   call @llvm.foo.start(i1 0)
//   call @llvm.foo.end(i1 0) ; This one will not be skipped: it will be removed
//   call @llvm.foo.end(i1 0)
static bool removeTriviallyEmptyRange(IntrinsicInst &I, unsigned StartID,
                                      unsigned EndID, InstCombiner &IC) {
  assert(I.getIntrinsicID() == StartID &&
         "Start intrinsic does not have expected ID");
  BasicBlock::iterator BI(I), BE(I.getParent()->end());
  for (++BI; BI != BE; ++BI) {
    if (auto *E = dyn_cast<IntrinsicInst>(BI)) {
      if (isa<DbgInfoIntrinsic>(E) || E->getIntrinsicID() == StartID)
        continue;
      if (E->getIntrinsicID() == EndID &&
          haveSameOperands(I, *E, E->getNumArgOperands())) {
        IC.eraseInstFromFunction(*E);
        IC.eraseInstFromFunction(I);
        return true;
      }
    }
[1,498 lines hidden]

    if (IntrAlign && IntrAlign->getZExtValue() < MemAlign) {
      II->setArgOperand(AlignArg,
                        ConstantInt::get(Type::getInt32Ty(II->getContext()),
                                         MemAlign, false));
      return II;
    }
    break;
  }

  case Intrinsic::arm_neon_vtbl1:
  case Intrinsic::aarch64_neon_tbl1:
    if (Value *V = simplifyTableLookup(*II, DL, Builder)) {
      return replaceInstUsesWith(*II, V);
    }
    break;

  case Intrinsic::arm_neon_vmulls:
  case Intrinsic::arm_neon_vmullu:
  case Intrinsic::aarch64_neon_smull:
  case Intrinsic::aarch64_neon_umull: {
    Value *Arg0 = II->getArgOperand(0);
    Value *Arg1 = II->getArgOperand(1);

    // Handle mul by zero first:

[1,333 lines hidden]

test/Transforms/InstCombine/AArch64/table-lookup.ll

This file was added.

; RUN: opt -instcombine -S -o - %s | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-arm-none-eabi"

define <8 x i8> @table_lookup(<16 x i8> %vec) {
entry:
;CHECK-NOT: call <8 x i8> @llvm.aarch64.neon.tbl1.v8i8
;CHECK: shufflevector <16 x i8> %vec, <16 x i8> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %tbl1 = call <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8> %vec, <8 x i8> <i8 7, i8 6, i8 5, i8 4, i8 3, i8 2, i8 1, i8 0>)
  ret <8 x i8> %tbl1
}

declare <8 x i8> @llvm.aarch64.neon.tbl1.v8i8(<16 x i8>, <8 x i8>)

test/Transforms/InstCombine/ARM/table-lookup.ll

This file was added.

; RUN: opt -instcombine -S -o - %s | FileCheck %s

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "armv8-arm-none-eabi"

@big_endian_mask = hidden constant [8 x i8] c"\07\06\05\04\03\02\01\00", align 16

define <8 x i8> @table_lookup(<8 x i8> %vec) {
entry:
;CHECK-NOT: call <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8
;CHECK-NOT: call <8 x i8> @llvm.arm.neon.vtbl1
;CHECK: shufflevector <8 x i8> %vec, <8 x i8> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
  %mask = call <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8* getelementptr inbounds ([8 x i8], [8 x i8]* @big_endian_mask, i32 0, i32 0), i32 16)
  %vtbl1 = call <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8> %vec, <8 x i8> %mask)
  ret <8 x i8> %vtbl1
}

declare <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8*, i32)
declare <8 x i8> @llvm.arm.neon.vtbl1(<8 x i8>, <8 x i8>)

This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine, ARM, AArch64] Convert table lookup to shuffle vector
ClosedPublic
