Details

Reviewers

efriedma
llvm-commits
javed.absar

Commits

rG61f0ba1fcc3f: [InstCombine, ARM] Convert vld1 to llvm load
rL333643: [InstCombine, ARM] Convert vld1 to llvm load

Summary

This patch converts a vector load intrinsic into a simple llvm load instruction. This is beneficial when the underlying object being addressed comes from a constant, since we get constant-folding for free.

Diff Detail

Event Timeline

labrinea created this revision. Apr 30 2018, 10:06 AM

Herald added a reviewer: javed.absar. Apr 30 2018, 10:06 AM

Herald added subscribers: chrib, kristof.beyls.

rogfer01 added a subscriber: rogfer01. Apr 30 2018, 10:15 AM

I think you need an endianness check somewhere?

Nevermind, you don't need an endianness check, I think; you're not changing the number of elements, so it should have the same semantics. (See also https://llvm.org/docs/BigEndianNEON.html .)

javed.absar added inline comments. Apr 30 2018, 2:47 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
935	There is a check of IntrAlign, but then IntrAlign is still used otherwise? Maybe it will be cleaner to not use 'auto'
2991	It might be cleaner to extract MemAlign in the called function

I'd like to run some performance tests before this gets merged; I'll try to get back to you sometime this week.

labrinea added inline comments. May 1 2018, 4:10 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
935	Using auto as it's inferred by the dynamic cast.
2991	I was thinking that I'll have to pass three pointers as parameters to make that happen, which doesn't sound very optimal performance wise.

spatel added a subscriber: spatel. May 1 2018, 2:28 PM

spatel added inline comments.

lib/Transforms/InstCombine/InstCombineCalls.cpp
932	The name makes this look like a generic helper, but it's only applicable to neon, right? Why is it sitting in the middle of a pile of x86 code?
935	LLVM preference is to use 'auto *' for pointers: http://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable
936–937	This doesn't make sense when IntrAlign is not a ConstantInt. Please add a test like this:

    define <2 x i64> @vld1_2x64(i8* %ptr, i32 %x) {
      %vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 %x)
      ret <2 x i64> %vld1
    }

...or if that's not supposed to be allowed, then we don't need a dyn_cast.

The dynamic cast check for the alignment parameter of the intrinsic is necessary. I could bail out of the optimization if it's not constant and let the backend crash later on. There are plenty of intrinsics where some arguments need to be constant, but IR has no way to enforce that; Clang guarantees it. The rest of the suggestions sound sensible. I'll update my patch accordingly.
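The bail-out discussed here can be sketched in isolation. The following standalone C++ model is not the patch's code: `Operand`, `asConstant`, and `canConvertToLoad` are hypothetical names, and the null pointer returned by `asConstant` stands in for the null result of `dyn_cast<ConstantInt>` on a non-constant operand.

```cpp
#include <cstdint>

// Hypothetical stand-in for an intrinsic operand: either a compile-time
// constant or an opaque runtime value (such as `i32 %x` in the test above).
struct Operand {
  bool IsConstant;
  uint64_t Value; // Only meaningful when IsConstant is true.
};

// Models dyn_cast<ConstantInt>: returns a pointer to the constant value,
// or nullptr when the operand is not a constant.
const uint64_t *asConstant(const Operand &Op) {
  return Op.IsConstant ? &Op.Value : nullptr;
}

// Returns true if the call could be rewritten as a plain load, and false
// if the transform bails and leaves the intrinsic call untouched.
bool canConvertToLoad(const Operand &AlignOp) {
  const uint64_t *IntrAlign = asConstant(AlignOp);
  if (!IntrAlign)
    return false; // Non-constant alignment: keep the intrinsic call.
  return true;
}
```

With this model, a constant alignment operand allows the rewrite, while a runtime value such as `%x` makes the transform return early.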

labrinea updated this revision to Diff 144874. May 2 2018, 6:32 AM

spatel added inline comments. May 2 2018, 6:49 AM

test/Transforms/InstCombine/ARM/vld1.ll
2	Please auto-generate the assertions using:

    ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py

The RUN line should redirect from the input file rather than taking a param:

    ; RUN: opt < %s -instcombine -S | FileCheck %s

Changes to the test file:

Made the RUN line redirect the input file to opt.
Added a NOTE to enable autogenerated assertions.

In D46273#1086140, @labrinea wrote:

Added a NOTE to enable autogenerated assertions.

Sorry if this wasn't clear: my suggestion was to actually run the script, not pretend to run the script. You shouldn't need any manually-generated 'CHECK' lines in the test file. Let me know if you have problems/questions using the script.

@spatel, how cool! Thanks for pointing that out. I wasn't aware of that script. I'll update my test shortly.

I've actually used utils/update_test_checks.py to auto-generate FileCheck assertions for the unit test. @spatel, this script is very practical, but I am a bit sceptical about it. We must be very careful when using it. The checks may not impose the desired compiler behaviour when auto-generated.

In D46273#1086481, @labrinea wrote:

I've actually used utils/update_test_checks.py to auto-generate FileCheck assertions for the unit test. @spatel, this script is very practical, but I am a bit sceptical about it. We must be very careful when using it. The checks may not impose the desired compiler behaviour when auto-generated.

Thanks. Can you explain your concern? This is off-topic for the patch itself, so feel free to raise it on llvm-dev if that seems more appropriate.

I have no further comments for the patch, but I'll let @efriedma comment/approve once the perf results are available.

Sure! Thanks again for the review :)

@efriedma, ping. Any perf results on this?

Ping?

Sorry about the delay. I ran some benchmarks internally; it looks like it's performance-neutral, so we're fine there.

test/Transforms/InstCombine/ARM/vld1.ll
25	Alignment on an LLVM IR load instruction is in bytes, not bits.

Using getLimitedValue() instead of getZExtValue() for the ConstantInt representing the memory alignment of the load instruction. Updated the tests: alignment is now expressed in bytes instead of bits.
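The thread does not say why the accessor was changed. For reference, `getLimitedValue(Limit)` clamps: it returns the value if it does not exceed `Limit` and returns `Limit` otherwise, whereas `getZExtValue()` asserts that the value fits in 64 bits. A minimal standalone sketch of the clamping behaviour, with `Wide128` and `limitedValue` as hypothetical names that are not part of LLVM:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical 128-bit unsigned value, split into two 64-bit halves, used
// here only to represent a constant wider than 64 bits.
struct Wide128 {
  uint64_t Hi;
  uint64_t Lo;
};

// Models the clamping behaviour: return the value if it does not exceed
// Limit, otherwise return Limit itself.
uint64_t limitedValue(const Wide128 &V,
                      uint64_t Limit = std::numeric_limits<uint64_t>::max()) {
  if (V.Hi != 0 || V.Lo > Limit)
    return Limit;
  return V.Lo;
}
```

A value that fits is returned unchanged. An oversized one saturates at the limit instead of triggering an assertion.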

efriedma added inline comments. May 23 2018, 12:08 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1467	For safety, probably should verify the alignment is a power of two.

Bail out of the optimization if the memory alignment is not a power of two. Added a test to cover this case.
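The alignment selection and the power-of-two guard can be sketched together. This is a standalone model under stated assumptions, not the committed code: `isPowerOfTwo` and `chooseLoadAlignment` are hypothetical names, a return value of 0 stands for "bail out", and the guard is assumed to apply to the alignment finally chosen for the load, which the diff shown on this page does not confirm.

```cpp
#include <algorithm>
#include <cstdint>

// True when exactly one bit is set; zero is not a power of two.
bool isPowerOfTwo(uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// IntrAlign: the constant alignment operand of the intrinsic.
// MemAlign:  the alignment already known for the pointer.
// Returns the alignment for the replacement load, or 0 to signal that the
// transform should bail and keep the intrinsic call.
uint64_t chooseLoadAlignment(uint64_t IntrAlign, uint64_t MemAlign) {
  // Use the larger of the two, as the ternary in the diff does.
  uint64_t Alignment = std::max(IntrAlign, MemAlign);
  if (!isPowerOfTwo(Alignment))
    return 0; // Assumed guard: reject alignments such as 0 or 3.
  return Alignment;
}
```

In this model an intrinsic alignment of 8 with a known pointer alignment of 4 gives 8, while an alignment of 3 is rejected.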

LGTM

This revision is now accepted and ready to land. May 30 2018, 12:15 PM

Closed by commit rL333643: [InstCombine, ARM] Convert vld1 to llvm load (authored by alelab01). May 31 2018, 5:23 AM

This revision was automatically updated to reflect the committed changes.

Diff 144989

lib/Transforms/InstCombine/InstCombineCalls.cpp

Show First 20 Lines • Show All 923 Lines • ▼ Show 20 Lines	static Value *simplifyX86insertq(IntrinsicInst &II, Value *Op0, Value *Op1,
}		}

return nullptr;		return nullptr;
}		}

/// Attempt to convert pshufb* to shufflevector if the mask is constant.		/// Attempt to convert pshufb* to shufflevector if the mask is constant.
static Value *simplifyX86pshufb(const IntrinsicInst &II,		static Value *simplifyX86pshufb(const IntrinsicInst &II,
InstCombiner::BuilderTy &Builder) {		InstCombiner::BuilderTy &Builder) {
Constant *V = dyn_cast<Constant>(II.getArgOperand(1));		Constant *V = dyn_cast<Constant>(II.getArgOperand(1));
if (!V)		if (!V)
return nullptr;		return nullptr;

auto *VecTy = cast<VectorType>(II.getType());		auto *VecTy = cast<VectorType>(II.getType());
auto *MaskEltTy = Type::getInt32Ty(II.getContext());		auto *MaskEltTy = Type::getInt32Ty(II.getContext());
unsigned NumElts = VecTy->getNumElements();		unsigned NumElts = VecTy->getNumElements();
assert((NumElts == 16 || NumElts == 32 || NumElts == 64) &&		assert((NumElts == 16 || NumElts == 32 || NumElts == 64) &&
"Unexpected number of elements in shuffle mask!");		"Unexpected number of elements in shuffle mask!");

// Construct a shuffle mask from constant integers or UNDEFs.		// Construct a shuffle mask from constant integers or UNDEFs.
Constant *Indexes[64] = {nullptr};		Constant *Indexes[64] = {nullptr};

// Each byte in the shuffle control mask forms an index to permute the		// Each byte in the shuffle control mask forms an index to permute the
▲ Show 20 Lines • Show All 501 Lines • ▼ Show 20 Lines	static APFloat fmed3AMDGCN(const APFloat &Src0, const APFloat &Src1,
APFloat::cmpResult Cmp1 = Max3.compare(Src1);		APFloat::cmpResult Cmp1 = Max3.compare(Src1);
assert(Cmp1 != APFloat::cmpUnordered && "nans handled separately");		assert(Cmp1 != APFloat::cmpUnordered && "nans handled separately");
if (Cmp1 == APFloat::cmpEqual)		if (Cmp1 == APFloat::cmpEqual)
return maxnum(Src0, Src2);		return maxnum(Src0, Src2);

return maxnum(Src0, Src1);		return maxnum(Src0, Src1);
}		}

		/// Convert a vector load intrinsic into a simple llvm load instruction.
		/// This is beneficial when the underlying object being addressed comes
		/// from a constant, since we get constant-folding for free.
		static Value *simplifyNeonVld1(const IntrinsicInst &II,
		unsigned MemAlign,
		InstCombiner::BuilderTy &Builder) {
		auto *IntrAlign = dyn_cast<ConstantInt>(II.getArgOperand(1));

		if (!IntrAlign)
		return nullptr;

		unsigned Alignment = IntrAlign->getZExtValue() < MemAlign ?
		MemAlign : IntrAlign->getZExtValue();

		auto *BCastInst = Builder.CreateBitCast(II.getArgOperand(0),
		PointerType::get(II.getType(), 0));
		return Builder.CreateAlignedLoad(BCastInst, Alignment);
		}

// Returns true iff the 2 intrinsics have the same operands, limiting the		// Returns true iff the 2 intrinsics have the same operands, limiting the
// comparison to the first NumOperands.		// comparison to the first NumOperands.
static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,		static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
unsigned NumOperands) {		unsigned NumOperands) {
assert(I.getNumArgOperands() >= NumOperands && "Not enough operands");		assert(I.getNumArgOperands() >= NumOperands && "Not enough operands");
assert(E.getNumArgOperands() >= NumOperands && "Not enough operands");		assert(E.getNumArgOperands() >= NumOperands && "Not enough operands");
for (unsigned i = 0; i < NumOperands; i++)		for (unsigned i = 0; i < NumOperands; i++)
if (I.getArgOperand(i) != E.getArgOperand(i))		if (I.getArgOperand(i) != E.getArgOperand(i))
▲ Show 20 Lines • Show All 1,500 Lines • ▼ Show 20 Lines	if (Constant *Mask = dyn_cast<Constant>(II->getArgOperand(2))) {
Result = Builder.CreateInsertElement(Result, ExtractedElts[Idx],		Result = Builder.CreateInsertElement(Result, ExtractedElts[Idx],
Builder.getInt32(i));		Builder.getInt32(i));
}		}
return CastInst::Create(Instruction::BitCast, Result, CI.getType());		return CastInst::Create(Instruction::BitCast, Result, CI.getType());
}		}
}		}
break;		break;

case Intrinsic::arm_neon_vld1:		case Intrinsic::arm_neon_vld1: {
		unsigned MemAlign = getKnownAlignment(II->getArgOperand(0),
		DL, II, &AC, &DT);
		if (Value *V = simplifyNeonVld1(*II, MemAlign, Builder))
		return replaceInstUsesWith(*II, V);
		break;
		}

case Intrinsic::arm_neon_vld2:		case Intrinsic::arm_neon_vld2:
case Intrinsic::arm_neon_vld3:		case Intrinsic::arm_neon_vld3:
case Intrinsic::arm_neon_vld4:		case Intrinsic::arm_neon_vld4:
case Intrinsic::arm_neon_vld2lane:		case Intrinsic::arm_neon_vld2lane:
case Intrinsic::arm_neon_vld3lane:		case Intrinsic::arm_neon_vld3lane:
case Intrinsic::arm_neon_vld4lane:		case Intrinsic::arm_neon_vld4lane:
case Intrinsic::arm_neon_vst1:		case Intrinsic::arm_neon_vst1:
case Intrinsic::arm_neon_vst2:		case Intrinsic::arm_neon_vst2:
▲ Show 20 Lines • Show All 1,359 Lines • Show Last 20 Lines

test/Transforms/InstCombine/ARM/vld1.ll

This file was added.

				; RUN: opt < %s -instcombine -S | FileCheck %s
				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "armv8-arm-none-eabi"

				; Turning a vld1 intrinsic into an llvm load is beneficial
				; when the underlying object being addressed comes from a
				; constant, since we get constant-folding for free.

				; Bail the optimization if the alignment is not a constant.
				define <2 x i64> @vld1_align(i8* %ptr, i32 %align) {
				; CHECK-NOT: bitcast i8* %ptr to <2 x i64>*
				; CHECK-NOT: load <2 x i64>, <2 x i64>*
				; CHECK: call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8
				%vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 %align)
				ret <2 x i64> %vld1
				}

				define <8 x i8> @vld1_8x8(i8* %ptr) {
				; CHECK-NOT: call <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <8 x i8>*
				; CHECK: load <8 x i8>, <8 x i8>* [[bcast]], align 8
				%vld1 = call <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8* %ptr, i32 8)
				ret <8 x i8> %vld1
				}

				define <4 x i16> @vld1_4x16(i8* %ptr) {
				; CHECK-NOT: call <4 x i16> @llvm.arm.neon.vld1.v4i16.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <4 x i16>*
				; CHECK: load <4 x i16>, <4 x i16>* [[bcast]], align 16
				%vld1 = call <4 x i16> @llvm.arm.neon.vld1.v4i16.p0i8(i8* %ptr, i32 16)
				ret <4 x i16> %vld1
				}

				define <2 x i32> @vld1_2x32(i8* %ptr) {
				; CHECK-NOT: call <2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <2 x i32>*
				; CHECK: load <2 x i32>, <2 x i32>* [[bcast]], align 32
				%vld1 = call <2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8(i8* %ptr, i32 32)
				ret <2 x i32> %vld1
				}

				define <1 x i64> @vld1_1x64(i8* %ptr) {
				; CHECK-NOT: call <1 x i64> @llvm.arm.neon.vld1.v1i64.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <1 x i64>*
				; CHECK: load <1 x i64>, <1 x i64>* [[bcast]], align 64
				%vld1 = call <1 x i64> @llvm.arm.neon.vld1.v1i64.p0i8(i8* %ptr, i32 64)
				ret <1 x i64> %vld1
				}

				define <8 x i16> @vld1_8x16(i8* %ptr) {
				; CHECK-NOT: call <8 x i16> @llvm.arm.neon.vld1.v8i16.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <8 x i16>*
				; CHECK: load <8 x i16>, <8 x i16>* [[bcast]], align 16
				%vld1 = call <8 x i16> @llvm.arm.neon.vld1.v8i16.p0i8(i8* %ptr, i32 16)
				ret <8 x i16> %vld1
				}

				define <16 x i8> @vld1_16x8(i8* %ptr) {
				; CHECK-NOT: call <16 x i8> @llvm.arm.neon.vld1.v16i8.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <16 x i8>*
				; CHECK: load <16 x i8>, <16 x i8>* [[bcast]], align 8
				%vld1 = call <16 x i8> @llvm.arm.neon.vld1.v16i8.p0i8(i8* %ptr, i32 8)
				ret <16 x i8> %vld1
				}

				define <4 x i32> @vld1_4x32(i8* %ptr) {
				; CHECK-NOT: call <4 x i32> @llvm.arm.neon.vld1.v4i32.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <4 x i32>*
				; CHECK: load <4 x i32>, <4 x i32>* [[bcast]], align 32
				%vld1 = call <4 x i32> @llvm.arm.neon.vld1.v4i32.p0i8(i8* %ptr, i32 32)
				ret <4 x i32> %vld1
				}

				define <2 x i64> @vld1_2x64(i8* %ptr) {
				; CHECK-NOT: call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8
				; CHECK: [[bcast:%.*]] = bitcast i8* %ptr to <2 x i64>*
				; CHECK: load <2 x i64>, <2 x i64>* [[bcast]], align 64
				%vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 64)
				ret <2 x i64> %vld1
				}

				declare <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8*, i32)
				declare <4 x i16> @llvm.arm.neon.vld1.v4i16.p0i8(i8*, i32)
				declare <2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8(i8*, i32)
				declare <1 x i64> @llvm.arm.neon.vld1.v1i64.p0i8(i8*, i32)
				declare <8 x i16> @llvm.arm.neon.vld1.v8i16.p0i8(i8*, i32)
				declare <16 x i8> @llvm.arm.neon.vld1.v16i8.p0i8(i8*, i32)
				declare <4 x i32> @llvm.arm.neon.vld1.v4i32.p0i8(i8*, i32)
				declare <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8*, i32)

This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine, ARM] Convert vld1 to llvm load
ClosedPublic