Details

Reviewers

efriedma
llvm-commits
javed.absar

Commits

rG61f0ba1fcc3f: [InstCombine, ARM] Convert vld1 to llvm load
rL333643: [InstCombine, ARM] Convert vld1 to llvm load

Summary

This patch converts a vector load intrinsic into a simple llvm load instruction. This is beneficial when the underlying object being addressed comes from a constant, since we get constant-folding for free.

Diff Detail

Repository: rL LLVM

Event Timeline

labrinea created this revision. Apr 30 2018, 10:06 AM

Herald added a reviewer: javed.absar. Apr 30 2018, 10:06 AM

Herald added subscribers: chrib, kristof.beyls.

rogfer01 added a subscriber: rogfer01. Apr 30 2018, 10:15 AM

I think you need an endianness check somewhere?

Never mind, you don't need an endianness check, I think; you're not changing the number of elements, so it should have the same semantics. (See also https://llvm.org/docs/BigEndianNEON.html.)

javed.absar added inline comments. Apr 30 2018, 2:47 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
935 ↗	(On Diff #144577)	There is a check of IntrAlign, but then IntrAlign is still used otherwise? Maybe it will be cleaner to not use 'auto'
2987 ↗	(On Diff #144577)	It might be cleaner to extract MemAlign in the called function

I'd like to run some performance tests before this gets merged; I'll try to get back to you sometime this week.

labrinea added inline comments. May 1 2018, 4:10 AM

lib/Transforms/InstCombine/InstCombineCalls.cpp
935 ↗	(On Diff #144577)	Using auto as it's inferred by the dynamic cast.
2987 ↗	(On Diff #144577)	I was thinking that I'll have to pass three pointers as parameters to make that happen, which doesn't sound very optimal performance wise.

spatel added a subscriber: spatel. May 1 2018, 2:28 PM

spatel added inline comments.

lib/Transforms/InstCombine/InstCombineCalls.cpp
932 ↗	(On Diff #144577)	The name makes this look like a generic helper, but it's only applicable to neon, right? Why is it sitting in the middle of a pile of x86 code?
935 ↗	(On Diff #144577)	LLVM preference is to use 'auto *' for pointers: http://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable
936–937 ↗	(On Diff #144577)	This doesn't make sense when IntrAlign is not a ConstantInt. Please add a test like this:

  define <2 x i64> @vld1_2x64(i8* %ptr, i32 %x) {
    %vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 %x)
    ret <2 x i64> %vld1
  }

...or if that's not supposed to be allowed, then we don't need a dyn_cast.

The dynamic cast check for the alignment parameter of the intrinsic is necessary. I could bail out of the optimization if it's not constant and let the backend crash later on. There are plenty of intrinsics where some arguments need to be constant, but IR has no way to enforce that; Clang guarantees it. The rest of the suggestions sound sensible. I'll update my patch accordingly.

labrinea updated this revision to Diff 144874. May 2 2018, 6:32 AM

spatel added inline comments. May 2 2018, 6:49 AM

test/Transforms/InstCombine/ARM/vld1.ll
1 ↗	(On Diff #144874)	Please auto-generate the assertions using:

  ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py

The RUN line should redirect from the input file rather than taking a param:

  ; RUN: opt < %s -instcombine -S | FileCheck %s

Changes to the test file:

Made the RUN line redirect the input file to opt.
Added a NOTE to enable autogenerated assertions.

In D46273#1086140, @labrinea wrote:

Added a NOTE to enable autogenerated assertions.

Sorry if this wasn't clear: my suggestion was to actually run the script, not pretend to run the script. You shouldn't need any manually-generated 'CHECK' lines in the test file. Let me know if you have problems/questions using the script.

@spatel, how cool! Thanks for pointing that out. I wasn't aware of that script. I'll update my test shortly.

I've actually used utils/update_test_checks.py to auto-generate the FileCheck assertions for the unit test. @spatel, this script is very practical, but I am a bit sceptical about it. We must be very careful when using it: auto-generated checks may not enforce the desired compiler behaviour.

In D46273#1086481, @labrinea wrote:

I've actually used utils/update_test_checks.py to auto-generate the FileCheck assertions for the unit test. @spatel, this script is very practical, but I am a bit sceptical about it. We must be very careful when using it: auto-generated checks may not enforce the desired compiler behaviour.

Thanks. Can you explain your concern? This is off-topic for the patch itself, so feel free to raise it on llvm-dev if that seems more appropriate.

I have no further comments for the patch, but I'll let @efriedma comment/approve once the perf results are available.

Sure! Thanks again for the review :)

@efriedma, ping. Any perf results on this?

Ping?

Sorry about the delay. I ran some benchmarks internally; it looks like it's performance-neutral, so we're fine there.

test/Transforms/InstCombine/ARM/vld1.ll
24 ↗	(On Diff #145022)	Alignment on an LLVM IR load instruction is in bytes, not bits.
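To make the bytes-versus-bits point concrete, here is a minimal sketch of the conversion a caller would need if it only had an alignment expressed in bits. The helper name `alignBitsToBytes` is hypothetical and is not part of the patch; the only fact relied on is that the `align N` operand of an LLVM IR load is a byte count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the patch): convert an alignment given in
// bits to the byte alignment that an LLVM IR `load ..., align N` expects.
// Values below one byte are rounded up to 1.
inline uint32_t alignBitsToBytes(uint32_t AlignBits) {
  uint32_t Bytes = AlignBits / 8;
  return Bytes == 0 ? 1 : Bytes;
}
```

Under this sketch, a 64-bit alignment corresponds to `align 8` on the load, not `align 64`.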

Now using getLimitedValue() instead of getZExtValue() for the ConstantInt representing the memory alignment of the load instruction. Updated the tests: alignment is now expressed in bytes instead of bits.
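For readers unfamiliar with the difference: getZExtValue() asserts that the constant fits in 64 bits, while getLimitedValue() saturates at a limit instead. The sketch below models only that saturating contract in plain C++. `WideConst` and `limitedValue` are hypothetical stand-ins for illustration, not LLVM types or functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a constant that may be wider than 64 bits.
// This is not LLVM's APInt, just enough state to show the behaviour.
struct WideConst {
  uint64_t Hi; // bits 64..127
  uint64_t Lo; // bits 0..63
};

// Models the saturating contract: return the value if it is no larger than
// Limit, otherwise return Limit.
inline uint64_t limitedValue(const WideConst &C, uint64_t Limit = UINT64_MAX) {
  if (C.Hi != 0 || C.Lo > Limit)
    return Limit;
  return C.Lo;
}
```

For an i32 alignment operand the value always fits, so in this patch the choice mostly affects robustness rather than results.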

efriedma added inline comments. May 23 2018, 12:08 PM

lib/Transforms/InstCombine/InstCombineCalls.cpp
1402 ↗	(On Diff #147970)	For safety, probably should verify the alignment is a power of two.

Bail out of the optimization if the memory alignment is not a power of two. Added a test to cover this case.
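The alignment decision discussed in this thread can be modelled outside LLVM as follows. This is a sketch under stated assumptions, not the committed code: `chooseLoadAlignment` and `isPowerOfTwo` are hypothetical names, a null pointer stands in for "the alignment operand is not a ConstantInt", and a return value of 0 stands in for "leave the intrinsic call alone".

```cpp
#include <cassert>
#include <cstdint>

// Same condition that a 32-bit power-of-two check tests: true for 1, 2, 4, ...
// and false for 0.
inline bool isPowerOfTwo(uint32_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Hypothetical model of the decision. IntrAlign is null when the intrinsic's
// alignment operand is not a constant. Returns the alignment to put on the
// load, or 0 when the transform should not fire.
inline uint32_t chooseLoadAlignment(const uint32_t *IntrAlign,
                                    uint32_t MemAlign) {
  if (!IntrAlign)
    return 0; // non-constant alignment: keep the intrinsic

  // Use the larger of the requested alignment and the known pointer alignment.
  uint32_t Alignment = *IntrAlign < MemAlign ? MemAlign : *IntrAlign;

  if (!isPowerOfTwo(Alignment))
    return 0; // a load cannot carry a non-power-of-two alignment

  return Alignment;
}
```

Note that the power-of-two check applies to the chosen alignment, so a request of 3 only blocks the transform when the known pointer alignment is smaller than 3.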

LGTM

This revision is now accepted and ready to land. May 30 2018, 12:15 PM

Closed by commit rL333643: [InstCombine, ARM] Convert vld1 to llvm load (authored by alelab01). May 31 2018, 5:23 AM

This revision was automatically updated to reflect the committed changes.

Diff 149266

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

[...] static Value *simplifyNeonTbl1(const IntrinsicInst &II,

   auto *ShuffleMask = ConstantDataVector::get(II.getContext(),
                                               makeArrayRef(Indexes));
   auto *V1 = II.getArgOperand(0);
   auto *V2 = Constant::getNullValue(V1->getType());
   return Builder.CreateShuffleVector(V1, V2, ShuffleMask);
 }

+/// Convert a vector load intrinsic into a simple llvm load instruction.
+/// This is beneficial when the underlying object being addressed comes
+/// from a constant, since we get constant-folding for free.
+static Value *simplifyNeonVld1(const IntrinsicInst &II,
+                               unsigned MemAlign,
+                               InstCombiner::BuilderTy &Builder) {
+  auto *IntrAlign = dyn_cast<ConstantInt>(II.getArgOperand(1));
+
+  if (!IntrAlign)
+    return nullptr;
+
+  unsigned Alignment = IntrAlign->getLimitedValue() < MemAlign ?
+                       MemAlign : IntrAlign->getLimitedValue();
+
+  if (!isPowerOf2_32(Alignment))
+    return nullptr;
+
+  auto *BCastInst = Builder.CreateBitCast(II.getArgOperand(0),
+                                          PointerType::get(II.getType(), 0));
+  return Builder.CreateAlignedLoad(BCastInst, Alignment);
+}

 // Returns true iff the 2 intrinsics have the same operands, limiting the
 // comparison to the first NumOperands.
 static bool haveSameOperands(const IntrinsicInst &I, const IntrinsicInst &E,
                              unsigned NumOperands) {
   assert(I.getNumArgOperands() >= NumOperands && "Not enough operands");
   assert(E.getNumArgOperands() >= NumOperands && "Not enough operands");
   for (unsigned i = 0; i < NumOperands; i++)
     if (I.getArgOperand(i) != E.getArgOperand(i))
[...] if (Constant *Mask = dyn_cast<Constant>(II->getArgOperand(2))) {
           Result = Builder.CreateInsertElement(Result, ExtractedElts[Idx],
                                                Builder.getInt32(i));
         }
         return CastInst::Create(Instruction::BitCast, Result, CI.getType());
       }
     }
     break;

-  case Intrinsic::arm_neon_vld1:
+  case Intrinsic::arm_neon_vld1: {
+    unsigned MemAlign = getKnownAlignment(II->getArgOperand(0),
+                                          DL, II, &AC, &DT);
+    if (Value *V = simplifyNeonVld1(*II, MemAlign, Builder))
+      return replaceInstUsesWith(*II, V);
+    break;
+  }
+
   case Intrinsic::arm_neon_vld2:
   case Intrinsic::arm_neon_vld3:
   case Intrinsic::arm_neon_vld4:
   case Intrinsic::arm_neon_vld2lane:
   case Intrinsic::arm_neon_vld3lane:
   case Intrinsic::arm_neon_vld4lane:
   case Intrinsic::arm_neon_vst1:
   case Intrinsic::arm_neon_vst2:
[...]

llvm/trunk/test/Transforms/InstCombine/ARM/vld1.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -instcombine -S | FileCheck %s

target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "armv8-arm-none-eabi"

; Turning a vld1 intrinsic into an llvm load is beneficial
; when the underlying object being addressed comes from a
; constant, since we get constant-folding for free.

; Bail the optimization if the alignment is not a constant.
define <2 x i64> @vld1_align(i8* %ptr, i32 %align) {
; CHECK-LABEL: @vld1_align(
; CHECK-NEXT:    [[VLD1:%.*]] = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* [[PTR:%.*]], i32 [[ALIGN:%.*]])
; CHECK-NEXT:    ret <2 x i64> [[VLD1]]
;
  %vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 %align)
  ret <2 x i64> %vld1
}

; Bail the optimization if the alignment is not power of 2.
define <2 x i64> @vld1_align_pow2(i8* %ptr) {
; CHECK-LABEL: @vld1_align_pow2(
; CHECK-NEXT:    [[VLD1:%.*]] = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* [[PTR:%.*]], i32 3)
; CHECK-NEXT:    ret <2 x i64> [[VLD1]]
;
  %vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 3)
  ret <2 x i64> %vld1
}

define <8 x i8> @vld1_8x8(i8* %ptr) {
; CHECK-LABEL: @vld1_8x8(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <8 x i8>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <8 x i8>, <8 x i8>* [[TMP1]], align 1
; CHECK-NEXT:    ret <8 x i8> [[TMP2]]
;
  %vld1 = call <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8* %ptr, i32 1)
  ret <8 x i8> %vld1
}

define <4 x i16> @vld1_4x16(i8* %ptr) {
; CHECK-LABEL: @vld1_4x16(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <4 x i16>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2
; CHECK-NEXT:    ret <4 x i16> [[TMP2]]
;
  %vld1 = call <4 x i16> @llvm.arm.neon.vld1.v4i16.p0i8(i8* %ptr, i32 2)
  ret <4 x i16> %vld1
}

define <2 x i32> @vld1_2x32(i8* %ptr) {
; CHECK-LABEL: @vld1_2x32(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <2 x i32>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4
; CHECK-NEXT:    ret <2 x i32> [[TMP2]]
;
  %vld1 = call <2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8(i8* %ptr, i32 4)
  ret <2 x i32> %vld1
}

define <1 x i64> @vld1_1x64(i8* %ptr) {
; CHECK-LABEL: @vld1_1x64(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <1 x i64>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <1 x i64>, <1 x i64>* [[TMP1]], align 8
; CHECK-NEXT:    ret <1 x i64> [[TMP2]]
;
  %vld1 = call <1 x i64> @llvm.arm.neon.vld1.v1i64.p0i8(i8* %ptr, i32 8)
  ret <1 x i64> %vld1
}

define <8 x i16> @vld1_8x16(i8* %ptr) {
; CHECK-LABEL: @vld1_8x16(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <8 x i16>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 2
; CHECK-NEXT:    ret <8 x i16> [[TMP2]]
;
  %vld1 = call <8 x i16> @llvm.arm.neon.vld1.v8i16.p0i8(i8* %ptr, i32 2)
  ret <8 x i16> %vld1
}

define <16 x i8> @vld1_16x8(i8* %ptr) {
; CHECK-LABEL: @vld1_16x8(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <16 x i8>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <16 x i8>, <16 x i8>* [[TMP1]], align 1
; CHECK-NEXT:    ret <16 x i8> [[TMP2]]
;
  %vld1 = call <16 x i8> @llvm.arm.neon.vld1.v16i8.p0i8(i8* %ptr, i32 1)
  ret <16 x i8> %vld1
}

define <4 x i32> @vld1_4x32(i8* %ptr) {
; CHECK-LABEL: @vld1_4x32(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <4 x i32>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
; CHECK-NEXT:    ret <4 x i32> [[TMP2]]
;
  %vld1 = call <4 x i32> @llvm.arm.neon.vld1.v4i32.p0i8(i8* %ptr, i32 4)
  ret <4 x i32> %vld1
}

define <2 x i64> @vld1_2x64(i8* %ptr) {
; CHECK-LABEL: @vld1_2x64(
; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i8* [[PTR:%.*]] to <2 x i64>*
; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 8
; CHECK-NEXT:    ret <2 x i64> [[TMP2]]
;
  %vld1 = call <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8* %ptr, i32 8)
  ret <2 x i64> %vld1
}

declare <8 x i8> @llvm.arm.neon.vld1.v8i8.p0i8(i8*, i32)
declare <4 x i16> @llvm.arm.neon.vld1.v4i16.p0i8(i8*, i32)
declare <2 x i32> @llvm.arm.neon.vld1.v2i32.p0i8(i8*, i32)
declare <1 x i64> @llvm.arm.neon.vld1.v1i64.p0i8(i8*, i32)
declare <8 x i16> @llvm.arm.neon.vld1.v8i16.p0i8(i8*, i32)
declare <16 x i8> @llvm.arm.neon.vld1.v16i8.p0i8(i8*, i32)
declare <4 x i32> @llvm.arm.neon.vld1.v4i32.p0i8(i8*, i32)
declare <2 x i64> @llvm.arm.neon.vld1.v2i64.p0i8(i8*, i32)

This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine, ARM] Convert vld1 to llvm load
ClosedPublic
