
Details

Reviewers

reames
hfinkel
arsenm
jdoerfert
lebedev.ri

Commits

rGbc4a63fd3c29: [InstCombine] try to narrow a truncated load
rL367011: [InstCombine] try to narrow a truncated load

Summary

trunc (load X) --> load (bitcast X to narrow type)

We have this transform in DAGCombiner::ReduceLoadWidth(), but the truncated load pattern can interfere with other instcombine transforms, so I'd like to allow the fold sooner.
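As a minimal sketch of why the fold needs no address offset on little-endian targets only, the following models memory as a byte buffer and compares "wide load then truncate" with "narrow load". The names `loadBytes`, `truncTo32`, and `kMem` are hypothetical helpers for illustration and are not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not LLVM API): read `width` bytes starting at `offset`
// from a byte buffer, interpreting them with the requested endianness.
inline uint64_t loadBytes(const uint8_t *mem, std::size_t size,
                          std::size_t offset, std::size_t width,
                          bool littleEndian) {
  assert(width <= 8 && offset + width <= size && "load out of bounds");
  uint64_t value = 0;
  for (std::size_t i = 0; i < width; ++i) {
    // Little-endian: byte i is the i-th least significant byte.
    // Big-endian: byte i is the i-th most significant byte.
    std::size_t shift = littleEndian ? 8 * i : 8 * (width - 1 - i);
    value |= static_cast<uint64_t>(mem[offset + i]) << shift;
  }
  return value;
}

// Models `trunc i64 %x to i32`: keep the low 32 bits.
inline uint32_t truncTo32(uint64_t wide) { return static_cast<uint32_t>(wide); }

// Example memory contents (arbitrary, chosen so every byte is distinct).
static const uint8_t kMem[8] = {0x11, 0x22, 0x33, 0x44,
                                0x55, 0x66, 0x77, 0x88};
```

With this model, the little-endian narrow load at offset 0 matches the truncated wide load, while the big-endian case only matches when the narrow load starts 4 bytes later, which is presumably what an offsetting GEP would have to provide.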

Example:
https://bugs.llvm.org/show_bug.cgi?id=16739
...in that report, we have bitcasts bracketing these ops, so those could get eliminated too.

We've generally ruled out widening of loads early in IR ( LoadCombine - http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html ), but I'm not sure if that reasoning applies to narrowing.

There's another request for narrowing in IR here, but it's a different pattern:
https://bugs.llvm.org/show_bug.cgi?id=42424

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Jul 9 2019, 11:29 AM

Herald added a project: Restricted Project. Jul 9 2019, 11:29 AM

Herald added subscribers: hiraditya, wdng, mcrosier.

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we lose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 loses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)
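The two forms of the sum in the example above can be sketched in plain C++ as follows. This is only an illustration under the assumption that `p` points at 4 readable bytes; `sumNarrow` and `sumWide` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Narrowed form: only bytes 0, 1 and 3 are ever read, so nothing in this
// function tells an optimizer that byte 2 is safe to access.
inline uint8_t sumNarrow(const uint8_t *p) {
  assert(p != nullptr);
  return static_cast<uint8_t>(p[0] + p[1] + p[3]);
}

// Wide form: one 4-byte load, clear lane 2, then a horizontal sum.
// The memcpy reads p[2], so this version is only valid when that byte is
// known to be dereferenceable.
inline uint8_t sumWide(const uint8_t *p) {
  assert(p != nullptr);
  uint8_t a[4];
  std::memcpy(a, p, sizeof(a));
  a[2] = 0;
  return static_cast<uint8_t>(a[0] + a[1] + a[2] + a[3]);
}
```

Both functions return the same value for any 4-byte buffer; the difference is that the wide form touches a byte the narrow form never proves accessible.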

In D64432#1582980, @reames wrote:

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we lose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 loses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)

Yes, I agree that we can lose information by narrowing. I was hoping to avoid that conflict with D64258, but we need to refine our definition of 'dereferenceable'. Potentially, we could have instcombine preserve the dereferenceable range in this transform?

I don't know if it makes a difference, but my intent is to not allow narrowing for vectors in this patch by using the data layout legality check. (We could make that vector bailout explicit.) So I don't think the given example with a vector type is at risk.

define i8 @narrowload(<2 x i8>* %p) {
  %a = load <2 x i8>, <2 x i8>* %p
  %p1 = getelementptr <2 x i8>, <2 x i8>* %p, i64 1
  %b = load <2 x i8>, <2 x i8>* %p1
  %a0 = extractelement <2 x i8> %a, i64 0
  %a1 = extractelement <2 x i8> %a, i64 1
  %b1 = extractelement <2 x i8> %b, i64 1
  %add1 = add i8 %a0, %a1
  %add2 = add i8 %add1, %b1
  ret i8 %add2
}
;sum = a[0] + a[1] + b[1]

In D64432#1582980, @reames wrote:

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we lose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 loses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)

Could we check here if the base pointer has a dereferenceable annotation and use that as a condition for this transformation? (It's more complicated to be completely lossless, but this seems to be an easy-to-test starting point.)

In D64432#1584782, @jdoerfert wrote:

Could we check here if the base pointer has a dereferenceable annotation and use that as a condition for this transformation? (It's more complicated to be completely lossless, but this seems to be an easy-to-test starting point.)

That seems like a reasonable way to overcome the potential loss, so I'll update this patch with that change.
I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

Patch updated:

Add limitation based on dereferenceable attribute to prevent information loss.
Add/adjust tests to include dereferenceable attributes.
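The limitation described above amounts to a size comparison. A minimal standalone sketch, assuming `derefBytes` stands in for the byte count known from the pointer's dereferenceable attribute and `wideLoadBits` for the width of the original load (`keepsDereferenceableInfo` is a hypothetical name, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the pointer is already known to be dereferenceable for
// at least the full width of the wide load, so replacing that load with a
// narrower one does not discard any dereferenceability information.
inline bool keepsDereferenceableInfo(uint64_t derefBytes,
                                     uint64_t wideLoadBits) {
  assert(wideLoadBits > 0 && "a load must read at least one bit");
  uint64_t derefBits = derefBytes * 8;
  return derefBits >= wideLoadBits;
}
```

For instance, an `i64` load from a `dereferenceable(8)` pointer passes the check, while the same load from a pointer with no attribute (0 known bytes) does not, so the narrowing would be skipped.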

I think you are missing the negative test without dereferenceable.

In D64432#1588103, @spatel wrote:

I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

Agreed. Though, we already have dereferenceable arguments. @uenoku is now working on dereferenceable deduction for the Attributor :)

Patch updated:
Add a test with no 'dereferenceable' attribute on the pointer argument.

In D64432#1589539, @jdoerfert wrote:

I think you are missing the negative test without dereferenceable.

Yes - updated.

In D64432#1588103, @spatel wrote:

I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

Agreed. Though, we already have dereferenceable arguments. @uenoku is now working on dereferenceable deduction for the Attributor :)

Great!

Looks like you guys came to a completely reasonable short-term solution with the dereferenceable limitation. I agree we need a better answer here long term, but being able to make progress on short-term items without blocking is good to see.

Amusingly, I just stumbled across an example locally where narrowing a vector load was the right answer performance-wise. :)

Ping.

LGTM with dereferenceable() restriction, but maybe wait for one more review.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
714 ↗	(On Diff #210342)	This one is deprecated, let's use the proper one: LoadInst *NarrowLoad = new LoadInst(PtrTy, Bitcast);
llvm/test/Transforms/InstCombine/trunc-load.ll
5 ↗	(On Diff #210342)	Please precommit the tests.

This revision is now accepted and ready to land. Jul 24 2019, 5:07 AM

spatel marked an inline comment as done. Jul 24 2019, 5:13 AM

spatel added inline comments.

llvm/test/Transforms/InstCombine/trunc-load.ll
5 ↗	(On Diff #210342)	Yes - the tests were evolving with the code changes here, but now that we are settled on the constraints, I'll push those with baseline results as a preliminary step. Thanks for reviewing!

spatel mentioned this in rL366901: [InstCombine] add tests for load narrowing; NFC. Jul 24 2019, 5:43 AM

spatel mentioned this in rG362407442690: [InstCombine] add tests for load narrowing; NFC. Jul 24 2019, 5:49 AM

Patch updated:

Use new API for creating load that takes element type and pointer.
Rebased after adding baseline tests (rL366901).

spatel marked an inline comment as done. Jul 24 2019, 7:23 AM

spatel added inline comments.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
714 ↗	(On Diff #210342)	This isn't obvious (because there are apparently no header comments for any of these constructors), but the type specified by that 1st parameter is the element type of the loaded value, not the pointer type.

One final request, otherwise LGTM.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
715 ↗	(On Diff #211500)	I think you need to copy over more information (probably there is code for that somewhere): `!nontemporal !<index>` `!invariant.load !<index>` `!invariant.group !<index>`

spatel mentioned this in rL366949: [Transforms] move copying of load metadata to helper function; NFC. Jul 24 2019, 3:14 PM

spatel mentioned this in rG86e9f9dc26fd: [Transforms] move copying of load metadata to helper function; NFC.

spatel marked an inline comment as done. Jul 24 2019, 3:50 PM

spatel added inline comments.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
715 ↗	(On Diff #211500)	The closest match that I see is here: rL366949

Closed by commit rL367011: [InstCombine] try to narrow a truncated load (authored by spatel). Jul 25 2019, 5:13 AM

This revision was automatically updated to reflect the committed changes.

Diff 211721

llvm/trunk/lib/Transforms/InstCombine/InstCombineCasts.cpp

[... 675 lines not shown; enclosing context: if (isa<UndefValue>(VecOp)) { ...]
     UndefValue *NarrowUndef = UndefValue::get(DestTy);
     Value *NarrowOp = Builder.CreateCast(Opcode, ScalarOp, DestScalarTy);
     return InsertElementInst::Create(NarrowUndef, NarrowOp, Index);
   }

   return nullptr;
 }

+static Instruction *narrowLoad(TruncInst &Trunc,
+                               InstCombiner::BuilderTy &Builder,
+                               const DataLayout &DL) {
+  // Check the layout to ensure we are not creating an unsupported operation.
+  // TODO: Create a GEP to offset the load?
+  if (!DL.isLittleEndian())
+    return nullptr;
+  unsigned NarrowBitWidth = Trunc.getDestTy()->getPrimitiveSizeInBits();
+  if (!DL.isLegalInteger(NarrowBitWidth))
+    return nullptr;
+
+  // Match a truncated load with no other uses.
+  Value *X;
+  if (!match(Trunc.getOperand(0), m_OneUse(m_Load(m_Value(X)))))
+    return nullptr;
+  LoadInst *WideLoad = cast<LoadInst>(Trunc.getOperand(0));
+  if (!WideLoad->isSimple())
+    return nullptr;
+
+  // Don't narrow this load if we would lose information about the
+  // dereferenceable range.
+  bool CanBeNull;
+  uint64_t DerefBits = X->getPointerDereferenceableBytes(DL, CanBeNull) * 8;
+  if (DerefBits < WideLoad->getType()->getPrimitiveSizeInBits())
+    return nullptr;
+
+  // trunc (load X) --> load (bitcast X)
+  PointerType *PtrTy = PointerType::get(Trunc.getDestTy(),
+                                        WideLoad->getPointerAddressSpace());
+  Value *Bitcast = Builder.CreatePointerCast(X, PtrTy);
+  LoadInst *NarrowLoad = new LoadInst(Trunc.getDestTy(), Bitcast);
+  NarrowLoad->setAlignment(WideLoad->getAlignment());
+  copyMetadataForLoad(*NarrowLoad, *WideLoad);
+  return NarrowLoad;
+}
+
 Instruction *InstCombiner::visitTrunc(TruncInst &CI) {
   if (Instruction *Result = commonCastTransforms(CI))
     return Result;

   Value *Src = CI.getOperand(0);
   Type *DestTy = CI.getType(), *SrcTy = Src->getType();

   // Attempt to truncate the entire input expression tree to the destination
[... 143 lines not shown; enclosing context: if (match(Src, m_Shl(m_Value(A), m_ConstantInt(Cst))) && ...]
                 ConstantInt::get(DestTy, Cst->getValue().trunc(DestSize)));
       }
     }
   }

   if (Instruction *I = foldVecTruncToExtElt(CI, *this))
     return I;

+  if (Instruction *NewLoad = narrowLoad(CI, Builder, DL))
+    return NewLoad;
+
   return nullptr;
 }

 Instruction *InstCombiner::transformZExtICmp(ICmpInst *ICI, ZExtInst &CI,
                                              bool DoTransform) {
   // If we are just checking for a icmp eq of a single bit and zext'ing it
   // to an integer, then shift the bit to the appropriate place and then
   // cast to integer to avoid the comparison.
[... 1,608 lines not shown ...]

llvm/trunk/test/Transforms/InstCombine/trunc-load.ll

[... 23 lines not shown ...]
   %x = load i64, i64* %ptr
   %r = trunc i64 %x to i32
   ret i32 %r
 }

 ; On little-endian, we can narrow the load without an offset.

 define i32 @truncload_deref(i64* dereferenceable(8) %ptr) {
-; CHECK-LABEL: @truncload_deref(
-; CHECK-NEXT: [[X:%.*]] = load i64, i64* [[PTR:%.*]], align 4
-; CHECK-NEXT: [[R:%.*]] = trunc i64 [[X]] to i32
-; CHECK-NEXT: ret i32 [[R]]
+; LE-LABEL: @truncload_deref(
+; LE-NEXT: [[TMP1:%.*]] = bitcast i64* [[PTR:%.*]] to i32*
+; LE-NEXT: [[R:%.*]] = load i32, i32* [[TMP1]], align 4
+; LE-NEXT: ret i32 [[R]]
+;
+; BE-LABEL: @truncload_deref(
+; BE-NEXT: [[X:%.*]] = load i64, i64* [[PTR:%.*]], align 4
+; BE-NEXT: [[R:%.*]] = trunc i64 [[X]] to i32
+; BE-NEXT: ret i32 [[R]]
 ;
   %x = load i64, i64* %ptr
   %r = trunc i64 %x to i32
   ret i32 %r
 }

 ; Preserve alignment.

 define i16 @truncload_align(i32* dereferenceable(14) %ptr) {
-; CHECK-LABEL: @truncload_align(
-; CHECK-NEXT: [[X:%.*]] = load i32, i32* [[PTR:%.*]], align 16
-; CHECK-NEXT: [[R:%.*]] = trunc i32 [[X]] to i16
-; CHECK-NEXT: ret i16 [[R]]
+; LE-LABEL: @truncload_align(
+; LE-NEXT: [[TMP1:%.*]] = bitcast i32* [[PTR:%.*]] to i16*
+; LE-NEXT: [[R:%.*]] = load i16, i16* [[TMP1]], align 16
+; LE-NEXT: ret i16 [[R]]
+;
+; BE-LABEL: @truncload_align(
+; BE-NEXT: [[X:%.*]] = load i32, i32* [[PTR:%.*]], align 16
+; BE-NEXT: [[R:%.*]] = trunc i32 [[X]] to i16
+; BE-NEXT: ret i16 [[R]]
 ;
   %x = load i32, i32* %ptr, align 16
   %r = trunc i32 %x to i16
   ret i16 %r
 }

 ; Negative test - extra use means we would not eliminate the original load.

[... 36 lines not shown ...]
   %x = load volatile i64, i64* %ptr, align 8
   %r = trunc i64 %x to i32
   ret i32 %r
 }

 ; Preserve address space.

 define i32 @truncload_address_space(i64 addrspace(1)* dereferenceable(8) %ptr) {
-; CHECK-LABEL: @truncload_address_space(
-; CHECK-NEXT: [[X:%.*]] = load i64, i64 addrspace(1)* [[PTR:%.*]], align 4
-; CHECK-NEXT: [[R:%.*]] = trunc i64 [[X]] to i32
-; CHECK-NEXT: ret i32 [[R]]
+; LE-LABEL: @truncload_address_space(
+; LE-NEXT: [[TMP1:%.*]] = bitcast i64 addrspace(1)* [[PTR:%.*]] to i32 addrspace(1)*
+; LE-NEXT: [[R:%.*]] = load i32, i32 addrspace(1)* [[TMP1]], align 4
+; LE-NEXT: ret i32 [[R]]
+;
+; BE-LABEL: @truncload_address_space(
+; BE-NEXT: [[X:%.*]] = load i64, i64 addrspace(1)* [[PTR:%.*]], align 4
+; BE-NEXT: [[R:%.*]] = trunc i64 [[X]] to i32
+; BE-NEXT: ret i32 [[R]]
 ;
   %x = load i64, i64 addrspace(1)* %ptr, align 4
   %r = trunc i64 %x to i32
   ret i32 %r
 }

+; Most metadata should be transferred to the narrow load.
+; TODO: We lost the range.
+
+define i32 @truncload_metadata(i64* dereferenceable(8) %ptr) {
+; LE-LABEL: @truncload_metadata(
+; LE-NEXT: [[TMP1:%.*]] = bitcast i64* [[PTR:%.*]] to i32*
+; LE-NEXT: [[R:%.*]] = load i32, i32* [[TMP1]], align 4, !invariant.load !0, !nontemporal !1
+; LE-NEXT: ret i32 [[R]]
+;
+; BE-LABEL: @truncload_metadata(
+; BE-NEXT: [[X:%.*]] = load i64, i64* [[PTR:%.*]], align 4, !range !0, !invariant.load !1, !nontemporal !2
+; BE-NEXT: [[R:%.*]] = trunc i64 [[X]] to i32
+; BE-NEXT: ret i32 [[R]]
+;
+  %x = load i64, i64* %ptr, align 4, !invariant.load !0, !nontemporal !1, !range !2
+  %r = trunc i64 %x to i32
+  ret i32 %r
+}
+
+!0 = !{}
+!1 = !{i32 1}
+!2 = !{i64 0, i64 2}

This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] try to narrow a truncated load
ClosedPublic
