The previous change to fix a crash in the vectorizer introduced
performance regressions. The condition to preserve the pointer
address space during the search is too tight; we only need to
match the pointer size.
Repository: rL LLVM
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:29–30 (On Diff #212880)
It would probably be a more useful test to use an address space that will actually be vectorized (i.e. use global instead of flat).
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:29–30 (On Diff #212880)
It is neither flat nor global; I have removed all references to AMD here. There is no triple and no calling convention, only a data layout string. It looks to me that the root cause is in:
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
If we were to use addrspacecast(1) instead of 5 in the example below, we'd proceed with the checks. I guess one way to handle mismatched address spaces would be to normalize the pointers to a common address space (generic?) and then run the consecutiveness checks.
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:3 (On Diff #212880)
It would be good to add a comment that p5:32:32 is the critical part here, as that's what causes the original problem.
Test update.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
In the meanwhile I am working on a follow-up patch to handle pointer size differences in a graceful way. That requires more code and time, though.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
The reason SCEV is unhappy is that we're asking it to do something with different types. Bailing out early will avoid the problem, but it's still too conservative, IMO. We may still be able to do useful optimizations with such pointers.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Look, as I said, I am working on the more relaxed patch to handle it. I am just trying to think about the corner cases, for example this one:

```
target datalayout = "e-p:64:64-p1:64:64-p5:32:32"

define amdgpu_kernel void @ext_ptr_overflow_unsigned(i32 addrspace(5)* %p) {
entry:
  %gep1 = getelementptr inbounds i32, i32 addrspace(5)* %p, i64 4294967297
  %gep2 = getelementptr inbounds i32, i32 addrspace(5)* %p, i64 4294967298
  %a.ascast = addrspacecast i32 addrspace(5)* %gep1 to i32*
  %b.ascast = addrspacecast i32 addrspace(5)* %gep2 to i32*
  %tmp1 = load i32, i32* %a.ascast, align 8
  %tmp2 = load i32, i32* %b.ascast, align 8
  unreachable
}
```
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Fair enough. Let's land your patch as is to unblock things for now.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Looks like stripAndAccumulateInBoundsConstantOffsets() takes care of it, so the offset always fits the smallest data type in the chain.
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347
Why the store size?
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347
This is a memory access, so it should use the in-memory size.