This is an archive of the discontinued LLVM Phabricator instance.

Skip bitcasts while looking for GEP in LoadStoreVectorizer
ClosedPublic

Authored by rampitec on Apr 14 2017, 4:15 PM.

Details

Reviewers

arsenm
vpykhtin

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Apr 14 2017, 4:15 PM

Herald added subscribers: nhaehnle, wdng, mzolotukhin. Apr 14 2017, 4:15 PM

I don't think this should be needed. Is this representative of the case where you saw this as a problem? If I run this testcase through -separate-const-offset-from-gep first, it is handled. This and other cases are expected to be cleaned up by that pass first.

In D32101#727681, @arsenm wrote:

I don't think this should be needed. Is this representative of the case where you saw this as a problem? If I run this testcase through -separate-const-offset-from-gep first, it is handled. This and other cases are expected to be cleaned up by that pass first.

It is quite representative. Here is the real piece of code from app, right as it comes to the vectorizer:

%97 = zext i32 %96 to i64
%98 = getelementptr inbounds float, float addrspace(1)* %1, i64 %97
%99 = bitcast float addrspace(1)* %98 to i32 addrspace(1)*
%100 = load i32, i32 addrspace(1)* %99, align 4, !tbaa !10
%101 = add i32 %96, 1
%102 = zext i32 %101 to i64
%103 = getelementptr inbounds float, float addrspace(1)* %1, i64 %102
%104 = bitcast float addrspace(1)* %103 to i32 addrspace(1)*
%105 = load i32, i32 addrspace(1)* %104, align 4, !tbaa !10
%106 = add i32 %96, 2
%107 = zext i32 %106 to i64
%108 = getelementptr inbounds float, float addrspace(1)* %1, i64 %107
%109 = bitcast float addrspace(1)* %108 to i32 addrspace(1)*
%110 = load i32, i32 addrspace(1)* %109, align 4, !tbaa !10
%111 = add i32 %96, 3
%112 = zext i32 %111 to i64
%113 = getelementptr inbounds float, float addrspace(1)* %1, i64 %112
%114 = bitcast float addrspace(1)* %113 to i32 addrspace(1)*
%115 = load i32, i32 addrspace(1)* %114, align 4, !tbaa !10

What does it look like immediately after SeparateConstOffsetFromGEP? Does one of the other passes break this somehow?

In D32101#727691, @arsenm wrote:

What does it look like immediately after SeparateConstOffsetFromGEP? Does one of the other passes break this somehow?

This is after SeparateConstOffsetFromGEP:

%408 = zext i32 %407 to i64
%409 = getelementptr inbounds float, float addrspace(1)* %1, i64 %408
%410 = bitcast float addrspace(1)* %409 to i32 addrspace(1)*
%411 = load i32, i32 addrspace(1)* %410, align 4, !tbaa !10
%412 = or i32 %407, 1
%413 = zext i32 %412 to i64
%414 = getelementptr inbounds float, float addrspace(1)* %1, i64 %413
%415 = bitcast float addrspace(1)* %414 to i32 addrspace(1)*
%416 = load i32, i32 addrspace(1)* %415, align 4, !tbaa !10
%417 = or i32 %407, 2
%418 = zext i32 %417 to i64
%419 = getelementptr inbounds float, float addrspace(1)* %1, i64 %418
%420 = bitcast float addrspace(1)* %419 to i32 addrspace(1)*
%421 = load i32, i32 addrspace(1)* %420, align 4, !tbaa !10
%422 = or i32 %407, 3
%423 = zext i32 %422 to i64
%424 = getelementptr inbounds float, float addrspace(1)* %1, i64 %423
%425 = bitcast float addrspace(1)* %424 to i32 addrspace(1)*
%426 = load i32, i32 addrspace(1)* %425, align 4, !tbaa !10

The bitcasts are still there, and we now have ors instead of adds.
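For readers wondering why an or can stand in for an add here: the two compute the same value whenever the operands share no set bits, because no carry can occur. Presumably the low two bits of %407 are known to be zero in the real code, though the thread does not say so. A minimal sketch of the equivalence in plain C++ (not LLVM code; the function names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// True when `base | c` and `base + c` give the same value. This holds
// whenever base and c have no set bits in common, so no carry can occur.
constexpr bool orActsAsAdd(uint32_t base, uint32_t c) {
  return (base & c) == 0;
}

// Hypothetical index computations, mirroring `or i32 %407, C` and
// `add i32 %96, C` from the IR dumps in this thread.
constexpr uint32_t indexWithOr(uint32_t base, uint32_t c) { return base | c; }
constexpr uint32_t indexWithAdd(uint32_t base, uint32_t c) { return base + c; }

// With a base that is a multiple of 4, constants 1..3 fit in the free low bits.
static_assert(indexWithOr(8, 3) == indexWithAdd(8, 3), "low bits are free");
```

If the base is not suitably aligned (for example base = 9, c = 1), the two forms diverge, which is why a pass matching only add would miss the or form unless it also reasons about known bits.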

Replaced code with standard stripPointerCasts().

Can you add a test with bitcasts between pointers with different element types? I thought the or problem was also supposed to be solved

In D32101#729447, @arsenm wrote:

Can you add a test with bitcasts between pointers with different element types? I thought the or problem was also supposed to be solved

Matt, what exactly do you mean by pointers with different element types? They are different in this test: float vs i32.

In D32101#729486, @rampitec wrote:

In D32101#729447, @arsenm wrote:

Can you add a test with bitcasts between pointers with different element types? I thought the or problem was also supposed to be solved

Matt, what exactly do you mean by pointers with different element types? They are different in this test: float vs i32.

I meant type sizes. float and i32 are both 4 bytes. I could see something going wrong if later code relied on the sizes being equal and the source type were, for example, i8.

Added check for pointee size.
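To make the size concern concrete, here is a small arithmetic sketch in plain C++ (not LLVM code; the helper names and parameters are hypothetical). A single-index GEP advances by index times the size of its element type, so an index difference of 1 means 4 bytes for a float GEP but only 1 byte for an i8 GEP, while the bitcast loads are 4 bytes wide either way.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset produced by a single-index GEP:
// index multiplied by the size of the GEP's element type.
constexpr int64_t gepByteOffset(int64_t index, int64_t elemSize) {
  return index * elemSize;
}

// Two accesses are consecutive when the second one starts exactly
// where the first one ends.
constexpr bool isConsecutive(int64_t offA, int64_t offB, int64_t accessSize) {
  return offB - offA == accessSize;
}

// float GEP, indices base and base+1, 4-byte loads: consecutive.
static_assert(isConsecutive(gepByteOffset(10, 4), gepByteOffset(11, 4), 4),
              "float stride 1 is consecutive for i32 loads");
```

Under the same model, an i8 GEP with indices base and base+1 gives offsets 1 byte apart, so the 4-byte loads overlap and must not be merged. An i8 GEP with indices base and base+4 is consecutive again; that appears to be the case the TODO in the patch describes as not yet vectorized.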

Ping.

rampitec added a reviewer: vpykhtin. Apr 24 2017, 8:41 PM

LGTM.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
Line 295: Maybe turn it into a helper function?

vpykhtin accepted this revision. Apr 25 2017, 5:37 AM

This revision is now accepted and ready to land. Apr 25 2017, 5:37 AM

Factored out getSourceGEP() function as per review.
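The factored-out version is not shown on this page. As a rough illustration only, the shape of such a helper can be modeled in self-contained C++ with a toy pointer-chain type. The Node struct, Kind enum, and stripCasts below are assumptions made up for this sketch; only the name getSourceGEP comes from the review, and the body is not the committed code.

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for an IR pointer value (hypothetical, not an LLVM type).
enum class Kind { GEP, BitCast, Other };

struct Node {
  Kind kind;
  uint64_t pointeeSize;  // store size of the pointee type, in bytes
  const Node *operand;   // source pointer of a BitCast, otherwise nullptr
};

// Walk through any chain of bitcasts to the underlying pointer.
const Node *stripCasts(const Node *n) {
  while (n && n->kind == Kind::BitCast)
    n = n->operand;
  return n;
}

// Return the GEP feeding `ptr`, looking through casts only when the
// pointee size is the same with and without them. `ptr` must be non-null.
const Node *getSourceGEP(const Node *ptr) {
  const Node *base = stripCasts(ptr);
  if (base && base->pointeeSize == ptr->pointeeSize)
    ptr = base;
  return ptr->kind == Kind::GEP ? ptr : nullptr;
}
```

With a helper of this shape, the caller's two duplicated SrcA/SrcB blocks collapse to one call per pointer, which is what the inline review comment asked for.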

Looks good.

rL301343

Revision Contents

Path                                                         Size
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp             19 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll    83 lines

Diff 95633

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

[277 lines not shown]
     if (X == PtrSCEVB)
       return true;

     // Sometimes even this doesn't work, because SCEV can't always see through
     // patterns that look like (gep (ext (add (shl X, C1), C2))). Try checking
     // things the hard way.

     // Look through GEPs after checking they're the same except for the last
     // index.
-    GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(getPointerOperand(A));
-    GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(getPointerOperand(B));
+    // First strip pointer bitcasts. Make sure pointee size is the same with
+    // and without casts.
+    // TODO: a stride set by the add instruction below can match the difference
+    // in pointee type size here. Currently it will not be vectorized.
+    Value *SrcA = getPointerOperand(A);
+    Value *SrcABase = SrcA->stripPointerCasts();
+    if (DL.getTypeStoreSize(SrcA->getType()->getPointerElementType()) ==
+        DL.getTypeStoreSize(SrcABase->getType()->getPointerElementType()))
+      SrcA = SrcABase;
+    Value *SrcB = getPointerOperand(B);
+    Value *SrcBBase = SrcB->stripPointerCasts();
+    if (DL.getTypeStoreSize(SrcB->getType()->getPointerElementType()) ==
+        DL.getTypeStoreSize(SrcBBase->getType()->getPointerElementType()))
+      SrcB = SrcBBase;
+    GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(SrcA);
+    GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(SrcB);
     if (!GEPA || !GEPB || GEPA->getNumOperands() != GEPB->getNumOperands())
       return false;
     unsigned FinalIndex = GEPA->getNumOperands() - 1;
     for (unsigned i = 0; i < FinalIndex; i++)
       if (GEPA->getOperand(i) != GEPB->getOperand(i))
         return false;

     Instruction *OpA = dyn_cast<Instruction>(GEPA->getOperand(FinalIndex));
[764 lines not shown]

Inline comment by vpykhtin on the new SrcA block (marked Done): Maybe turn it into a helper function?

test/Transforms/LoadStoreVectorizer/AMDGPU/gep-bitcast.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn--amdhsa -load-store-vectorizer < %s | FileCheck %s

				; Check that vectorizer can find a GEP through bitcast
				; CHECK-LABEL: @vect_zext_bitcast_f32_to_i32_idx
				; CHECK: load <4 x i32>
				define void @vect_zext_bitcast_f32_to_i32_idx(float addrspace(1)* %arg1, i32 %base) {
				%add1 = add nuw i32 %base, 0
				%zext1 = zext i32 %add1 to i64
				%gep1 = getelementptr inbounds float, float addrspace(1)* %arg1, i64 %zext1
				%f2i1 = bitcast float addrspace(1)* %gep1 to i32 addrspace(1)*
				%load1 = load i32, i32 addrspace(1)* %f2i1, align 4
				%add2 = add nuw i32 %base, 1
				%zext2 = zext i32 %add2 to i64
				%gep2 = getelementptr inbounds float, float addrspace(1)* %arg1, i64 %zext2
				%f2i2 = bitcast float addrspace(1)* %gep2 to i32 addrspace(1)*
				%load2 = load i32, i32 addrspace(1)* %f2i2, align 4
				%add3 = add nuw i32 %base, 2
				%zext3 = zext i32 %add3 to i64
				%gep3 = getelementptr inbounds float, float addrspace(1)* %arg1, i64 %zext3
				%f2i3 = bitcast float addrspace(1)* %gep3 to i32 addrspace(1)*
				%load3 = load i32, i32 addrspace(1)* %f2i3, align 4
				%add4 = add nuw i32 %base, 3
				%zext4 = zext i32 %add4 to i64
				%gep4 = getelementptr inbounds float, float addrspace(1)* %arg1, i64 %zext4
				%f2i4 = bitcast float addrspace(1)* %gep4 to i32 addrspace(1)*
				%load4 = load i32, i32 addrspace(1)* %f2i4, align 4
				ret void
				}

				; CHECK-LABEL: @vect_zext_bitcast_i8_st1_to_i32_idx
				; CHECK: load i32
				; CHECK: load i32
				; CHECK: load i32
				; CHECK: load i32
				define void @vect_zext_bitcast_i8_st1_to_i32_idx(i8 addrspace(1)* %arg1, i32 %base) {
				%add1 = add nuw i32 %base, 0
				%zext1 = zext i32 %add1 to i64
				%gep1 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext1
				%f2i1 = bitcast i8 addrspace(1)* %gep1 to i32 addrspace(1)*
				%load1 = load i32, i32 addrspace(1)* %f2i1, align 4
				%add2 = add nuw i32 %base, 1
				%zext2 = zext i32 %add2 to i64
				%gep2 = getelementptr inbounds i8,i8 addrspace(1)* %arg1, i64 %zext2
				%f2i2 = bitcast i8 addrspace(1)* %gep2 to i32 addrspace(1)*
				%load2 = load i32, i32 addrspace(1)* %f2i2, align 4
				%add3 = add nuw i32 %base, 2
				%zext3 = zext i32 %add3 to i64
				%gep3 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext3
				%f2i3 = bitcast i8 addrspace(1)* %gep3 to i32 addrspace(1)*
				%load3 = load i32, i32 addrspace(1)* %f2i3, align 4
				%add4 = add nuw i32 %base, 3
				%zext4 = zext i32 %add4 to i64
				%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4
				%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*
				%load4 = load i32, i32 addrspace(1)* %f2i4, align 4
				ret void
				}

				; TODO: This can be vectorized, but currently vectorizer unable to do it.
				; CHECK-LABEL: @vect_zext_bitcast_i8_st4_to_i32_idx
				define void @vect_zext_bitcast_i8_st4_to_i32_idx(i8 addrspace(1)* %arg1, i32 %base) {
				%add1 = add nuw i32 %base, 0
				%zext1 = zext i32 %add1 to i64
				%gep1 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext1
				%f2i1 = bitcast i8 addrspace(1)* %gep1 to i32 addrspace(1)*
				%load1 = load i32, i32 addrspace(1)* %f2i1, align 4
				%add2 = add nuw i32 %base, 4
				%zext2 = zext i32 %add2 to i64
				%gep2 = getelementptr inbounds i8,i8 addrspace(1)* %arg1, i64 %zext2
				%f2i2 = bitcast i8 addrspace(1)* %gep2 to i32 addrspace(1)*
				%load2 = load i32, i32 addrspace(1)* %f2i2, align 4
				%add3 = add nuw i32 %base, 8
				%zext3 = zext i32 %add3 to i64
				%gep3 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext3
				%f2i3 = bitcast i8 addrspace(1)* %gep3 to i32 addrspace(1)*
				%load3 = load i32, i32 addrspace(1)* %f2i3, align 4
				%add4 = add nuw i32 %base, 16
				%zext4 = zext i32 %add4 to i64
				%gep4 = getelementptr inbounds i8, i8 addrspace(1)* %arg1, i64 %zext4
				%f2i4 = bitcast i8 addrspace(1)* %gep4 to i32 addrspace(1)*
				%load4 = load i32, i32 addrspace(1)* %f2i4, align 4
				ret void
				}