This is an archive of the discontinued LLVM Phabricator instance.

[LoadStoreVectorizer] Use getMinusScev() to compute the distance between two pointers.
ClosedPublic

Authored by FarhanaAleen on Jul 18 2018, 4:06 PM.


Details

Reviewers

Commits

rG8c7a30baea21: [LoadStoreVectorizer] Use getMinusScev() to compute the distance between two…
rL337471: [LoadStoreVectorizer] Use getMinusScev() to compute the distance between two…

Summary

Currently, isConsecutiveAccess() detects two pointers (PtrA and PtrB) as consecutive by comparing PtrB with BaseDelta + PtrA. This works when both pointers are factorized or when neither is. But isConsecutiveAccess() fails if one of the pointers is factorized and the other is not.

Here is an example:
PtrA = 4 * (A + B)
PtrB = 4 + 4A + 4B

This patch uses getMinusSCEV() to compute the distance between two pointers. getMinusSCEV() allows combining the expressions and computing the simplified distance.
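
Working the example through: PtrB - PtrA = (4 + 4A + 4B) - 4 * (A + B) = 4, which is a constant even though BaseDelta + PtrA and PtrB are built differently. The sketch below is a toy model of that idea in self-contained C++, not LLVM's implementation. Expr, sameShape, and constantDistance are hypothetical names invented for illustration, and the tree comparison only approximates how comparing SCEV pointers behaves.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy expression tree: a hypothetical stand-in for SCEV, not LLVM's API.
struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

struct Expr {
  enum Kind { Const, Var, Add, Scale } kind;
  int64_t value;            // Const: the constant; Scale: the multiplier
  std::string name;         // Var: the symbol name
  std::vector<ExprPtr> ops; // Add: all operands; Scale: the single operand
};

ExprPtr constant(int64_t v) {
  return std::make_shared<Expr>(Expr{Expr::Const, v, "", {}});
}
ExprPtr var(std::string n) {
  return std::make_shared<Expr>(Expr{Expr::Var, 0, std::move(n), {}});
}
ExprPtr add(std::vector<ExprPtr> ops) {
  return std::make_shared<Expr>(Expr{Expr::Add, 0, "", std::move(ops)});
}
ExprPtr scale(int64_t k, ExprPtr e) {
  return std::make_shared<Expr>(Expr{Expr::Scale, k, "", {std::move(e)}});
}

// Structural comparison: true only if both trees are built the same way.
// This plays the role of the old `X == PtrSCEVB` check in this toy model.
bool sameShape(const ExprPtr &a, const ExprPtr &b) {
  if (a->kind != b->kind || a->value != b->value || a->name != b->name ||
      a->ops.size() != b->ops.size())
    return false;
  for (std::size_t i = 0; i < a->ops.size(); ++i)
    if (!sameShape(a->ops[i], b->ops[i]))
      return false;
  return true;
}

// Canonical affine form: a constant plus one coefficient per symbol.
struct Affine {
  int64_t constant = 0;
  std::map<std::string, int64_t> coeff;
};

// Distribute multipliers over sums and collect like terms into `out`.
void accumulate(const ExprPtr &e, int64_t factor, Affine &out) {
  switch (e->kind) {
  case Expr::Const: out.constant += factor * e->value; break;
  case Expr::Var:   out.coeff[e->name] += factor; break;
  case Expr::Add:
    for (const ExprPtr &op : e->ops) accumulate(op, factor, out);
    break;
  case Expr::Scale: accumulate(e->ops[0], factor * e->value, out); break;
  }
}

// Computes b - a. Returns true and sets `dist` if every symbolic term
// cancels, which is the situation the new `C == Dist` check looks for.
bool constantDistance(const ExprPtr &a, const ExprPtr &b, int64_t &dist) {
  Affine diff;
  accumulate(b, +1, diff);
  accumulate(a, -1, diff);
  for (const auto &term : diff.coeff)
    if (term.second != 0) return false;
  dist = diff.constant;
  return true;
}

// The two pointers from the summary.
ExprPtr examplePtrA() { return scale(4, add({var("A"), var("B")})); }
ExprPtr examplePtrB() {
  return add({constant(4), scale(4, var("A")), scale(4, var("B"))});
}
```

With these definitions, sameShape(add({examplePtrA(), constant(4)}), examplePtrB()) is false because the two trees are built differently, while constantDistance(examplePtrA(), examplePtrB(), dist) succeeds with dist equal to 4. The real getMinusSCEV() works on SCEV expressions and applies its own simplification rules; the flattening here is only an analogy for "re-combining the expressions".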

Diff Detail

Repository: rL LLVM

Event Timeline

FarhanaAleen created this revision. Jul 18 2018, 4:06 PM

Herald added subscribers: javed.absar, nhaehnle. Jul 18 2018, 4:06 PM

LGTM

This revision is now accepted and ready to land. Jul 18 2018, 5:17 PM

Closed by commit rL337471: [LoadStoreVectorizer] Use getMinusScev() to compute the distance between two… (authored by faaleen). Jul 19 2018, 9:55 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Jul 19 2018, 9:55 AM

Adding optional items noticed in post commit review.

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
339	It really looks like the new check completely subsumes this check. Have you tried simplifying the code to remove this case?
351	This block may also be subsumed by the subtract. If not, this would make a good bug against SCEV because it should be able to handle most any reasonable case here.
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/complex-index.ll
7	You really should be able to construct a target independent test for this. You don't need to exercise the actual vectorization, just the analysis phase. I'm not sure the code is structured to make this easy, but if it isn't, it really should be. (To be clear, this is not a must have, just a strong nice to have.)

FarhanaAleen added inline comments. Jul 19 2018, 1:56 PM

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
339	Thanks for your comments. Yes, this check is completely subsumed by the new check. I did not remove it because I thought performing this check (a constant plus an existing SCEV) is much cheaper than the new check, which requires creating a minus SCEV and might have to work with more expressions, so the new, more expensive check would run only if the result can't be obtained in a cheaper way. Also, most programs will very likely hit the first check. Now I am also thinking that it's probably just a one-time cost, since we cache the result, and should be pretty negligible. I will remove this check.
351	No, this block is not subsumed by the subtract. Currently, SCEV does not handle distribution around zext/sext. I am working on a patch that extends SCEV to handle this and remove this routine entirely from LoadStoreVectorizer.
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/complex-index.ll
7	Sounds good.

FarhanaAleen added inline comments. Jul 21 2018, 4:19 PM

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
351	In order to get rid of this block, a fundamental problem in SCEV needs to be solved: proving that a value "V" is not a poison value. Currently, this proof is supported only when V is in the header of a loop L where L repeatedly contains V. When V is not inside a loop, SCEV conservatively considers it a poison value. Since SCEV cannot guarantee that V will not overflow, it conservatively drops the nsw/nuw flag. func(%base) { B1: %add1 = add nsw i32 1, %base .. } In the above example, there is no way to guarantee that %add1 will not overflow other than relying on the IR flag. But SCEV does not allow mapping back to the original instruction, so the corresponding SCEV expression will not contain the nsw/nuw flag. This block of code takes advantage of the IR to extract the flag information.
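
The zext/sext limitation and the role of the nuw/nsw flags in the comments above come down to ordinary wrapping arithmetic: extending after a narrow add gives the same value as adding after extending only when the narrow add does not wrap. The following is a minimal sketch in plain C++, with 8-bit and 32-bit unsigned integers standing in for IR types. The function names are hypothetical and nothing here uses LLVM.

```cpp
#include <cstdint>

// Models zext(a + b): the add happens in 8 bits and may wrap modulo 256,
// and only then is the result widened to 32 bits.
uint32_t extendAfterAdd(uint8_t a, uint8_t b) {
  uint8_t narrow = static_cast<uint8_t>(a + b);
  return narrow;
}

// Models zext(a) + zext(b): both operands are widened first, so the
// 32-bit add cannot wrap for 8-bit inputs.
uint32_t addAfterExtend(uint8_t a, uint8_t b) {
  return static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
}

// True when the 8-bit add does not wrap, which is the property a nuw
// flag on the narrow add would assert.
bool addHasNoUnsignedWrap(uint8_t a, uint8_t b) {
  return static_cast<unsigned>(a) + static_cast<unsigned>(b) <= 0xFFu;
}
```

For a = 200 and b = 100 the narrow add wraps, so extendAfterAdd yields 44 while addAfterExtend yields 300. For a = 3 and b = 4 there is no wrap and both yield 7. An analysis may therefore distribute the extension over the add only if it knows the add cannot wrap, which is the information the nuw/nsw flags carry.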

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp	8 lines
llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/complex-index.ll	49 lines

Diff 156300

llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

 bool Vectorizer::isConsecutiveAccess(Value *A, Value *B) {
   ...
   // Compute the necessary base pointer delta to have the necessary final delta
   // equal to the size.
   APInt BaseDelta = Size - OffsetDelta;

   // Compute the distance with SCEV between the base pointers.
   const SCEV *PtrSCEVA = SE.getSCEV(PtrA);
   const SCEV *PtrSCEVB = SE.getSCEV(PtrB);
   const SCEV *C = SE.getConstant(BaseDelta);
   const SCEV *X = SE.getAddExpr(PtrSCEVA, C);
   if (X == PtrSCEVB)
     return true;

+  // The above check will not catch the cases where one of the pointers is
+  // factorized but the other one is not, such as (C + (S * (A + B))) vs
+  // (AS + BS). Get the minus scev. That will allow re-combining the expressions
+  // and getting the simplified difference.
+  const SCEV *Dist = SE.getMinusSCEV(PtrSCEVB, PtrSCEVA);
+  if (C == Dist)
+    return true;
+
   // Sometimes even this doesn't work, because SCEV can't always see through
   // patterns that look like (gep (ext (add (shl X, C1), C2))). Try checking
   // things the hard way.

   // Look through GEPs after checking they're the same except for the last
   // index.
   GetElementPtrInst *GEPA = getSourceGEP(A);
   GetElementPtrInst *GEPB = getSourceGEP(B);
   if (!GEPA || !GEPB || GEPA->getNumOperands() != GEPB->getNumOperands())
   ...

llvm/trunk/test/Transforms/LoadStoreVectorizer/AMDGPU/complex-index.ll

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				declare i64 @_Z12get_local_idj(i32)

				declare i64 @_Z12get_group_idj(i32)

				declare double @llvm.fmuladd.f64(double, double, double)

				; CHECK-LABEL: @factorizedVsNonfactorizedAccess(
				; CHECK: load <2 x float>
				; CHECK: store <2 x float>
				define amdgpu_kernel void @factorizedVsNonfactorizedAccess(float addrspace(1)* nocapture %c) {
				entry:
				%call = tail call i64 @_Z12get_local_idj(i32 0)
				%call1 = tail call i64 @_Z12get_group_idj(i32 0)
				%div = lshr i64 %call, 4
				%div2 = lshr i64 %call1, 3
				%mul = shl i64 %div2, 7
				%rem = shl i64 %call, 3
				%mul3 = and i64 %rem, 120
				%add = or i64 %mul, %mul3
				%rem4 = shl i64 %call1, 7
				%mul5 = and i64 %rem4, 896
				%mul6 = shl nuw nsw i64 %div, 3
				%add7 = add nuw i64 %mul5, %mul6
				%mul9 = shl i64 %add7, 10
				%add10 = add i64 %mul9, %add
				%arrayidx = getelementptr inbounds float, float addrspace(1)* %c, i64 %add10
				%load1 = load float, float addrspace(1)* %arrayidx, align 4
				%conv = fpext float %load1 to double
				%mul11 = fmul double %conv, 0x3FEAB481D8F35506
				%conv12 = fptrunc double %mul11 to float
				%conv18 = fpext float %conv12 to double
				%storeval1 = tail call double @llvm.fmuladd.f64(double 0x3FF4FFAFBBEC946A, double 0.000000e+00, double %conv18)
				%cstoreval1 = fptrunc double %storeval1 to float
				store float %cstoreval1, float addrspace(1)* %arrayidx, align 4

				%add23 = or i64 %add10, 1
				%arrayidx24 = getelementptr inbounds float, float addrspace(1)* %c, i64 %add23
				%load2 = load float, float addrspace(1)* %arrayidx24, align 4
				%conv25 = fpext float %load2 to double
				%mul26 = fmul double %conv25, 0x3FEAB481D8F35506
				%conv27 = fptrunc double %mul26 to float
				%conv34 = fpext float %conv27 to double
				%storeval2 = tail call double @llvm.fmuladd.f64(double 0x3FF4FFAFBBEC946A, double 0.000000e+00, double %conv34)
				%cstoreval2 = fptrunc double %storeval2 to float
				store float %cstoreval2, float addrspace(1)* %arrayidx24, align 4
				ret void
				}
				No newline at end of file