This is an archive of the discontinued LLVM Phabricator instance.

Differential D10482

Make SROA handle addrspacecast
Needs Review · Public

Authored by wengxt on Jun 16 2015, 12:21 PM.

This revision needs review, but all reviewers have resigned.

Details

Reviewers

arsenm

Summary

This is basically D4501 from Matt with a fixed test case. In the GPU
world, addrspacecast instructions are commonly used to avoid the overhead
of accessing memory with generic loads and stores. The SROA pass currently
does not take addrspacecast into account. This patch processes addrspacecast
the same way SROA handles bitcast.
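
As a rough illustration of what treating addrspacecast like bitcast means, here is a minimal C++ sketch of the kind of operand walk the patch extends. It is a toy model, not LLVM code: `Kind`, `Inst`, `isTransparent`, and `derivesFrom` are hypothetical names introduced for this example, and the real pass also tracks visited values and handles more instruction kinds.

```cpp
// Toy model (hypothetical types, not the LLVM API): each instruction has a
// kind and at most one pointer operand.
enum class Kind { Alloca, BitCast, AddrSpaceCast, GEP, Other };

struct Inst {
  Kind K;
  const Inst *Operand; // nullptr for instructions without a pointer operand
};

// True if the instruction merely forwards (a derivation of) its operand.
constexpr bool isTransparent(Kind K) {
  return K == Kind::BitCast || K == Kind::AddrSpaceCast || K == Kind::GEP;
}

// Walk from Ptr back through transparent instructions; succeed on reaching AI.
constexpr bool derivesFrom(const Inst *Ptr, const Inst *AI) {
  return Ptr == AI ? true
         : (isTransparent(Ptr->K) && Ptr->Operand != nullptr)
             ? derivesFrom(Ptr->Operand, AI)
             : false;
}

// Example chain: alloca -> addrspacecast -> gep, plus an unrelated value.
constexpr Inst A{Kind::Alloca, nullptr};
constexpr Inst Cast{Kind::AddrSpaceCast, &A};
constexpr Inst Gep{Kind::GEP, &Cast};
constexpr Inst Unrelated{Kind::Other, nullptr};

static_assert(derivesFrom(&Gep, &A), "gep of an addrspacecast reaches the alloca");
static_assert(!derivesFrom(&Unrelated, &A), "an unrelated value does not");
```

If `Kind::AddrSpaceCast` were dropped from `isTransparent`, the first assertion would fail, which mirrors how the unpatched walk stops at an addrspacecast.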

Diff Detail

Event Timeline

wengxt updated this revision to Diff 27778. Jun 16 2015, 12:21 PM

wengxt retitled this revision from to Make SROA handle addrspacecast.

wengxt updated this object.

wengxt edited the test plan for this revision.

wengxt added reviewers: arsenm, chandlerc.

wengxt added subscribers: jingyue, Unknown Object (MLST).

wengxt mentioned this in D10483: Add NVPTXLowerAlloca pass to convert alloca'ed memory to local address. Jun 16 2015, 12:30 PM

sanjoy added a subscriber: sanjoy. Jun 16 2015, 1:20 PM

sanjoy added inline comments.

lib/Transforms/Scalar/SROA.cpp
1825	Will this transform "gep(addrspacecast(X), Y)" to "addrspacecast(gep X, Y')"? If so, I think we need a target-specific hook that does a legality check on the addrspacecast we're about to insert -- I don't think that "addrspacecast X" being legal implies that an addrspacecast of a GEP derived from X is also legal.

I'm actually quite alarmed at the idea of needing to handle addrspacecast of *alloca*s. I feel like their existence is a bigger problem.

I'd like to understand the use case that GPU folks have for this better. For an un-escaped alloca, why are there addrspacecasts at all? Why would they be useful? Why couldn't you lower them with arbitrarily fast loads and stores due to being unescaped allocas?

lib/Transforms/Scalar/SROA.cpp
1825	The whole point of addrspacecast was to avoid the need for target specific checks... I think we just shouldn't transform GEPs around them if that isn't legal everywhere.

Regarding your question on why these addrspacecasts even exist, the short answer is that emitting addrspacecasts after alloca eases NVPTX's address space inference, which lowers accesses to special address spaces into fast loads/stores.

The long answer requires some background on CUDA and how the NVPTX backend works. I tried to make it brief. Let me know if it makes sense.

First of all, address spaces are not part of CUDA's type system. All pointers in CUDA source code are generic. Only when declaring a variable can the programmer annotate the address space that the variable lives in. For example, __shared__ int *a = xxx; means that a, the pointer itself, lives in the shared address space, not that a points to an int that lives in the shared space.

So, LLVM's middle-end and/or the NVPTX backend bear the burden of optimizing the memory accesses to variables in special address spaces. Given an address-taken variable annotated with a special address space, the clang front-end simply casts the resultant address to generic (i.e. addrspace(0)) right after the declaration. The optimizer needs to eliminate unnecessary addrspacecasts and emit fast loads/stores.

Currently, the pass that performs this address space optimization in NVPTX is NVPTXFavorNonGenericAddrSpaces. It works by pushing addrspacecast (from special to generic) towards loads and stores as far as possible. After that, the codegen can simply lower load/store <ty>, <ty> addrspace(x)* to fast loads/stores in machine code. For example, if a function reads from a shared variable, the IR clang emits without optimization looks like

@a = addrspace(3) global float
define ... {
  %1 = addrspacecast float addrspace(3)* @a to float* ; cast from special to generic
  %2 = load float, float* %1
  ...
}

then NVPTXFavorNonGenericAddrSpaces optimizes it to

@a = addrspace(3) global float
define ... {
  %2 = load float, float addrspace(3)* @a
  ...
}

and finally NVPTX's codegen lowers this "load from shared" to a ld.shared.f32 instruction.

This pass works well for GlobalVariables whose address spaces are explicitly annotated. However, alloca always returns addrspace(0) pointers, so the clang front-end doesn't (need to) emit any explicit addrspacecast from local to generic in the initial IR. This creates trouble for the address space inference, because NVPTXFavorNonGenericAddrSpaces has no addrspacecasts to push for accesses to the local address space.

To address this issue, we plan to run another pass called NVPTXLowerAlloca (D10483) that emits two addrspacecasts after an alloca, one casting generic to local and the other casting local back to generic. For example,

define ...() {
  %1 = alloca float
  store float 0, float* %1
}

becomes

define ...() {
  %1 = alloca float
  %2 = addrspacecast float* %1 to float addrspace(4)*
  %3 = addrspacecast float addrspace(4)* %2 to float*
  store float 0, float* %3
}

Then, NVPTXFavorNonGenericAddrSpaces pushes the second addrspacecast as usual towards the store.

define ...() {
  %1 = alloca float
  %2 = addrspacecast float* %1 to float addrspace(4)*
  store float 0, float addrspace(4)* %2
}

This is why NVPTX emits addrspacecasts after alloca at some stage.
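
The effect of the inserted cast pair can be sketched with a small C++ toy model. This is only an illustration under stated assumptions, not the NVPTX implementation: `PtrValue` and `accessSpace` are hypothetical names, and the value `4` for the local space is copied from the IR example above rather than taken from the backend.

```cpp
// Toy model (hypothetical, not the NVPTX backend): a pointer value records its
// address space and, if it was produced by an addrspacecast, the cast source.
constexpr unsigned Generic = 0;
constexpr unsigned Local = 4; // numbering copied from the IR example above

struct PtrValue {
  unsigned AddrSpace;
  const PtrValue *CastFrom; // nullptr if not produced by an addrspacecast
};

// Address space a load/store through P can use once a special-to-generic
// cast has been pushed into the access; otherwise P's own address space.
constexpr unsigned accessSpace(const PtrValue &P) {
  return (P.AddrSpace == Generic && P.CastFrom != nullptr &&
          P.CastFrom->AddrSpace != Generic)
             ? P.CastFrom->AddrSpace
             : P.AddrSpace;
}

constexpr PtrValue Alloca{Generic, nullptr};     // %1 = alloca float
constexpr PtrValue ToLocal{Local, &Alloca};      // %2 = addrspacecast %1 to addrspace(4)
constexpr PtrValue ToGeneric{Generic, &ToLocal}; // %3 = addrspacecast %2 to generic

// Without the cast pair there is nothing to push: the access stays generic.
static_assert(accessSpace(Alloca) == Generic, "bare alloca stays generic");
// With the cast pair, the access through %3 can be rewritten to use %2.
static_assert(accessSpace(ToGeneric) == Local, "cast pair exposes the local space");
```

The two assertions correspond to the "before" and "after" IR: the bare alloca gives the inference nothing to work with, while the cast pair exposes the special address space.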

If having SROA handle addrspacecasts of allocas is too alarming, I think the easiest way to fix the issue on our side is to run SROA between NVPTXLowerKernelArgs (another address space inference pass, which must run before SROA) and NVPTXLowerAlloca. It will work in the short term. However, Justin Holewinski and I have some concerns about how much we depend on correctly ordering these address-space inference passes, and think we should merge all of them into one big pass. If we want that merging to happen, we won't be able to run SROA in the middle. Still, workarounds exist, such as merging most of the address space inference and only running a small part before SROA.

lib/Transforms/Scalar/SROA.cpp
1825	Thanks for pointing this out, Sanjoy. I understand your concerns. Out of curiosity, on which existing targets is the transformation from `gep(addrspacecast(X), Y)` to `addrspacecast(gep X, Y)` illegal? I'm asking so that I can keep a concrete example in mind when I later write target-independent optimizations on addrspacecast.

The last time this came up, I was able to get instcombine to eliminate trivial addrspacecasts of alloca, which worked just as well. I just never got around to actually testing / posting the patch. The main piece was just removing an outdated check for the address space mismatch from before addrspacecast existed.

sanjoy added inline comments. Jun 17 2015, 3:36 PM

lib/Transforms/Scalar/SROA.cpp
1825	In practice, I think `gep(addrspacecast(X), Y)` and `addrspacecast(gep X, Y)` should compute the same result on all existing targets. However, this caps how complex an `addrspacecast` implementation can be. For instance, targets won't be able to implement `addrspacecast` as addrspacecast(P) { return abs(P) }, since with this change `gep(addrspacecast(-1), 2)` != `addrspacecast(gep -1, 2)`. I'm okay with moving ahead with this gep transform as long as we change the langref to reflect that `addrspacecast` instructions cannot be arbitrarily complex value modifications (so that an `abs` implementation for `addrspacecast` can be ruled out).
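
The `abs` counterexample can be checked with a few lines of C++. This is a toy model under loud assumptions: pointers are modeled as signed integers, a GEP is plain addition of a byte offset, and `gep`, `absCast`, and `rebaseCast` are hypothetical names that do not correspond to any target's real addrspacecast lowering.

```cpp
// Toy model (hypothetical): pointers are signed integers, a GEP adds an offset.
constexpr long long gep(long long P, long long Off) { return P + Off; }

// The "arbitrarily complex" cast from the comment: abs of the pointer value.
constexpr long long absCast(long long P) { return P < 0 ? -P : P; }

// A cast that only rebases by a constant, for contrast.
constexpr long long rebaseCast(long long P) { return P + 4096; }

// With abs, casting before or after the GEP gives different results ...
static_assert(gep(absCast(-1), 2) == 3, "gep(addrspacecast(-1), 2)");
static_assert(absCast(gep(-1, 2)) == 1, "addrspacecast(gep -1, 2)");

// ... while a constant rebase commutes with the GEP.
static_assert(gep(rebaseCast(-1), 2) == rebaseCast(gep(-1, 2)),
              "rebase commutes with gep");
```

In this model, moving a GEP across the cast is only sound when the cast commutes with offset addition, which is the restriction the comment proposes recording in the langref.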

chandlerc removed a reviewer: chandlerc. Apr 6 2016, 10:52 PM

arsenm mentioned this in D31924: SROA: Allow eliminating addrspacecasted allocas. Apr 10 2017, 8:52 PM

arsenm resigned from this revision. Jun 3 2019, 11:14 AM

Herald added a subscriber: wdng. Jun 3 2019, 11:14 AM

Revision Contents

Path                                    Size
include/llvm/Analysis/PtrUseVisitor.h   4 lines
lib/Transforms/Scalar/SROA.cpp          25 lines
test/Transforms/SROA/basictest.ll       37 lines

Diff 27778

include/llvm/Analysis/PtrUseVisitor.h

@@ ... @@ void visitStoreInst(StoreInst &SI) {
     if (SI.getValueOperand() == U->get())
       PI.setEscaped(&SI);
   }

   void visitBitCastInst(BitCastInst &BC) {
     enqueueUsers(BC);
   }

+  void visitAddrSpaceCastInst(AddrSpaceCastInst &ASC) {
+    enqueueUsers(ASC);
+  }
+
   void visitPtrToIntInst(PtrToIntInst &I) {
     PI.setEscaped(&I);
   }

   void visitGetElementPtrInst(GetElementPtrInst &GEPI) {
     if (GEPI.use_empty())
       return;

lib/Transforms/Scalar/SROA.cpp

@@ ... @@ private:

   void visitBitCastInst(BitCastInst &BC) {
     if (BC.use_empty())
       return markAsDead(BC);

     return Base::visitBitCastInst(BC);
   }

+  void visitAddrSpaceCastInst(AddrSpaceCastInst &ASC) {
+    if (ASC.use_empty())
+      return markAsDead(ASC);
+
+    return Base::visitAddrSpaceCastInst(ASC);
+  }
+
   void visitGetElementPtrInst(GetElementPtrInst &GEPI) {
     if (GEPI.use_empty())
       return markAsDead(GEPI);

     if (SROAStrictInbounds && GEPI.isInBounds()) {
       // FIXME: This is a manually un-factored variant of the basic code inside
       // of GEPs with checking of the inbounds invariant specified in the
       // langref in a very strict sense. If we ever want to enable
@@ ... @@ do {
         Size = std::max(Size, DL.getTypeStoreSize(Op->getType()));
         continue;
       }

       if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(I)) {
         if (!GEP->hasAllZeroIndices())
           return GEP;
       } else if (!isa<BitCastInst>(I) && !isa<PHINode>(I) &&
-                 !isa<SelectInst>(I)) {
+                 !isa<SelectInst>(I) && !isa<AddrSpaceCastInst>(I)) {
         return I;
       }

       for (User *U : I->users())
         if (Visited.insert(cast<Instruction>(U)).second)
           Uses.push_back(std::make_pair(I, cast<Instruction>(U)));
     } while (!Uses.empty());

@@ ... @@ isInstInList(Instruction *I,
     do {
       if (Ptr == &AI)
         return true;

       if (BitCastInst *BCI = dyn_cast<BitCastInst>(Ptr))
         Ptr = BCI->getOperand(0);
       else if (GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(Ptr))
         Ptr = GEPI->getPointerOperand();
+      else if (AddrSpaceCastInst *ASC = dyn_cast<AddrSpaceCastInst>(Ptr))
+        Ptr = ASC->getOperand(0);
       else
         return false;

     } while (Visited.insert(Ptr).second);

     return false;
   }

@@ ... @@ do {

     // Stash this pointer if we've found an i8*.
     if (Ptr->getType()->isIntegerTy(8)) {
       Int8Ptr = Ptr;
       Int8PtrOffset = Offset;
     }

     // Peel off a layer of the pointer and update the offset appropriately.
-    if (Operator::getOpcode(Ptr) == Instruction::BitCast) {
+    unsigned Opc = Operator::getOpcode(Ptr);
+    if (Opc == Instruction::BitCast || Opc == Instruction::AddrSpaceCast) {
       Ptr = cast<Operator>(Ptr)->getOperand(0);
     } else if (GlobalAlias *GA = dyn_cast<GlobalAlias>(Ptr)) {
       if (GA->mayBeOverridden())
         break;
       Ptr = GA->getAliasee();
     } else {
       break;
     }
@@ ... @@ OffsetPtr = Int8PtrOffset == 0
                     ? Int8Ptr
                     : IRB.CreateInBoundsGEP(IRB.getInt8Ty(), Int8Ptr,
                                             IRB.getInt(Int8PtrOffset),
                                             NamePrefix + "sroa_raw_idx");
   }
   Ptr = OffsetPtr;

   // On the off chance we were targeting i8*, guard the bitcast here.
-  if (Ptr->getType() != PointerTy)
-    Ptr = IRB.CreateBitCast(Ptr, PointerTy, NamePrefix + "sroa_cast");
+  if (Ptr->getType() != PointerTy) {
+    Ptr = IRB.CreatePointerBitCastOrAddrSpaceCast(Ptr, PointerTy,
+                                                  NamePrefix + "sroa_cast");
+  }

   return Ptr;
 }

 /// \brief Compute the adjusted alignment for a load or store from an offset.
 static unsigned getAdjustedAlignment(Instruction *I, uint64_t Offset,
                                      const DataLayout &DL) {
   unsigned Alignment;
@@ ... @@ bool visitStoreInst(StoreInst &SI) {
     return true;
   }

   bool visitBitCastInst(BitCastInst &BC) {
     enqueueUsers(BC);
     return false;
   }

+  bool visitAddrSpaceCastInst(AddrSpaceCastInst &ASC) {
+    enqueueUsers(ASC);
+    return false;
+  }
+
   bool visitGetElementPtrInst(GetElementPtrInst &GEPI) {
     enqueueUsers(GEPI);
     return false;
   }

   bool visitPHINode(PHINode &PN) {
     enqueueUsers(PN);
     return false;

test/Transforms/SROA/basictest.ll

@@ ... @@ entry:
   store i64 %X, i64* %B
   br label %L2

 L2:
   %Z = load i64, i64* %B
   ret i64 %Z
 }

+define i64 @test2_addrspacecast(i64 %X) {
+; CHECK-LABEL: @test2_addrspacecast(
+; CHECK-NOT: alloca
+; CHECK: ret i64 %X
+
+entry:
+  %A = alloca [8 x i8]
+  %B = addrspacecast [8 x i8]* %A to i64 addrspace(1)*
+  store i64 %X, i64 addrspace(1)* %B
+  br label %L2
+
+L2:
+  %Z = load i64, i64 addrspace(1)* %B
+  ret i64 %Z
+}
+
 define void @test3(i8* %dst, i8* %src) {
 ; CHECK-LABEL: @test3(

 entry:
   %a = alloca [300 x i8]
 ; CHECK-NOT: alloca
 ; CHECK: %[[test3_a1:.*]] = alloca [42 x i8]
 ; CHECK-NEXT: %[[test3_a2:.*]] = alloca [99 x i8]
@@ ... @@ entry:
   %cast1 = bitcast %opaque* %x to i8*
   %cast2 = bitcast { i64, i8* }* %a to i8*
   call void @llvm.memcpy.p0i8.p0i8.i32(i8* %cast2, i8* %cast1, i32 16, i32 1, i1 false)
   %gep = getelementptr inbounds { i64, i8* }, { i64, i8* }* %a, i32 0, i32 0
   %val = load i64, i64* %gep
   ret i32 undef
 }

+declare void @llvm.memcpy.p0i8.p1i8.i32(i8* nocapture, i8 addrspace(1)* nocapture, i32, i32, i1) nounwind
+
+define i32 @test19_addrspacecast(%opaque* %x) {
+; This input will cause us to try to compute a natural GEP when rewriting
+; pointers in such a way that we try to GEP through the opaque type. Previously,
+; a check for an unsized type was missing and this crashed. Ensure it behaves
+; reasonably now.
+; CHECK-LABEL: @test19_addrspacecast(
+; CHECK-NOT: alloca
+; CHECK: ret i32 undef
+
+entry:
+  %a = alloca { i64, i8* }
+  %cast1 = addrspacecast %opaque* %x to i8 addrspace(1)*
+  %cast2 = bitcast { i64, i8* }* %a to i8*
+  call void @llvm.memcpy.p0i8.p1i8.i32(i8* %cast2, i8 addrspace(1)* %cast1, i32 16, i32 1, i1 false)
+  %gep = getelementptr inbounds { i64, i8* }, { i64, i8* }* %a, i32 0, i32 0
+  %val = load i64, i64* %gep
+  ret i32 undef
+}
+
 define i32 @test20() {
 ; Ensure we can track negative offsets (before the beginning of the alloca) and
 ; negative relative offsets from offsets starting past the end of the alloca.
 ; CHECK-LABEL: @test20(
 ; CHECK-NOT: alloca
 ; CHECK: %[[sum1:.*]] = add i32 1, 2
 ; CHECK: %[[sum2:.*]] = add i32 %[[sum1]], 3
 ; CHECK: ret i32 %[[sum2]]