This is an archive of the discontinued LLVM Phabricator instance.

Differential D91928

[nvptx] Skip alloca for read-only byval arguments.
Abandoned · Public

Authored by hliao on Nov 21 2020, 11:51 PM.

Details

Reviewers

tra
jlebar

Summary

Once a byval argument is marked readonly, there are no stores into it, so it is safe to skip generating the alloca; this matches the read-only property of the input parameter space. Cast the generic pointer to the parameter space and back so that the address space inference pass can infer the correct parameter space.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	360 ms	linux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp

Event Timeline

hliao created this revision. Nov 21 2020, 11:51 PM

Herald added a project: Restricted Project. Nov 21 2020, 11:51 PM

Herald added subscribers: llvm-commits, hiraditya, jholewinski.

hliao requested review of this revision. Nov 21 2020, 11:51 PM

hliao added a reviewer: jlebar. Nov 21 2020, 11:52 PM

It turns out that the simplest way is to skip generating the alloca once the byval argument is readonly. Since readonly is only attributed when there's no write to that argument, it's safe to just cast the pointer to the parameter space if it has readonly. Basically, the argument lowering pass does something similar to D91590 but applies it in the backend instead. I verified that, for that simple CUDA test code, it generates the same SASS.

This looks really simple, which is awesome. I am enthusiastic. But I am worried it may not be correct.

AIUI params are special in that they *must* be read from the param address space. It is illegal to do a generic load of a param.

So this change is correct only if we can guarantee that address space inference will infer the specific address space for all uses of the pointer.

But address space inference is not guaranteed. For example, you could select on two pointers of two different address spaces. So long as you only ever read from these pointers, the arg can still be marked as ReadOnly. But with this patch, we'd end up doing a generic load from the param space, which would be illegal.

Take it all with a grain of salt since I've also been out of the game for a while.

llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp
166	nit: s/could/can/

Harbormaster completed remote builds in B79714: Diff 306894. Nov 22 2020, 12:32 AM

In D91928#2409971, @jlebar wrote:

This looks really simple, which is awesome. I am enthusiastic. But I am worried it may not be correct.

AIUI params are special in that they *must* be read from the param address space. It is illegal to do a generic load of a param.

So this change is correct only if we can guarantee that address space inference will infer the specific address space for all uses of the pointer.

But address space inference is not guaranteed. For example, you could select on two pointers of two different address spaces. So long as you only ever read from these pointers, the arg can still be marked as ReadOnly. But with this patch, we'd end up doing a generic load from the param space, which would be illegal.

Take it all with a grain of salt since I've also been out of the game for a while.

readonly is marked in the middle-end (function and argument attribute deduction), well before the backend. That deduction looks through all relevant users including PHI, addrspacecast and calls. I don't believe there's any exception to prove that deduction wrong.
The address space inference here only refers to the one in the backend directly after this argument lowering pass. That helps translate the argument load into the correct one using the parameter space; it plays no part in the aforementioned argument attribute deduction.

I don't believe there's any exception to prove deduction [of the readonly attribute] wrong.

Understood.

The address space inference here only refers to the one in the backend directly after this argument lowering pass.

Also understood.

This isn't speaking to my concern, though.

Suppose we have

__global__ void foo(int x, const int* y, int* out, bool flag) {
  // ptr may point into the param space (&x) or wherever y points.
  const int* ptr = flag ? &x : y;
  *out = *ptr;
}

In this case we can say with confidence that x is readonly.

But address space inference cannot infer the address space of ptr (how could it?). Therefore we will do a generic load, which is wrong.
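The concern can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual LLVM address space inference pass: the `AddrSpace` enum and the `joinSpaces` and `loadIsLegal` helpers are hypothetical names introduced here. It assumes a pointer produced by a select keeps a specific address space only when both operands agree on it.

```cpp
#include <cassert>

// Hypothetical, simplified set of address spaces (not LLVM's numbering).
enum class AddrSpace { Generic, Global, Param };

// Assumed join rule: a select/PHI result keeps a specific space only if
// both incoming pointers agree; otherwise it falls back to generic.
AddrSpace joinSpaces(AddrSpace a, AddrSpace b) {
  return a == b ? a : AddrSpace::Generic;
}

// Per the discussion, a pointer that may refer to a kernel parameter must
// be loaded through the param space; a generic load of it is illegal.
bool loadIsLegal(bool mayPointToParam, AddrSpace inferred) {
  return !mayPointToParam || inferred == AddrSpace::Param;
}
```

In this model, `flag ? &x : y` joins a param pointer with a pointer from another space, so the result is generic and the subsequent load is rejected, even though `x` is never written.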

In D91928#2410164, @jlebar wrote:
I don't believe there's any exception to prove deduction [of the readonly attribute] wrong.

Understood.

The address space inference here only refers to the one in the backend directly after this argument lowering pass.

Also understood.

This isn't speaking to my concern, though.

Suppose we have
__global__ void foo(int x, const int* y, int* out, bool flag) {
  // ptr may point into the param space (&x) or wherever y points.
  const int* ptr = flag ? &x : y;
  *out = *ptr;
}
In this case we can say with confidence that x is readonly.

But address space inference cannot infer the address space of ptr (how could it?). Therefore we will do a generic load, which is wrong.

I see your point. PTX doesn't state that generic addressing can be performed on the parameter space. But that case could be excluded with an extra check on how the parameter space pointer is used. In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

I think an allowlist might be more appropriate than a denylist. Rather than "anything other than PHI and SELECT", could it be "if it's only transitively used by gep and load, we're good"?

I am not 100% sure even that works, though. The real problem is that this pass is trying to reason about what the addrspace inference pass is capable of. We can only do the transformation here if we're positive that addrspace inference will eliminate all generic loads from the arg. That's a layering violation and ultimately is fragile.
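A minimal sketch of what such an allowlist check could look like, written against a hypothetical mini-IR rather than the real LLVM classes. `Op`, `Node`, and `onlyUsedByGepAndLoad` are illustrative names, not part of the patch. It assumes loads end the walk because the loaded value is data rather than a pointer derived from the argument.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for IR values; not LLVM's Value/Instruction.
enum class Op { Arg, GEP, Load, Select, Phi, Store, Call };

struct Node {
  Op op;
  std::vector<const Node *> users;  // instructions that use this value
};

// Returns true only if every transitive user of `root` is a GEP or a load.
bool onlyUsedByGepAndLoad(const Node &root) {
  std::vector<const Node *> worklist(root.users.begin(), root.users.end());
  std::unordered_set<const Node *> seen;
  while (!worklist.empty()) {
    const Node *n = worklist.back();
    worklist.pop_back();
    if (!seen.insert(n).second)
      continue;  // already visited
    switch (n->op) {
    case Op::Load:
      break;  // allowed; the loaded value is not followed further
    case Op::GEP:
      // Allowed; the derived pointer's users must be checked too.
      worklist.insert(worklist.end(), n->users.begin(), n->users.end());
      break;
    default:
      return false;  // select, PHI, store, call, etc. are not on the allowlist
    }
  }
  return true;
}
```

Even in this form, the check only approximates what the later inference pass can handle, which is the layering concern raised above.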

In D91928#2410283, @jlebar wrote:

In case it's not used in PHI or SELECT and cannot ensure the result is also a pointer to the parameter space, we could skip alloca insertion.

I think an allowlist might be more appropriate than a denylist. Rather than "anything other than PHI and SELECT", could it be "if it's only transitively used by gep and load, we're good"?

I am not 100% sure even that works, though. The real problem is that this pass is trying to reason about what the addrspace inference pass is capable of. We can only do the transformation here if we're positive that addrspace inference will eliminate all generic loads from the arg. That's a layering violation and ultimately is fragile.

Yeah, it seems the other approach is more appropriate: place the alloca in the frontend and explicitly copy from the parameter space to the private space.

In D91928#2410288, @hliao wrote:

Yeah, it seems the other approach is more appropriate: place the alloca in the frontend and explicitly copy from the parameter space to the private space.

+1. Inserting alloca+copy early would be beneficial in general -- it exposes the copy to more optimization passes, which should be able to see through it in some cases.
Adding readonly on the original argument would probably be good, too.

Revision Contents

	Path	Size
	llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp	19 lines
	llvm/test/CodeGen/NVPTX/lower-args.ll	21 lines

Diff 306894

llvm/lib/Target/NVPTX/NVPTXLowerArgs.cpp

 void NVPTXLowerArgs::handleByValParam(Argument *Arg) {
   Function *Func = Arg->getParent();
   Instruction *FirstInst = &(Func->getEntryBlock().front());
   PointerType *PType = dyn_cast<PointerType>(Arg->getType());

   assert(PType && "Expecting pointer type in handleByValParam");

   Type *StructType = PType->getElementType();

+  if (Arg->onlyReadsMemory()) {
+    // Once there's no store to that byval argument, there's no need to
+    // generate an `alloca`. Cast it into the parameter space and cast it back
+    // to the generic space so that the address space inference could infer the
+    // correct address space.
+    Value *ArgInParam = new AddrSpaceCastInst(
+        Arg, PointerType::get(StructType, ADDRESS_SPACE_PARAM), Arg->getName(),
+        FirstInst);
+    Value *ArgInGeneric = new AddrSpaceCastInst(
+        ArgInParam, PType, Arg->getName() + ".addrspacecast", FirstInst);
+    for (auto &U : Arg->uses()) {
+      if (U.getUser() == ArgInParam)
+        continue;
+      U.getUser()->setOperand(U.getOperandNo(), ArgInGeneric);
+    }
+    return;
+  }
+
   const DataLayout &DL = Func->getParent()->getDataLayout();
   unsigned AS = DL.getAllocaAddrSpace();
   AllocaInst *AllocA = new AllocaInst(StructType, AS, Arg->getName(), FirstInst);
   // Set the alignment to alignment of the byval parameter. This is because,
   // later load/stores assume that alignment, and we are going to replace
   // the use of the byval parameter with this alloca instruction.
   AllocA->setAlignment(Func->getParamAlign(Arg->getArgNo())
                            .getValueOr(DL.getPrefTypeAlign(StructType)));

llvm/test/CodeGen/NVPTX/lower-args.ll

 ; RUN: opt < %s -S -nvptx-lower-args | FileCheck %s --check-prefix IR
 ; RUN: llc < %s -mcpu=sm_20 | FileCheck %s --check-prefix PTX

 target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
 target triple = "nvptx64-nvidia-cuda"

 %class.outer = type <{ %class.inner, i32, [4 x i8] }>
 %class.inner = type { i32, i32 }

 ; Check that nvptx-lower-args preserves arg alignment
 define void @load_alignment(%class.outer* nocapture readonly byval(%class.outer) align 8 %arg) {
 entry:
+; IR-LABEL: @load_alignment
+; IR: addrspacecast %class.outer* %arg to %class.outer addrspace(101)*
+; IR-NEXT: addrspacecast %class.outer addrspace(101)* %arg1 to %class.outer*
+; PTX: ld.param.u64
+; PTX-NOT: ld.param.u8
+  %arg.idx = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 0, i32 0
+  %arg.idx.val = load i32, i32* %arg.idx, align 8
+  %arg.idx1 = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 0, i32 1
+  %arg.idx1.val = load i32, i32* %arg.idx1, align 8
+  %arg.idx2 = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 1
+  %arg.idx2.val = load i32, i32* %arg.idx2, align 8
+  %arg.idx.val.val = load i32, i32* %arg.idx.val, align 4
+  %add.i = add nsw i32 %arg.idx.val.val, %arg.idx2.val
+  store i32 %add.i, i32* %arg.idx1.val, align 4
+  ret void
+}
+
+; Check that nvptx-lower-args preserves arg alignment
+define void @load_alignment_without_readonly(%class.outer* nocapture byval(%class.outer) align 8 %arg) {
+entry:
+; IR-LABEL: @load_alignment_without_readonly
 ; IR: load %class.outer, %class.outer addrspace(101)*
 ; IR-SAME: align 8
 ; PTX: ld.param.u64
 ; PTX-NOT: ld.param.u8
   %arg.idx = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 0, i32 0
   %arg.idx.val = load i32, i32* %arg.idx, align 8
   %arg.idx1 = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 0, i32 1
   %arg.idx1.val = load i32, i32* %arg.idx1, align 8
   %arg.idx2 = getelementptr %class.outer, %class.outer* %arg, i64 0, i32 1
   %arg.idx2.val = load i32, i32* %arg.idx2, align 8
   %arg.idx.val.val = load i32, i32* %arg.idx.val, align 4
   %add.i = add nsw i32 %arg.idx.val.val, %arg.idx2.val
   store i32 %add.i, i32* %arg.idx1.val, align 4
   ret void
 }