This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Align kernel launch args correctly when the LLVM type's alignment is different from the clang type's alignment.
Closed · Public

Authored by jlebar on Jul 27 2016, 10:46 AM.

Details

Reviewers

Commits

rGe56360a2cd7c: [CUDA] Align kernel launch args correctly when the LLVM type's alignment is…
rC276927: [CUDA] Align kernel launch args correctly when the LLVM type's alignment is…
rL276927: [CUDA] Align kernel launch args correctly when the LLVM type's alignment is…

Summary

Before this patch, we computed the offsets in memory of args passed to
GPU kernel functions by throwing all of the args into an LLVM struct.

clang emits packed llvm structs basically whenever it feels like it, and
packed structs have alignment 1. So we cannot rely on the llvm type's
alignment matching the C++ type's alignment.

This patch fixes our codegen so we always respect the clang types'
alignments.
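
To make the mismatch concrete, here is a minimal sketch (not the clang code) that lays out the three arguments of the test kernel `kernel(char a, S s, int *b)` twice: once using the alignment of each LLVM type, where the packed `%struct.S` has alignment 1, and once using the alignment of each C++ type. `ArgInfo`, `alignTo`, and `computeOffsets` are hypothetical names, and the sizes assume a 64-bit target as in the test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one kernel argument; not a clang type.
struct ArgInfo {
  std::size_t Width;      // size of the argument in bytes
  std::size_t ClangAlign; // alignment of the C++ type
  std::size_t LLVMAlign;  // alignment of the LLVM type (1 if packed)
};

// Round Offset up to the next multiple of Align.
std::size_t alignTo(std::size_t Offset, std::size_t Align) {
  assert(Align > 0 && "alignment must be positive");
  return (Offset + Align - 1) / Align * Align;
}

// Place the args one after another, aligning each to the chosen alignment.
std::vector<std::size_t> computeOffsets(const std::vector<ArgInfo> &Args,
                                        bool UseClangAlign) {
  std::vector<std::size_t> Offsets;
  std::size_t Offset = 0;
  for (const ArgInfo &A : Args) {
    Offset = alignTo(Offset, UseClangAlign ? A.ClangAlign : A.LLVMAlign);
    Offsets.push_back(Offset);
    Offset += A.Width;
  }
  return Offsets;
}
```

Called with `{{1, 1, 1}, {16, 8, 1}, {8, 8, 8}}` (char, S, int*), the LLVM alignments place `S` at offset 1, while the clang alignments place it at offset 8, which is what the test in this revision expects.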

Event Timeline

jlebar updated this revision to Diff 65771. Jul 27 2016, 10:46 AM

jlebar retitled this revision to [CUDA] Align kernel launch args correctly when the LLVM type's alignment is different from the clang type's alignment.

jlebar updated this object.

jlebar added a reviewer: rnk.

jlebar added subscribers: tra, cfe-commits.

rnk added inline comments. Jul 27 2016, 11:05 AM

test/CodeGenCUDA/kernel-args-alignment.cu
Lines 2–3 (rnk): Typically clang doesn't need a registered backend for a target to generate IR for that target. It "knows" a whole bunch of stuff about all target calling conventions and data layout. Unless CUDA goes out of its way to query LLVM backend information, we shouldn't need these REQUIRES lines. You should probably test this theory, though, by configuring an ARM-only clang and running the tests. :)

Remove REQUIRES lines.

test/CodeGenCUDA/kernel-args-alignment.cu
Lines 2–3 (jlebar): Yeah, I don't think we actually need this, as we have a bunch of other codegen tests that don't have these REQUIRES lines.

lgtm

This revision is now accepted and ready to land. Jul 27 2016, 3:31 PM

Closed by commit rL276927: [CUDA] Align kernel launch args correctly when the LLVM type's alignment is… (authored by jlebar). Jul 27 2016, 3:44 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                         Size
lib/CodeGen/CGCUDABuiltin.cpp                6 lines
lib/CodeGen/CGCUDANV.cpp                     41 lines
test/CodeGenCUDA/kernel-args-alignment.cu    36 lines

Diff 65800

lib/CodeGen/CGCUDABuiltin.cpp

@@ ... @@ CodeGenFunction::EmitCUDADevicePrintfCallExpr(const CallExpr *E,
   llvm::Value *BufferPtr;
   if (Args.size() <= 1) {
     // If there are no args, pass a null pointer to vprintf.
     BufferPtr = llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(Ctx));
   } else {
     llvm::SmallVector<llvm::Type *, 8> ArgTypes;
     for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I)
       ArgTypes.push_back(Args[I].RV.getScalarVal()->getType());

+    // Using llvm::StructType is correct only because printf doesn't accept
+    // aggregates. If we had to handle aggregates here, we'd have to manually
+    // compute the offsets within the alloca -- we wouldn't be able to assume
+    // that the alignment of the llvm type was the same as the alignment of the
+    // clang type.
     llvm::Type *AllocaTy = llvm::StructType::create(ArgTypes, "printf_args");
     llvm::Value *Alloca = CreateTempAlloca(AllocaTy);

     for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {
       llvm::Value *P = Builder.CreateStructGEP(AllocaTy, Alloca, I - 1);
       llvm::Value *Arg = Args[I].RV.getScalarVal();
       Builder.CreateAlignedStore(Arg, P, DL.getPrefTypeAlignment(Arg->getType()));
     }
     BufferPtr = Builder.CreatePointerCast(Alloca, llvm::Type::getInt8PtrTy(Ctx));
   }

   // Invoke vprintf and return.
   llvm::Function* VprintfFunc = GetVprintfDeclaration(CGM.getModule());
   return RValue::get(
       Builder.CreateCall(VprintfFunc, {Args[0].RV.getScalarVal(), BufferPtr}));
 }
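
The comment added in this file relies on scalar types having the same alignment in LLVM as in clang. As a loose analogy in plain C++ (not LLVM IR), the sketch below checks that, for a struct of scalars only, the compiler's own layout matches offsets computed by hand from each member's size and alignment. `PrintfArgs` and `alignUp` are hypothetical names, and the sketch assumes a typical 64-bit target.

```cpp
#include <cstddef>

// Hypothetical stand-in for a "printf_args" buffer holding the arguments of
// printf("%d %f %lld", ...): scalars only, no aggregates.
struct PrintfArgs {
  int I;
  double D;
  long long L;
};

// Round Offset up to the next multiple of Align.
constexpr std::size_t alignUp(std::size_t Offset, std::size_t Align) {
  return (Offset + Align - 1) / Align * Align;
}

// Offsets computed by hand from each scalar's own size and alignment.
constexpr std::size_t OffI = 0;
constexpr std::size_t OffD = alignUp(OffI + sizeof(int), alignof(double));
constexpr std::size_t OffL = alignUp(OffD + sizeof(double), alignof(long long));

// For scalar members the struct layout agrees with the manual computation.
static_assert(offsetof(PrintfArgs, I) == OffI, "int offset mismatch");
static_assert(offsetof(PrintfArgs, D) == OffD, "double offset mismatch");
static_assert(offsetof(PrintfArgs, L) == OffL, "long long offset mismatch");
```

Once an aggregate such as a packed struct is involved, this agreement can no longer be assumed, which is the case the kernel launch path below has to handle.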

lib/CodeGen/CGCUDANV.cpp

 void CGNVCUDARuntime::emitDeviceStub(CodeGenFunction &CGF,
                                      FunctionArgList &Args) {
   EmittedKernels.push_back(CGF.CurFn);
   emitDeviceStubBody(CGF, Args);
 }

 void CGNVCUDARuntime::emitDeviceStubBody(CodeGenFunction &CGF,
                                          FunctionArgList &Args) {
-  // Build the argument value list and the argument stack struct type.
-  SmallVector<llvm::Value *, 16> ArgValues;
-  std::vector<llvm::Type *> ArgTypes;
-  for (FunctionArgList::const_iterator I = Args.begin(), E = Args.end();
-       I != E; ++I) {
-    llvm::Value *V = CGF.GetAddrOfLocalVar(*I).getPointer();
-    ArgValues.push_back(V);
-    assert(isa<llvm::PointerType>(V->getType()) && "Arg type not PointerType");
-    ArgTypes.push_back(cast<llvm::PointerType>(V->getType())->getElementType());
-  }
-  llvm::StructType *ArgStackTy = llvm::StructType::get(Context, ArgTypes);
-
-  llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
-
-  // Emit the calls to cudaSetupArgument
+  // Emit a call to cudaSetupArgument for each arg in Args.
   llvm::Constant *cudaSetupArgFn = getSetupArgumentFn();
-  for (unsigned I = 0, E = Args.size(); I != E; ++I) {
-    llvm::Value *Args[3];
-    llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");
-    Args[0] = CGF.Builder.CreatePointerCast(ArgValues[I], VoidPtrTy);
-    Args[1] = CGF.Builder.CreateIntCast(
-        llvm::ConstantExpr::getSizeOf(ArgTypes[I]),
-        SizeTy, false);
-    Args[2] = CGF.Builder.CreateIntCast(
-        llvm::ConstantExpr::getOffsetOf(ArgStackTy, I),
-        SizeTy, false);
+  llvm::BasicBlock *EndBlock = CGF.createBasicBlock("setup.end");
+  CharUnits Offset = CharUnits::Zero();
+  for (const VarDecl *A : Args) {
+    CharUnits TyWidth, TyAlign;
+    std::tie(TyWidth, TyAlign) =
+        CGM.getContext().getTypeInfoInChars(A->getType());
+    Offset = Offset.alignTo(TyAlign);
+    llvm::Value *Args[] = {
+        CGF.Builder.CreatePointerCast(CGF.GetAddrOfLocalVar(A).getPointer(),
+                                      VoidPtrTy),
+        llvm::ConstantInt::get(SizeTy, TyWidth.getQuantity()),
+        llvm::ConstantInt::get(SizeTy, Offset.getQuantity()),
+    };
     llvm::CallSite CS = CGF.EmitRuntimeCallOrInvoke(cudaSetupArgFn, Args);
     llvm::Constant *Zero = llvm::ConstantInt::get(IntTy, 0);
     llvm::Value *CSZero = CGF.Builder.CreateICmpEQ(CS.getInstruction(), Zero);
+    llvm::BasicBlock *NextBlock = CGF.createBasicBlock("setup.next");
     CGF.Builder.CreateCondBr(CSZero, NextBlock, EndBlock);
     CGF.EmitBlock(NextBlock);
+    Offset += TyWidth;
   }

   // Emit the call to cudaLaunch
   llvm::Constant *cudaLaunchFn = getLaunchFn();
   llvm::Value *Arg = CGF.Builder.CreatePointerCast(CGF.CurFn, CharPtrTy);
   CGF.EmitRuntimeCallOrInvoke(cudaLaunchFn, Arg);
   CGF.EmitBranch(EndBlock);

test/CodeGenCUDA/kernel-args-alignment.cu

This file was added.

				// RUN: %clang_cc1 --std=c++11 -triple x86_64-unknown-linux-gnu -emit-llvm -o - %s | \
				// RUN: FileCheck -check-prefix HOST -check-prefix CHECK %s

				// RUN: %clang_cc1 --std=c++11 -fcuda-is-device -triple nvptx64-nvidia-cuda \
				// RUN: -emit-llvm -o - %s | FileCheck -check-prefix DEVICE -check-prefix CHECK %s

				#include "Inputs/cuda.h"

				struct U {
				short x;
				} __attribute__((packed));

				struct S {
				int *ptr;
				char a;
				U u;
				};

				// Clang should generate a packed LLVM struct for S (denoted by the <>s),
				// otherwise this test isn't interesting.
				// CHECK: %struct.S = type <{ i32*, i8, %struct.U, [5 x i8] }>

				static_assert(alignof(S) == 8, "Unexpected alignment.");

				// HOST-LABEL: @_Z6kernelc1SPi
				// Marshalled kernel args should be:
				// 1. offset 0, width 1
				// 2. offset 8 (because alignof(S) == 8), width 16
				// 3. offset 24, width 8
				// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)
				// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)
				// HOST: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)

				// DEVICE-LABEL: @_Z6kernelc1SPi
				// DEVICE-SAME: i8{{[^,]*}}, %struct.S* byval align 8{{[^,]*}}, i32*
				__global__ void kernel(char a, S s, int *b) {}
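
The widths and offsets in the HOST checks can be reproduced on the host without CUDA. The following is a sketch under stated assumptions, not part of the patch: `Slot`, `alignTo`, and `marshal` are hypothetical names, the struct definitions are copied from the test, and the result assumes a 64-bit target and a compiler that supports `__attribute__((packed))` (GCC or Clang).

```cpp
#include <cstddef>
#include <vector>

struct U {
  short x;
} __attribute__((packed));

struct S {
  int *ptr;
  char a;
  U u;
};

// Hypothetical record of one marshalled argument.
struct Slot {
  std::size_t Width;
  std::size_t Offset;
};

// Round Offset up to the next multiple of Align.
constexpr std::size_t alignTo(std::size_t Offset, std::size_t Align) {
  return (Offset + Align - 1) / Align * Align;
}

// Walk the argument types in order, aligning each one to the alignment of
// its C++ type, in the same spirit as the loop in emitDeviceStubBody.
// Intended for a non-empty list of types.
template <typename... Ts> std::vector<Slot> marshal() {
  const std::size_t Widths[] = {sizeof(Ts)...};
  const std::size_t Aligns[] = {alignof(Ts)...};
  std::vector<Slot> Slots;
  std::size_t Offset = 0;
  for (std::size_t I = 0; I < sizeof...(Ts); ++I) {
    Offset = alignTo(Offset, Aligns[I]);
    Slots.push_back({Widths[I], Offset});
    Offset += Widths[I];
  }
  return Slots;
}
```

Under those assumptions, `marshal<char, S, int *>()` yields the (width, offset) pairs (1, 0), (16, 8), and (8, 24), matching the three `cudaSetupArgument` calls the test checks for.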