This is an archive of the discontinued LLVM Phabricator instance.

Differential D16664

[CUDA] Generate CUDA's printf alloca in its function's entry block.
Closed, Public

Authored by jlebar on Jan 27 2016, 6:23 PM.

Details

Reviewers

echristo
rnk

Commits

rGc0e42750da5f: [CUDA] Generate CUDA's printf alloca in its function's entry block.
rC259122: [CUDA] Generate CUDA's printf alloca in its function's entry block.
rL259122: [CUDA] Generate CUDA's printf alloca in its function's entry block.

Summary

This is necessary to prevent llvm from generating stacksave intrinsics
around this alloca. NVVM doesn't have a stack, and we don't handle said
intrinsics.

Event Timeline

jlebar updated this revision to Diff 46206. Jan 27 2016, 6:23 PM

jlebar retitled this revision to [CUDA] Generate CUDA's printf alloca in its function's entry block.

jlebar updated this object.

jlebar added a reviewer: rnk.

jlebar added subscribers: tra, echristo, jhen, cfe-commits.

echristo added inline comments. Jan 27 2016, 6:41 PM

lib/CodeGen/CGCUDABuiltin.cpp
95	Not quite, you'll want to use AllocaInsertPt for this or even CreateTempAlloca.

Address echristo's review comments.

jlebar updated this object. Jan 28 2016, 10:27 AM

jlebar marked an inline comment as done.

jlebar added inline comments.

lib/CodeGen/CGCUDABuiltin.cpp
95	Aha. Used AllocaInsertPt because it doesn't seem that there's an overload of CreateTempAlloca that takes an explicit size.

rnk added inline comments. Jan 28 2016, 11:50 AM

lib/CodeGen/CGCUDABuiltin.cpp
91–92	The fact that allocas for local variables should always go in the entry block is pretty widespread cultural knowledge in LLVM and clang. Most readers aren't going to need this comment, unless you expect that people working on CUDA won't have that background. Plus, if you use CreateTempAlloca, there won't be any question about which insert point should be used.
95	You can still use CreateTempAlloca by making an `[i8 x N]` LLVM type. You'll have to use CreateStructGEP below for forming GEPs. Overall I think that'd be nicer, since you don't need to worry about insertion at all.

echristo added inline comments. Jan 28 2016, 11:54 AM

lib/CodeGen/CGCUDABuiltin.cpp
93	Also you'd have wanted to insert it before anyhow.
95	+1 :)

Use a struct rather than an i8 buffer.

Thank you for the reviews.

Please have another look; I switched to using a struct proper. It's a lot cleaner! We're now assuming that the struct is aligned in the same way as vprintf wants, but if anything I expect this new code is more likely to match what it wants.

lib/CodeGen/CGCUDABuiltin.cpp
91–92	OK, yeah, I also don't like comments that explain something that everyone other than the author knows. Thanks.
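
The assumption above, that a struct's layout matches the hand-padded buffer the earlier diff built, can be checked on the host with a small sketch. `PrintfArgs`, `alignUp`, and `manualOffset` are hypothetical names introduced only for this illustration. The sketch uses the host compiler's `alignof` and `offsetof`, whereas the patch uses LLVM's `DataLayout` for the NVPTX target, so it demonstrates the idea rather than the device layout itself.

```cpp
#include <cstddef>

// Hypothetical stand-in for the argument struct of
// printf("%d %lld %f", 1, 2ll, 3.0).
struct PrintfArgs {
  int a1;
  long long a2;
  double a3;
};

// Round Offset up to the next multiple of Align (Align must be nonzero).
constexpr std::size_t alignUp(std::size_t Offset, std::size_t Align) {
  return (Offset + Align - 1) / Align * Align;
}

// Offset of argument Index (0, 1, or 2) when packing by hand: pad to each
// argument's alignment, then advance by its size.
constexpr std::size_t manualOffset(int Index) {
  const std::size_t Sizes[] = {sizeof(int), sizeof(long long), sizeof(double)};
  const std::size_t Aligns[] = {alignof(int), alignof(long long),
                                alignof(double)};
  std::size_t Offset = 0;
  for (int I = 0; I <= Index; ++I) {
    Offset = alignUp(Offset, Aligns[I]);
    if (I == Index)
      break;
    Offset += Sizes[I];
  }
  return Offset;
}

// The compiler's struct layout agrees with the manual packing.
static_assert(manualOffset(0) == offsetof(PrintfArgs, a1), "a1 offset");
static_assert(manualOffset(1) == offsetof(PrintfArgs, a2), "a2 offset");
static_assert(manualOffset(2) == offsetof(PrintfArgs, a3), "a3 offset");
```

One difference worth keeping in mind: a struct's size includes tail padding up to its own alignment, while the hand-padded buffer size stopped at the end of the last argument. That only makes the struct-based buffer larger, never smaller.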

One inline nit then LGTM.

-eric

lib/CodeGen/CGCUDABuiltin.cpp
87	on the wrong side ;)

This revision is now accepted and ready to land. Jan 28 2016, 2:45 PM

jlebar marked an inline comment as done. Jan 28 2016, 4:02 PM

jlebar added inline comments.

lib/CodeGen/CGCUDABuiltin.cpp
87	Argh, I really need to set up a linter. I'm still doing readability reviews, and I cannot brain two styles. Sorry to keep wasting your time with silly stuff like this.

Closed by commit rL259122: [CUDA] Generate CUDA's printf alloca in its function's entry block. (authored by jlebar). Jan 28 2016, 4:02 PM

This revision was automatically updated to reflect the committed changes.

jlebar marked an inline comment as done.

echristo added inline comments. Jan 28 2016, 4:06 PM

lib/CodeGen/CGCUDABuiltin.cpp
87	You could just use clang-format on everything :)

Do you have a script that will take as input a commit range and git
commit --amend clang-tidy fixes for lines modified in that range?
Because if so,

a) I would be your best friend forever, and
b) It should be simple to convert that into a linter for arc to catch
the case when I forget to run said tool.

Hm, well, https://llvm.org/svn/llvm-project/cfe/trunk/tools/clang-format/git-clang-format
is close... Not sure if that triggers the bff clause, will consult my
attorney.

Revision Contents

Path                           Size
lib/CodeGen/CGCUDABuiltin.cpp  57 lines
test/CodeGenCUDA/printf.cu     56 lines

Diff 46314

lib/CodeGen/CGCUDABuiltin.cpp

	// the varargs.			// the varargs.
	//			//
	// For example, the call			// For example, the call
	//			//
	// printf("format string", arg1, arg2, arg3);			// printf("format string", arg1, arg2, arg3);
	//			//
	// is converted into something resembling			// is converted into something resembling
	//			//
	// char* buf = alloca(...);			// struct Tmp {
	// *reinterpret_cast<Arg1*>(buf) = arg1;			// Arg1 a1;
	// *reinterpret_cast<Arg2*>(buf + ...) = arg2;			// Arg2 a2;
	// *reinterpret_cast<Arg3*>(buf + ...) = arg3;			// Arg3 a3;
				// };
				// char* buf = alloca(sizeof(Tmp));
				// *(Tmp*)buf = {a1, a2, a3};
	// vprintf("format string", buf);			// vprintf("format string", buf);
	//			//
	// buf is aligned to the max of {alignof(Arg1), ...}. Furthermore, each of the			// buf is aligned to the max of {alignof(Arg1), ...}. Furthermore, each of the
	// args is itself aligned to its preferred alignment.			// args is itself aligned to its preferred alignment.
	//			//
	// Note that by the time this function runs, E's args have already undergone the			// Note that by the time this function runs, E's args have already undergone the
	// standard C vararg promotion (short -> int, float -> double, etc.).			// standard C vararg promotion (short -> int, float -> double, etc.).
	RValue			RValue
	CodeGenFunction::EmitCUDADevicePrintfCallExpr(const CallExpr *E,			CodeGenFunction::EmitCUDADevicePrintfCallExpr(const CallExpr *E,
	ReturnValueSlot ReturnValue) {			ReturnValueSlot ReturnValue) {
	assert(getLangOpts().CUDA);			assert(getLangOpts().CUDA);
	assert(getLangOpts().CUDAIsDevice);			assert(getLangOpts().CUDAIsDevice);
	assert(E->getBuiltinCallee() == Builtin::BIprintf);			assert(E->getBuiltinCallee() == Builtin::BIprintf);
	assert(E->getNumArgs() >= 1); // printf always has at least one arg.			assert(E->getNumArgs() >= 1); // printf always has at least one arg.

	const llvm::DataLayout &DL = CGM.getDataLayout();			const llvm::DataLayout &DL = CGM.getDataLayout();
	llvm::LLVMContext &Ctx = CGM.getLLVMContext();			llvm::LLVMContext &Ctx = CGM.getLLVMContext();

	CallArgList Args;			CallArgList Args;
	EmitCallArgs(Args,			EmitCallArgs(Args,
	E->getDirectCallee()->getType()->getAs<FunctionProtoType>(),			E->getDirectCallee()->getType()->getAs<FunctionProtoType>(),
	E->arguments(), E->getDirectCallee(),			E->arguments(), E->getDirectCallee(),
	/* ParamsToSkip = */ 0);			/* ParamsToSkip = */ 0);

	// Figure out how large of a buffer we need to hold our varargs and how			// Construct and fill the args buffer that we'll pass to vprintf.
	// aligned the buffer needs to be. We start iterating at Arg[1], because			llvm::Value* BufferPtr;
	// that's our first vararg.			if (Args.size() <= 1) {
	unsigned BufSize = 0;
	unsigned BufAlign = 0;
	for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {
	const RValue& RV = Args[I].RV;
	llvm::Type* Ty = RV.getScalarVal()->getType();

	auto Align = DL.getPrefTypeAlignment(Ty);
	BufAlign = std::max(BufAlign, Align);
	// Add padding required to keep the current arg aligned.
	BufSize = llvm::alignTo(BufSize, Align);
	BufSize += DL.getTypeAllocSize(Ty);
	}

	// Construct and fill the buffer.
	llvm::Value* BufferPtr = nullptr;
	if (BufSize == 0) {
	// If there are no args, pass a null pointer to vprintf.			// If there are no args, pass a null pointer to vprintf.
	BufferPtr = llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(Ctx));			BufferPtr = llvm::ConstantPointerNull::get(llvm::Type::getInt8PtrTy(Ctx));
	} else {			} else {
	BufferPtr = Builder.Insert(new llvm::AllocaInst(			llvm::SmallVector<llvm::Type *, 8> ArgTypes;
	llvm::Type::getInt8Ty(Ctx), llvm::ConstantInt::get(Int32Ty, BufSize),			for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I)
	BufAlign, "printf_arg_buf"));			ArgTypes.push_back(Args[I].RV.getScalarVal()->getType());
				llvm::Type *AllocaTy = llvm::StructType::create(ArgTypes, "printf_args");
				llvm::Value *Alloca = CreateTempAlloca(AllocaTy);

	unsigned Offset = 0;
	for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {			for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {
				llvm::Value *P = Builder.CreateStructGEP(AllocaTy, Alloca, I - 1);
	llvm::Value *Arg = Args[I].RV.getScalarVal();			llvm::Value *Arg = Args[I].RV.getScalarVal();
	llvm::Type *Ty = Arg->getType();			Builder.CreateAlignedStore(Arg, P, DL.getPrefTypeAlignment(Arg->getType()));
	auto Align = DL.getPrefTypeAlignment(Ty);

	// Pad the buffer to Arg's alignment.
	Offset = llvm::alignTo(Offset, Align);

	// Store Arg into the buffer at Offset.
	llvm::Value *GEP =
	Builder.CreateGEP(BufferPtr, llvm::ConstantInt::get(Int32Ty, Offset));
	llvm::Value *Cast = Builder.CreateBitCast(GEP, Ty->getPointerTo());
	Builder.CreateAlignedStore(Arg, Cast, Align);
	Offset += DL.getTypeAllocSize(Ty);
	}			}
				BufferPtr = Builder.CreatePointerCast(Alloca, llvm::Type::getInt8PtrTy(Ctx));
	}			}

	// Invoke vprintf and return.			// Invoke vprintf and return.
	llvm::Function* VprintfFunc = GetVprintfDeclaration(CGM.getModule());			llvm::Function* VprintfFunc = GetVprintfDeclaration(CGM.getModule());
	return RValue::get(			return RValue::get(
	Builder.CreateCall(VprintfFunc, {Args[0].RV.getScalarVal(), BufferPtr}));			Builder.CreateCall(VprintfFunc, {Args[0].RV.getScalarVal(), BufferPtr}));
	}			}
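
The note in the function comment that arguments arrive already promoted (short -> int, float -> double) can be illustrated at compile time. This is only a sketch: `VarargPromoted` is a hypothetical trait written for this illustration, not a clang or LLVM API, and it models just integral promotion and float-to-double.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical trait (not part of clang): the type a value has after the
// default argument promotions applied to variadic arguments.
// Primary template: types passed through unchanged (double, pointers, ...).
template <typename T, bool IsIntegral = std::is_integral<T>::value>
struct VarargPromoted {
  using type = T;
};

// Integral types undergo integral promotion; unary plus performs exactly
// that promotion, so its result type is the promoted type.
template <typename T>
struct VarargPromoted<T, true> {
  using type = decltype(+std::declval<T>());
};

// float is promoted to double.
template <>
struct VarargPromoted<float, false> {
  using type = double;
};

// The argument types used by CheckTypes: (char)1, (short)2, 3.0f, 4.0.
static_assert(std::is_same<VarargPromoted<char>::type, int>::value, "char");
static_assert(std::is_same<VarargPromoted<short>::type, int>::value, "short");
static_assert(std::is_same<VarargPromoted<float>::type, double>::value,
              "float");
static_assert(std::is_same<VarargPromoted<double>::type, double>::value,
              "double");
```

This is why the buffer for `printf("%d %d %f %f", (char)1, (short)2, 3.0f, 4.0)` holds two 32-bit integers followed by two doubles.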

test/CodeGenCUDA/printf.cu

	// REQUIRES: x86-registered-target			// REQUIRES: x86-registered-target
	// REQUIRES: nvptx-registered-target			// REQUIRES: nvptx-registered-target

	// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -fcuda-is-device -emit-llvm \			// RUN: %clang_cc1 -triple nvptx64-nvidia-cuda -fcuda-is-device -emit-llvm \
	// RUN: -o - %s | FileCheck %s			// RUN: -o - %s | FileCheck %s

	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	extern "C" __device__ int vprintf(const char*, const char*);			extern "C" __device__ int vprintf(const char*, const char*);

	// Check a simple call to printf end-to-end.			// Check a simple call to printf end-to-end.
				// CHECK: [[SIMPLE_PRINTF_TY:%[a-zA-Z0-9_]+]] = type { i32, i64, double }
	__device__ int CheckSimple() {			__device__ int CheckSimple() {
				// CHECK: [[BUF:%[a-zA-Z0-9_]+]] = alloca [[SIMPLE_PRINTF_TY]]
	// CHECK: [[FMT:%[0-9]+]] = load{{.*}}%fmt			// CHECK: [[FMT:%[0-9]+]] = load{{.*}}%fmt
	const char* fmt = "%d";			const char* fmt = "%d %lld %f";
	// CHECK: [[BUF:%[a-zA-Z0-9_]+]] = alloca i8, i32 4, align 4			// CHECK: [[PTR0:%[0-9]+]] = getelementptr inbounds [[SIMPLE_PRINTF_TY]], [[SIMPLE_PRINTF_TY]]* [[BUF]], i32 0, i32 0
	// CHECK: [[PTR:%[0-9]+]] = getelementptr i8, i8* [[BUF]], i32 0			// CHECK: store i32 1, i32* [[PTR0]], align 4
	// CHECK: [[CAST:%[0-9]+]] = bitcast i8* [[PTR]] to i32*			// CHECK: [[PTR1:%[0-9]+]] = getelementptr inbounds [[SIMPLE_PRINTF_TY]], [[SIMPLE_PRINTF_TY]]* [[BUF]], i32 0, i32 1
	// CHECK: store i32 42, i32* [[CAST]], align 4			// CHECK: store i64 2, i64* [[PTR1]], align 8
	// CHECK: [[RET:%[0-9]+]] = call i32 @vprintf(i8* [[FMT]], i8* [[BUF]])			// CHECK: [[PTR2:%[0-9]+]] = getelementptr inbounds [[SIMPLE_PRINTF_TY]], [[SIMPLE_PRINTF_TY]]* [[BUF]], i32 0, i32 2
				// CHECK: store double 3.0{{[^,]*}}, double* [[PTR2]], align 8
				// CHECK: [[BUF_CAST:%[0-9]+]] = bitcast [[SIMPLE_PRINTF_TY]]* [[BUF]] to i8*
				// CHECK: [[RET:%[0-9]+]] = call i32 @vprintf(i8* [[FMT]], i8* [[BUF_CAST]])
	// CHECK: ret i32 [[RET]]			// CHECK: ret i32 [[RET]]
	return printf(fmt, 42);			return printf(fmt, 1, 2ll, 3.0);
	}

	// Check that the args' types are promoted correctly when we call printf.
	__device__ void CheckTypes() {
	// CHECK: alloca {{.*}} align 8
	// CHECK: getelementptr {{.*}} i32 0
	// CHECK: bitcast {{.*}} to i32*
	// CHECK: getelementptr {{.*}} i32 4
	// CHECK: bitcast {{.*}} to i32*
	// CHECK: getelementptr {{.*}} i32 8
	// CHECK: bitcast {{.*}} to double*
	// CHECK: getelementptr {{.*}} i32 16
	// CHECK: bitcast {{.*}} to double*
	printf("%d %d %f %f", (char)1, (short)2, 3.0f, 4.0);
	}

	// Check that the args are aligned properly in the buffer.
	__device__ void CheckAlign() {
	// CHECK: alloca i8, i32 40, align 8
	// CHECK: getelementptr {{.*}} i32 0
	// CHECK: getelementptr {{.*}} i32 8
	// CHECK: getelementptr {{.*}} i32 16
	// CHECK: getelementptr {{.*}} i32 20
	// CHECK: getelementptr {{.*}} i32 24
	// CHECK: getelementptr {{.*}} i32 32
	printf("%d %f %d %d %d %lld", 1, 2.0, 3, 4, 5, (long long)6);
	}			}

	__device__ void CheckNoArgs() {			__device__ void CheckNoArgs() {
	// CHECK: call i32 @vprintf({{.*}}, i8* null){{$}}			// CHECK: call i32 @vprintf({{.*}}, i8* null){{$}}
	printf("hello, world!");			printf("hello, world!");
	}			}

				// Check that printf's alloca happens in the entry block, not inside the if
				// statement.
				__device__ bool foo();
				__device__ void CheckAllocaIsInEntryBlock() {
				// CHECK: alloca %printf_args
				// CHECK: call {{.*}} @_Z3foov()
				if (foo()) {
				printf("%d", 42);
				}
				}
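
The offsets in the removed CheckAlign expectations (0, 8, 16, 20, 24, 32, with a 40-byte buffer) follow from the pad-then-store loop that this diff deletes. A minimal sketch of that arithmetic, assuming each promoted argument's alignment equals its size (4 bytes for int, 8 for double and long long); `kArgSizes`, `roundUpTo`, and `packedOffset` are hypothetical names used only here:

```cpp
#include <cstddef>

// Assumed sizes in bytes (alignment taken to be equal to size) of the
// arguments of printf("%d %f %d %d %d %lld", ...):
// int, double, int, int, int, long long.
constexpr std::size_t kArgSizes[] = {4, 8, 4, 4, 4, 8};
constexpr std::size_t kNumArgs = sizeof(kArgSizes) / sizeof(kArgSizes[0]);

// Round Offset up to the next multiple of Align (Align must be nonzero).
constexpr std::size_t roundUpTo(std::size_t Offset, std::size_t Align) {
  return (Offset + Align - 1) / Align * Align;
}

// Byte offset at which argument Index is stored. Passing Index == kNumArgs
// yields the total buffer size, which gets no trailing padding.
constexpr std::size_t packedOffset(std::size_t Index) {
  std::size_t Offset = 0;
  for (std::size_t I = 0; I < Index; ++I) {
    // Pad to the argument's alignment, then skip past it.
    Offset = roundUpTo(Offset, kArgSizes[I]) + kArgSizes[I];
  }
  return Index < kNumArgs ? roundUpTo(Offset, kArgSizes[Index]) : Offset;
}

static_assert(packedOffset(1) == 8, "double is padded up to offset 8");
static_assert(packedOffset(5) == 32, "long long is padded up to offset 32");
static_assert(packedOffset(kNumArgs) == 40, "total buffer size");
```

With the struct-based approach the same padding comes from the struct's layout, so the loop is no longer needed in the code generator.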