This is an archive of the discontinued LLVM Phabricator instance.

Differential D112504

[OpenMP] Wrap (v)printf in the new RT and use same handling for AMD
Closed · Public

Authored by jdoerfert on Oct 25 2021, 7:19 PM.

Details

Reviewers

JonChesterfield
tianshilei1992
jhuber6

Summary

NVPTX and AMD targets currently handle printf differently. For AMD,
printf is lowered to host calls, which we don't really want in OpenMP.
For NVPTX, printf calls are translated to vprintf calls, because the
NVIDIA runtime provides a device implementation of vprintf. This patch
unifies the AMD and NVPTX handling: for both targets we now emit calls
to the vprintf wrapper __llvm_omp_vprintf, which we define in our new
device runtime. The main benefit of this wrapper is that we can more
easily control (and profile) the emission of printf calls in device code.

Note: Tests are coming.
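
A minimal sketch of the lowering described in the summary, written as host-side C++ so it can be compiled anywhere. It assumes the two-argument wrapper signature from this patch (a format string plus a pointer to a packed argument buffer). The names `PackedArgs`, `fake_omp_vprintf`, and `lowered_printf_example` are hypothetical and not part of the patch. The real device wrapper forwards to vprintf instead of decoding the buffer itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical buffer layout for printf("%d %f\n", 42, 1.5): the variadic
// arguments are packed into a struct and passed by pointer.
struct PackedArgs {
  int32_t I;
  double D;
};

// Hypothetical host stand-in for __llvm_omp_vprintf. It only understands the
// single layout above; a real implementation would be driven by the format.
static int32_t fake_omp_vprintf(const char *Format, void *Arguments) {
  assert(Format && Arguments);
  PackedArgs Args;
  std::memcpy(&Args, Arguments, sizeof(Args));
  return std::printf(Format, Args.I, Args.D);
}

// What a call such as printf("%d %f\n", 42, 1.5) conceptually becomes.
static int32_t lowered_printf_example() {
  PackedArgs Buffer{42, 1.5};
  return fake_omp_vprintf("%d %f\n", &Buffer);
}
```

Because every printf in device code goes through one function, that function is a single place to add filtering or counting, which is the control and profiling benefit the summary mentions.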

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	240 ms	x64 debian > Clang.OpenMP::cancel_codegen.cpp
	240 ms	x64 debian > Clang.OpenMP::cancellation_point_codegen.cpp
	150 ms	x64 debian > Clang.OpenMP::debug-info-openmp-array.cpp
	930 ms	x64 debian > Clang.OpenMP::declare_mapper_codegen.cpp
	130 ms	x64 debian > Clang.OpenMP::declare_reduction_codegen.cpp
		View Full Test Results (170 Failed)

Event Timeline

jdoerfert created this revision. Oct 25 2021, 7:19 PM

Herald added subscribers: guansong, bollu, yaxunl. Oct 25 2021, 7:19 PM

jdoerfert requested review of this revision. Oct 25 2021, 7:19 PM

Herald added projects: Restricted Project, Restricted Project. Oct 25 2021, 7:19 PM

Herald added subscribers: cfe-commits, sstefan1.

Harbormaster completed remote builds in B130585: Diff 382165. Oct 25 2021, 7:19 PM

Actually use the new wrapper for OpenMP offload targeting AMD (and the new RT)

Harbormaster completed remote builds in B130590: Diff 382171. Oct 25 2021, 8:23 PM

That's an interesting approach.

Do you happen to know where I can find details of the data format behind that void*? Have been meaning to look at writing printf for amdgpu as host side decoding of that buffer. If the compiler knows how long it is, that would be a useful third argument.

jdoerfert added inline comments. Oct 26 2021, 6:33 AM

clang/lib/CodeGen/CGGPUBuiltin.cpp
Line 128:

In D112504#3086474, @JonChesterfield wrote:
> That's an interesting approach.
>
> Do you happen to know where I can find details of the data format behind that void*? Have been meaning to look at writing printf for amdgpu as host side decoding of that buffer. If the compiler knows how long it is, that would be a useful third argument.

We actually do know. Above we allocate and fill the buffer. For the OpenMP wrapper you could easily add a third argument later in order to facilitate an OpenMP runtime printf impl. I would even like it to be target agnostic (e.g., replace the default CUDA route on request). That said, we should tackle that separately, wdyt?

Nice! Yep, can add a size argument later. Will want it to control copying the payload over to the host. Or we could allocate a buffer that the corresponding runtime can handle directly (pinned/fine grain) and skip that copy.
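
A rough sketch of why a size argument would help, assuming a hypothetical three-argument variant of the wrapper. This patch keeps the two-argument signature, and `copy_printf_payload` is an illustrative name, not an existing runtime function. With the byte count available, the payload can be copied to the host without knowing the layout of the packed arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copy the packed printf arguments into host-owned
// storage. Size is the extra argument discussed above; the compiler knows it
// because it allocates and fills the buffer.
static std::vector<unsigned char> copy_printf_payload(const void *Arguments,
                                                      uint32_t Size) {
  assert(Arguments != nullptr || Size == 0);
  std::vector<unsigned char> HostCopy(Size);
  if (Size != 0)
    std::memcpy(HostCopy.data(), Arguments, Size);
  return HostCopy;
}
```

The alternative raised in the comment, allocating the buffer in memory the runtime can read directly, would make this copy unnecessary.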

JonChesterfield accepted this revision. Oct 26 2021, 12:12 PM

This revision is now accepted and ready to land. Oct 26 2021, 12:12 PM

This doesn't apply against main; the diff is relative to something that isn't in main.

-rebase on main

Harbormaster completed remote builds in B131008: Diff 382747. Oct 27 2021, 12:36 PM

JonChesterfield mentioned this in D112680: [OpenMP] Lower printf to __llvm_omp_vprintf. Oct 27 2021, 5:55 PM

JonChesterfield mentioned this in rGdb81d8f6c4d6: [OpenMP] Lower printf to __llvm_omp_vprintf. Nov 8 2021, 10:38 AM

Committed as part of D112680.

JonChesterfield mentioned this in rG27177b82d4ca: [OpenMP] Lower printf to __llvm_omp_vprintf. Nov 10 2021, 7:31 AM

Revision Contents

Path	Size
clang/lib/CodeGen/CGBuiltin.cpp	3 lines
clang/lib/CodeGen/CGGPUBuiltin.cpp	26 lines
openmp/libomptarget/DeviceRTL/include/Debug.h	5 lines
openmp/libomptarget/DeviceRTL/include/Interface.h	3 lines
openmp/libomptarget/DeviceRTL/src/Debug.cpp	9 lines

Diff 382747

clang/lib/CodeGen/CGBuiltin.cpp

(5,054 lines not shown)
case Builtin::BI__builtin_load_halff: {		case Builtin::BI__builtin_load_halff: {
Address Address = EmitPointerWithAlignment(E->getArg(0));		Address Address = EmitPointerWithAlignment(E->getArg(0));
Value *HalfVal = Builder.CreateLoad(Address);		Value *HalfVal = Builder.CreateLoad(Address);
return RValue::get(Builder.CreateFPExt(HalfVal, Builder.getFloatTy()));		return RValue::get(Builder.CreateFPExt(HalfVal, Builder.getFloatTy()));
}		}
case Builtin::BIprintf:		case Builtin::BIprintf:
if (getTarget().getTriple().isNVPTX())		if (getTarget().getTriple().isNVPTX())
return EmitNVPTXDevicePrintfCallExpr(E, ReturnValue);		return EmitNVPTXDevicePrintfCallExpr(E, ReturnValue);
if (getTarget().getTriple().getArch() == Triple::amdgcn &&		if (getTarget().getTriple().getArch() == Triple::amdgcn &&
getLangOpts().HIP)		(getLangOpts().HIP || (getLangOpts().OpenMPIsDevice &&
		getLangOpts().OpenMPTargetNewRuntime)))
return EmitAMDGPUDevicePrintfCallExpr(E, ReturnValue);		return EmitAMDGPUDevicePrintfCallExpr(E, ReturnValue);
break;		break;
case Builtin::BI__builtin_canonicalize:		case Builtin::BI__builtin_canonicalize:
case Builtin::BI__builtin_canonicalizef:		case Builtin::BI__builtin_canonicalizef:
case Builtin::BI__builtin_canonicalizef16:		case Builtin::BI__builtin_canonicalizef16:
case Builtin::BI__builtin_canonicalizel:		case Builtin::BI__builtin_canonicalizel:
return RValue::get(emitUnaryBuiltin(*this, E, Intrinsic::canonicalize));		return RValue::get(emitUnaryBuiltin(*this, E, Intrinsic::canonicalize));

(13,704 lines not shown)

clang/lib/CodeGen/CGGPUBuiltin.cpp

(15 lines not shown)
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instruction.h"		#include "llvm/IR/Instruction.h"
#include "llvm/Support/MathExtras.h"		#include "llvm/Support/MathExtras.h"
#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"		#include "llvm/Transforms/Utils/AMDGPUEmitPrintf.h"

using namespace clang;		using namespace clang;
using namespace CodeGen;		using namespace CodeGen;

static llvm::Function *GetVprintfDeclaration(llvm::Module &M) {		static llvm::Function *GetVprintfDeclaration(CodeGenModule &CGM) {
		bool UsesNewOpenMPDeviceRuntime = CGM.getLangOpts().OpenMPIsDevice &&
		CGM.getLangOpts().OpenMPTargetNewRuntime;
		const char *Name =
		UsesNewOpenMPDeviceRuntime ? "__llvm_omp_vprintf" : "vprintf";
		llvm::Module &M = CGM.getModule();
llvm::Type *ArgTypes[] = {llvm::Type::getInt8PtrTy(M.getContext()),		llvm::Type *ArgTypes[] = {llvm::Type::getInt8PtrTy(M.getContext()),
llvm::Type::getInt8PtrTy(M.getContext())};		llvm::Type::getInt8PtrTy(M.getContext())};
llvm::FunctionType *VprintfFuncType = llvm::FunctionType::get(		llvm::FunctionType *VprintfFuncType = llvm::FunctionType::get(
llvm::Type::getInt32Ty(M.getContext()), ArgTypes, false);		llvm::Type::getInt32Ty(M.getContext()), ArgTypes, false);

if (auto* F = M.getFunction("vprintf")) {		if (auto *F = M.getFunction(Name)) {
// Our CUDA system header declares vprintf with the right signature, so		// Our CUDA system header declares vprintf with the right signature, so
// nobody else should have been able to declare vprintf with a bogus		// nobody else should have been able to declare vprintf with a bogus
// signature.		// signature. The OpenMP device runtime provides a wrapper around vprintf
		// which we use here. The signature should match though.
assert(F->getFunctionType() == VprintfFuncType);		assert(F->getFunctionType() == VprintfFuncType);
return F;		return F;
}		}

// vprintf doesn't already exist; create a declaration and insert it into the		// vprintf, or for OpenMP device offloading the vprintf wrapper, doesn't
// module.		// already exist; create a declaration and insert it into the module.
return llvm::Function::Create(		return llvm::Function::Create(
VprintfFuncType, llvm::GlobalVariable::ExternalLinkage, "vprintf", &M);		VprintfFuncType, llvm::GlobalVariable::ExternalLinkage, Name, &M);
}		}

// Transforms a call to printf into a call to the NVPTX vprintf syscall (which		// Transforms a call to printf into a call to the NVPTX vprintf syscall (which
// isn't particularly special; it's invoked just like a regular function).		// isn't particularly special; it's invoked just like a regular function).
// vprintf takes two args: A format string, and a pointer to a buffer containing		// vprintf takes two args: A format string, and a pointer to a buffer containing
// the varargs.		// the varargs.
//		//
// For example, the call		// For example, the call
(62 lines not shown)
for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {		for (unsigned I = 1, NumArgs = Args.size(); I < NumArgs; ++I) {
llvm::Value *P = Builder.CreateStructGEP(AllocaTy, Alloca, I - 1);		llvm::Value *P = Builder.CreateStructGEP(AllocaTy, Alloca, I - 1);
llvm::Value *Arg = Args[I].getRValue(*this).getScalarVal();		llvm::Value *Arg = Args[I].getRValue(*this).getScalarVal();
Builder.CreateAlignedStore(Arg, P, DL.getPrefTypeAlign(Arg->getType()));		Builder.CreateAlignedStore(Arg, P, DL.getPrefTypeAlign(Arg->getType()));
}		}
BufferPtr = Builder.CreatePointerCast(Alloca, llvm::Type::getInt8PtrTy(Ctx));		BufferPtr = Builder.CreatePointerCast(Alloca, llvm::Type::getInt8PtrTy(Ctx));
}		}

// Invoke vprintf and return.		// Invoke vprintf and return.
llvm::Function* VprintfFunc = GetVprintfDeclaration(CGM.getModule());		llvm::Function *VprintfFunc = GetVprintfDeclaration(CGM);
return RValue::get(Builder.CreateCall(		return RValue::get(Builder.CreateCall(
VprintfFunc, {Args[0].getRValue(*this).getScalarVal(), BufferPtr}));		VprintfFunc, {Args[0].getRValue(*this).getScalarVal(), BufferPtr}));
}		}

RValue		RValue
CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(const CallExpr *E,		CodeGenFunction::EmitAMDGPUDevicePrintfCallExpr(const CallExpr *E,
ReturnValueSlot ReturnValue) {		ReturnValueSlot ReturnValue) {
assert(getTarget().getTriple().getArch() == llvm::Triple::amdgcn);		assert(getTarget().getTriple().getArch() == llvm::Triple::amdgcn);
assert(E->getBuiltinCallee() == Builtin::BIprintf ||		assert(E->getBuiltinCallee() == Builtin::BIprintf ||
E->getBuiltinCallee() == Builtin::BI__builtin_printf);		E->getBuiltinCallee() == Builtin::BI__builtin_printf);
assert(E->getNumArgs() >= 1); // printf always has at least one arg.		assert(E->getNumArgs() >= 1); // printf always has at least one arg.

		// For OpenMP target offloading we go with a modified nvptx printf method.
		// Basically creating calls to __llvm_omp_vprintf with the arguments and
		// dealing with the details in the device runtime itself.
		if (getLangOpts().OpenMPIsDevice && getLangOpts().OpenMPTargetNewRuntime)
		return EmitNVPTXDevicePrintfCallExpr(E, ReturnValue);

CallArgList CallArgs;		CallArgList CallArgs;
EmitCallArgs(CallArgs,		EmitCallArgs(CallArgs,
E->getDirectCallee()->getType()->getAs<FunctionProtoType>(),		E->getDirectCallee()->getType()->getAs<FunctionProtoType>(),
E->arguments(), E->getDirectCallee(),		E->arguments(), E->getDirectCallee(),
/* ParamsToSkip = */ 0);		/* ParamsToSkip = */ 0);

SmallVector<llvm::Value *, 8> Args;		SmallVector<llvm::Value *, 8> Args;
for (auto A : CallArgs) {		for (auto A : CallArgs) {
(16 lines not shown)

openmp/libomptarget/DeviceRTL/include/Debug.h

	(26 lines not shown)

	/// Print			/// Print
	/// TODO: For now we have to use macros to guard the code because Clang lowers			/// TODO: For now we have to use macros to guard the code because Clang lowers
	/// `printf` to different function calls on NVPTX and AMDGCN platforms, and it			/// `printf` to different function calls on NVPTX and AMDGCN platforms, and it
	/// doesn't work for AMDGCN. After it can work on AMDGCN, we will remove the			/// doesn't work for AMDGCN. After it can work on AMDGCN, we will remove the
	/// macro.			/// macro.
	/// {			/// {

	#ifndef __AMDGCN__
	extern "C" {			extern "C" {
	int printf(const char *format, ...);			int printf(const char *format, ...);
	}			}

	#define PRINTF(fmt, ...) (void)printf(fmt, __VA_ARGS__);			#define PRINTF(fmt, ...) (void)printf(fmt, __VA_ARGS__);
	#define PRINT(str) PRINTF("%s", str)			#define PRINT(str) PRINTF("%s", str)
	#else
	#define PRINTF(fmt, ...)
	#define PRINT(str)
	#endif

	///}			///}

	/// Enter a debugging scope for performing function traces. Enabled with			/// Enter a debugging scope for performing function traces. Enabled with
	/// FunctionTracting set in the debug kind.			/// FunctionTracting set in the debug kind.
	#define FunctionTracingRAII() \			#define FunctionTracingRAII() \
	DebugEntryRAII Entry(__LINE__, __PRETTY_FUNCTION__);			DebugEntryRAII Entry(__LINE__, __PRETTY_FUNCTION__);

	(9 lines not shown)

openmp/libomptarget/DeviceRTL/include/Interface.h

	(346 lines not shown)
	///}			///}

	/// Shuffle			/// Shuffle
	///			///
	///{			///{
	int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size);			int32_t __kmpc_shuffle_int32(int32_t val, int16_t delta, int16_t size);
	int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size);			int64_t __kmpc_shuffle_int64(int64_t val, int16_t delta, int16_t size);
	///}			///}

				/// Printf
				int32_t __llvm_omp_vprintf(const char *Format, void *Arguments);
	}			}

	#endif			#endif

openmp/libomptarget/DeviceRTL/src/Debug.cpp

	(29 lines not shown)
	}			}

	void __assert_fail(const char *assertion, const char *file, unsigned line,			void __assert_fail(const char *assertion, const char *file, unsigned line,
	const char *function) {			const char *function) {
	PRINTF("%s:%u: %s: Assertion `%s' failed.\n", file, line, function,			PRINTF("%s:%u: %s: Assertion `%s' failed.\n", file, line, function,
	assertion);			assertion);
	__builtin_trap();			__builtin_trap();
	}			}

				// We do not have a vprintf implementation for AMD GPU yet so we use a stub.
				#pragma omp begin declare variant match(device = {arch(amdgcn)})
				int32_t vprintf(const char *, void *) { return 0; }
				#pragma omp end declare variant

				int32_t __llvm_omp_vprintf(const char *Format, void *Arguments) {
				return vprintf(Format, Arguments);
				}
	}			}

	/// Current indentation level for the function trace. Only accessed by thread 0.			/// Current indentation level for the function trace. Only accessed by thread 0.
	static uint32_t Level = 0;			static uint32_t Level = 0;
	#pragma omp allocate(Level) allocator(omp_pteam_mem_alloc)			#pragma omp allocate(Level) allocator(omp_pteam_mem_alloc)

	DebugEntryRAII::DebugEntryRAII(const unsigned Line, const char *Function) {			DebugEntryRAII::DebugEntryRAII(const unsigned Line, const char *Function) {
	if (config::isDebugMode(config::DebugKind::FunctionTracing) &&			if (config::isDebugMode(config::DebugKind::FunctionTracing) &&
	(18 lines not shown)