This is an archive of the discontinued LLVM Phabricator instance.

Differential D63164

[HIP] Add option to force lambda naming following ODR in HIP/CUDA.
Abandoned · Public

Authored by hliao on Jun 11 2019, 1:30 PM.

Details

Reviewers

tra
yaxunl
rsmith

Summary

By default, Clang names lambdas that are not required to follow the ODR with its own scheme: each one gets an ID that is unique only within the translation unit (TU), so the ID is neither unique nor consistent across TUs.
In CUDA/HIP, a lambda marked __device__ or __host__ __device__ (or an extended lambda) may be used to instantiate a __global__ template function. If that lambda is not named according to the ODR, the device compilation may produce a kernel name that does not match the one from the host compilation, because of the anonymous type ID assignment described above.
This patch introduces a new language option, -fcuda-force-lambda-odr, which forces ODR naming for all lambdas so that they are named consistently across TUs, including in the device compilation. This fixes the assertion that checks device kernel names and ensures that name-based resolution finds the correct device binary from the kernel name generated in the host compilation.
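
The mismatch can be pictured with a toy model in plain C++. This is only a sketch and not Clang code: `Lambda`, `LazyNamer`, `odrName`, and `lazyNameOfLast` are hypothetical names, and it assumes that the host and device compilations end up naming a different sequence of lambdas. Under that assumption, a first-come ID depends on what was named earlier, while a name derived from the enclosing context and a per-context number does not.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model only: none of these types exist in Clang.
// A lambda is identified by its enclosing function and its index within it.
struct Lambda {
    std::string enclosing;  // name of the enclosing function
    unsigned index;         // position among the lambdas of that function
};

// TU-local scheme: an ID is handed out the first time a lambda is named,
// so the result depends on which lambdas were named earlier.
class LazyNamer {
public:
    std::string name(const Lambda &l) {
        const std::string key = l.enclosing + "#" + std::to_string(l.index);
        auto it = ids_.find(key);
        if (it == ids_.end())
            it = ids_.emplace(key, next_++).first;
        return "anon$" + std::to_string(it->second);
    }

private:
    std::map<std::string, unsigned> ids_;
    unsigned next_ = 0;
};

// ODR-style scheme: the name is a pure function of the enclosing context
// and the per-context number, so it is the same in every compilation.
inline std::string odrName(const Lambda &l) {
    return l.enclosing + "::lambda" + std::to_string(l.index);
}

// Name every lambda in the given order and return the name of the last one.
inline std::string lazyNameOfLast(const std::vector<Lambda> &emitted) {
    LazyNamer namer;
    std::string last;
    for (const Lambda &l : emitted)
        last = namer.name(l);
    return last;
}
```

If the host names only the kernel-argument lambda while the device first names two lambdas from device functions, `lazyNameOfLast` returns different names for the same lambda, whereas `odrName` returns the same one in both cases.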

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 33238
Build 33237: arc lint + arc unit

Event Timeline

hliao created this revision. Jun 11 2019, 1:30 PM

Herald added a project: Restricted Project. Jun 11 2019, 1:30 PM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B33238: Diff 204153. Jun 11 2019, 1:31 PM

tra added a reviewer: rsmith. Jun 11 2019, 2:50 PM

So, in short, what you're saying is that a lambda type may leak into the mangled name of a __global__ function, and we need to ensure that the mangled name is identical for both host and device, hence the need for consistent naming of lambdas.

If that's the case, shouldn't it be enabled for CUDA/HIP by default? While it's not frequently used ATM, it is something we do want to work correctly all the time. Failing to do so results in weird runtime failures that would be hard for end users to debug.

@rsmith -- are there any downsides to having this enabled all the time?

In D63164#1538968, @tra wrote:

So, in short, what you're saying is that a lambda type may leak into the mangled name of a __global__ function, and we need to ensure that the mangled name is identical for both host and device, hence the need for consistent naming of lambdas.

If that's the case, shouldn't it be enabled for CUDA/HIP by default? While it's not frequently used ATM, it is something we do want to work correctly all the time. Failing to do so results in weird runtime failures that would be hard for end users to debug.

@rsmith -- are there any downsides to having this enabled all the time?

Yes, we should ensure consistent naming by default, but I want to hear more suggestions and comments before enabling the option by default. To be more specific, the option forces the naming of all lambdas to follow the ODR. For non-__device__ lambdas there is no change in code quality, but we do add some overhead to the compiler itself for the additional records, though that should be negligible. A potential alternative is to record the ODR context for parent lambdas and renumber them if an inner lambda turns out to be a __device__ one.
However, I do like the straightforward and extremely simple approach of this patch, which forces all lambda naming to follow the ODR: there is no change in code quality and only a potentially slight frontend (FE) overhead. What are your thoughts?

BTW, I am also working on similar issues with unnamed classes/structs/unions. So far we haven't found any workloads broken by that, so I want to address it in a separate patch.

Ping for comments, as one of our HIP-based workloads is blocked by this issue.

I think this is the wrong way to handle this issue. We need to give lambdas a mangling if they occur in functions for which there can be definitions in multiple translation units. In regular C++ code, that's inline functions and function template specializations, so that's what we're currently checking for. CUDA adds more cases (in particular, __host__ __device__ functions, plus anything else that can be emitted for multiple targets), so we should additionally check for those cases when determining whether to number lambdas. I don't see any need for a flag to control this behavior.
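
The check described in the comment above can be sketched as a single predicate. This is a hypothetical illustration in plain C++, not Clang's implementation: `EnclosingFunction` and `lambdaNeedsManglingNumber` are invented names, and the boolean fields stand in for queries the frontend would make on the real declaration.

```cpp
#include <cassert>

// Hypothetical summary of the function enclosing a lambda; not a Clang type.
struct EnclosingFunction {
    bool isInline = false;                   // inline function
    bool isTemplateSpecialization = false;   // function template specialization
    bool isHostDevice = false;               // __host__ __device__ function
    bool emittedForMultipleTargets = false;  // any other multi-target case
};

// A lambda needs a mangling number when its enclosing function can have
// definitions in more than one translation unit or compilation target.
inline bool lambdaNeedsManglingNumber(const EnclosingFunction &f) {
    const bool regularCxxCase = f.isInline || f.isTemplateSpecialization;
    const bool cudaCase = f.isHostDevice || f.emittedForMultipleTargets;
    return regularCxxCase || cudaCase;
}
```

Under this sketch no command-line flag is involved: the decision depends only on properties of the enclosing function.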

In D63164#1542361, @rsmith wrote:

I think this is the wrong way to handle this issue. We need to give lambdas a mangling if they occur in functions for which there can be definitions in multiple translation units. In regular C++ code, that's inline functions and function template specializations, so that's what we're currently checking for. CUDA adds more cases (in particular, __host__ __device__ functions, plus anything else that can be emitted for multiple targets), so we should additionally check for those cases when determining whether to number lambdas. I don't see any need for a flag to control this behavior.

I agree that this is a temporary solution to the issue. The really tricky part is that, once we find a __device__ lambda, we need to ensure that all of its enclosing scopes are also named following the ODR, as in the case illustrated in the test case. In fact, the problem is not simply that the outer lambda (neither annotated with __device__ nor inside an inline function) is not named following the ODR; the tricky issue is that, so far, we don't maintain a context that lets us add the mangling back once we find that an inner lambda needs to follow the ODR. We would have to add that before we could do this on demand. I was working on it, but it would take more review effort.
That is also the motivation for guarding this behavior with an option.
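
The on-demand alternative mentioned above can be sketched with a toy scope tree. This is a sketch under stated assumptions rather than the approach of this patch: `Scope` and `numberEnclosingScopes` are hypothetical, and Clang's real declaration contexts carry far more state than a parent pointer and two flags.

```cpp
#include <cassert>

// Hypothetical scope node; not a Clang type.
struct Scope {
    Scope *parent = nullptr;      // enclosing scope, or nullptr at the top
    bool isDeviceLambda = false;  // lambda annotated with __device__
    bool numbered = false;        // true once the scope has an ODR number
};

// When a __device__ lambda is found, mark it and every enclosing scope as
// needing an ODR mangling number. Returns how many scopes were newly marked.
inline int numberEnclosingScopes(Scope &inner) {
    if (!inner.isDeviceLambda)
        return 0;
    int marked = 0;
    for (Scope *s = &inner; s != nullptr; s = s->parent) {
        if (!s->numbered) {
            s->numbered = true;
            ++marked;
        }
    }
    return marked;
}
```

For a __device__ lambda nested in an ordinary lambda inside a function such as f1 in the test case, the walk marks all three scopes, which is the bookkeeping the comment says is not yet available.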

PING

Ping again. I'm not sure whether my explanation gave enough detail on why this patch was created.

A revised change has already been committed.

Revision Contents

Path                                          Size
clang/include/clang/Basic/LangOptions.def     1 line
clang/include/clang/Driver/CC1Options.td      2 lines
clang/include/clang/Sema/Sema.h               8 lines
clang/lib/Frontend/CompilerInvocation.cpp     3 lines
clang/lib/Sema/SemaLambda.cpp                 16 lines
clang/test/CodeGenCUDA/unnamed-types.cu       34 lines

Diff 204153

clang/include/clang/Basic/LangOptions.def

 LANGOPT(OpenMPOptimisticCollapse , 1, 0, "Use at most 32 bits to represent the collapsed loop nest counter.")
 LANGOPT(RenderScript , 1, 0, "RenderScript")

 LANGOPT(CUDAIsDevice , 1, 0, "compiling for CUDA device")
 LANGOPT(CUDAAllowVariadicFunctions, 1, 0, "allowing variadic functions in CUDA device code")
 LANGOPT(CUDAHostDeviceConstexpr, 1, 1, "treating unattributed constexpr functions as __host__ __device__")
 LANGOPT(CUDADeviceApproxTranscendentals, 1, 0, "using approximate transcendental functions")
 LANGOPT(GPURelocatableDeviceCode, 1, 0, "generate relocatable device code")
+LANGOPT(CUDAForceLambdaODR, 1, 0, "force lambda naming following one definition rule (ODR)")

 LANGOPT(SYCLIsDevice , 1, 0, "Generate code for SYCL device")

 LANGOPT(SizedDeallocation , 1, 0, "sized deallocation")
 LANGOPT(AlignedAllocation , 1, 0, "aligned allocation")
 LANGOPT(AlignedAllocationUnavailable, 1, 0, "aligned allocation functions are unavailable")
 LANGOPT(NewAlignOverride , 32, 0, "maximum alignment guaranteed by '::operator new(size_t)'")
 LANGOPT(ConceptsTS , 1, 0, "enable C++ Extensions for Concepts")

clang/include/clang/Driver/CC1Options.td

 def fcuda_is_device : Flag<["-"], "fcuda-is-device">,
   HelpText<"Generate code for CUDA device">;
 def fcuda_include_gpubinary : Separate<["-"], "fcuda-include-gpubinary">,
   HelpText<"Incorporate CUDA device-side binary into host object file.">;
 def fcuda_allow_variadic_functions : Flag<["-"], "fcuda-allow-variadic-functions">,
   HelpText<"Allow variadic functions in CUDA device code.">;
 def fno_cuda_host_device_constexpr : Flag<["-"], "fno-cuda-host-device-constexpr">,
   HelpText<"Don't treat unattributed constexpr functions as __host__ __device__.">;
+def fcuda_force_lambda_odr : Flag<["-"], "fcuda-force-lambda-odr">,
+  HelpText<"Force lambda naming following one definition rule (ODR).">;

 //===----------------------------------------------------------------------===//
 // OpenMP Options
 //===----------------------------------------------------------------------===//

 def fopenmp_is_device : Flag<["-"], "fopenmp-is-device">,
   HelpText<"Generate code only for an OpenMP target device.">;
 def fopenmp_host_ir_file_path : Separate<["-"], "fopenmp-host-ir-file-path">,

clang/include/clang/Sema/Sema.h

@@ public:

   /// Compute the mangling number context for a lambda expression or
   /// block literal.
   ///
   /// \param DC - The DeclContext containing the lambda expression or
   /// block literal.
   /// \param[out] ManglingContextDecl - Returns the ManglingContextDecl
   /// associated with the context, if relevant.
-  MangleNumberingContext *getCurrentMangleNumberContext(
-    const DeclContext *DC,
-    Decl *&ManglingContextDecl);
+  MangleNumberingContext *
+  getCurrentMangleNumberContext(const DeclContext *DC,
+                                Decl *&ManglingContextDecl,
+                                bool SkpNoODRChk = false);

   /// SpecialMemberOverloadResult - The overloading result for a special member
   /// function.
   ///
   /// This is basically a wrapper around PointerIntPair. The lowest bits of the
   /// integer are used to determine whether overload resolution succeeded.
   class SpecialMemberOverloadResult {
   public:

clang/lib/Frontend/CompilerInvocation.cpp

@@ if (Args.hasArg(OPT_fcuda_is_device))
     Opts.CUDAIsDevice = 1;

   if (Args.hasArg(OPT_fcuda_allow_variadic_functions))
     Opts.CUDAAllowVariadicFunctions = 1;

   if (Args.hasArg(OPT_fno_cuda_host_device_constexpr))
     Opts.CUDAHostDeviceConstexpr = 0;

+  if (Args.hasArg(OPT_fcuda_force_lambda_odr))
+    Opts.CUDAForceLambdaODR = 1;
+
   if (Opts.CUDAIsDevice && Args.hasArg(OPT_fcuda_approx_transcendentals))
     Opts.CUDADeviceApproxTranscendentals = 1;

   Opts.GPURelocatableDeviceCode = Args.hasArg(OPT_fgpu_rdc);

   if (Opts.ObjC) {
     if (Arg *arg = Args.getLastArg(OPT_fobjc_runtime_EQ)) {
       StringRef value = arg->getValue();

clang/lib/Sema/SemaLambda.cpp

@@ if (const FunctionDecl *FD = dyn_cast<FunctionDecl>(DC))
         return true;

     DC = DC->getLexicalParent();
   }

   return false;
 }

-MangleNumberingContext *
-Sema::getCurrentMangleNumberContext(const DeclContext *DC,
-                                    Decl *&ManglingContextDecl) {
+MangleNumberingContext *Sema::getCurrentMangleNumberContext(
+    const DeclContext *DC, Decl *&ManglingContextDecl, bool SkpNoODRChk) {
   // Compute the context for allocating mangling numbers in the current
   // expression, if the ABI requires them.
   ManglingContextDecl = ExprEvalContexts.back().ManglingContextDecl;

   enum ContextKind {
     Normal,
     DefaultArgument,
     DataMember,
@@ MangleNumberingContext *Sema::getCurrentMangleNumberContext(
   // In the following contexts [...] the one-definition rule requires closure
   // types in different translation units to "correspond":
   bool IsInNonspecializedTemplate =
       inTemplateInstantiation() || CurContext->isDependentContext();
   switch (Kind) {
   case Normal: {
     // -- the bodies of non-exported nonspecialized template functions
     // -- the bodies of inline functions
-    if ((IsInNonspecializedTemplate &&
+    if (SkpNoODRChk ||
+        (IsInNonspecializedTemplate &&
          !(ManglingContextDecl && isa<ParmVarDecl>(ManglingContextDecl))) ||
         isInInlineFunction(CurContext)) {
       ManglingContextDecl = nullptr;
       while (auto *CD = dyn_cast<CapturedDecl>(DC))
         DC = CD->getParent();
       return &Context.getManglingNumberContext(DC);
     }

     ManglingContextDecl = nullptr;
     return nullptr;
   }

   case StaticDataMember:
     // -- the initializers of nonspecialized static members of template classes
-    if (!IsInNonspecializedTemplate) {
+    if (!SkpNoODRChk && !IsInNonspecializedTemplate) {
       ManglingContextDecl = nullptr;
       return nullptr;
     }
     // Fall through to get the current context.
     LLVM_FALLTHROUGH;

   case DataMember:
     // -- the in-class initializers of class members
@@ if (!Params.empty()) {
     for (auto P : Method->parameters())
       P->setOwningFunction(Method);
   }

   if (Mangling) {
     Class->setLambdaMangling(Mangling->first, Mangling->second);
   } else {
     Decl *ManglingContextDecl;
-    if (MangleNumberingContext *MCtx =
-            getCurrentMangleNumberContext(Class->getDeclContext(),
-                                          ManglingContextDecl)) {
+    if (MangleNumberingContext *MCtx = getCurrentMangleNumberContext(
+            Class->getDeclContext(), ManglingContextDecl,
+            getLangOpts().CUDAForceLambdaODR)) {
       unsigned ManglingNumber = MCtx->getManglingNumber(Method);
       Class->setLambdaMangling(ManglingNumber, ManglingContextDecl);
     }
   }

   return Method;
 }

clang/test/CodeGenCUDA/unnamed-types.cu

This file was added.

// RUN: %clang_cc1 -std=c++11 -x hip -triple x86_64-linux-gnu -fcuda-force-lambda-odr -emit-llvm %s -o - | FileCheck %s --check-prefix=HOST
// RUN: %clang_cc1 -std=c++11 -x hip -triple amdgcn-amd-amdhsa -fcuda-force-lambda-odr -fcuda-is-device -emit-llvm %s -o - | FileCheck %s --check-prefix=DEVICE

#include "Inputs/cuda.h"

// HOST: @0 = private unnamed_addr constant [43 x i8] c"_Z2k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_\00", align 1

__device__ float d0(float x) {
  return [](float x) { return x + 2.f; }(x);
}

__device__ float d1(float x) {
  return [](float x) { return x * 2.f; }(x);
}

// DEVICE: amdgpu_kernel void @_Z2k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_(
template <typename F>
__global__ void k0(float *p, F f) {
  p[0] = f(p[0]) + d0(p[1]) + d1(p[2]);
}

void f0(float *p) {
  [](float *p) {
    *p = 1.f;
  }(p);
}

void f1(float *p) {
  [](float *p) {
    k0<<<1,1>>>(p, [] __device__ (float x) { return x + 1.f; });
  }(p);
}
// HOST: @__hip_register_globals
// HOST: __hipRegisterFunction{{.*}}@_Z2k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_{{.*}}@0
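
To see how the lambda types end up in the kernel name that both check prefixes expect, the mangled string can be demangled with the Itanium C++ ABI helper `abi::__cxa_demangle` from `<cxxabi.h>`. The sketch below is plain C++, not part of the patch; `demangle` and `kKernelName` are names introduced here for illustration, and the exact wording of the demangled text depends on the C++ runtime in use.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of an Itanium-mangled name, or an empty
// string if the runtime's demangler rejects the input.
inline std::string demangle(const char *mangled) {
    int status = 0;
    char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string result = (status == 0 && out != nullptr) ? out : "";
    std::free(out);  // the buffer is malloc-allocated; free(nullptr) is a no-op
    return result;
}

// Kernel name taken from the HOST and DEVICE check lines of the test.
static const char kKernelName[] = "_Z2k0IZZ2f1PfENKUlS0_E_clES0_EUlfE_EvS0_T_";
```

Calling `demangle(kKernelName)` yields an instantiation of `k0` whose template argument is the inner lambda, qualified by the outer lambda's `operator()` inside `f1(float*)`. That nesting is why the outer, non-device lambda also needs a stable number.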