This is an archive of the discontinued LLVM Phabricator instance.

Differential D47070

[CUDA] Upgrade linked bitcode to enable inlining
Abandoned · Public

Authored by Hahnfeld on May 18 2018, 8:01 AM.

Details

Reviewers

tra

Summary

Revision rC329829 added the architecture to "target-features". This
prevents inlining of previously generated bitcode because the
feature sets don't match. Thus duplicate the information from
"target-cpu" to avoid writing special cases in the analysis.

I'm not sure if that will save us in the long term because inlining
will break again when we add new features. Additionally, using later
CUDA versions might raise the PTX version which is also a feature...
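
To make the mismatch concrete, here is a minimal standalone sketch of an exact-match compatibility rule, which is how I understand the generic inlining check to treat these two attributes. `FnAttrs`, `getAttr` and `areInlineCompatibleExact` are hypothetical names for illustration only; this models the behaviour with plain strings and is not LLVM code.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the string attributes attached to a function.
using FnAttrs = std::map<std::string, std::string>;

// Returns the attribute value, or an empty string if the attribute is absent.
static std::string getAttr(const FnAttrs &Attrs, const std::string &Key) {
  auto It = Attrs.find(Key);
  return It == Attrs.end() ? std::string() : It->second;
}

// Exact-match rule (assumed model): inlining is allowed only when both
// "target-cpu" and "target-features" are identical in caller and callee.
bool areInlineCompatibleExact(const FnAttrs &Caller, const FnAttrs &Callee) {
  return getAttr(Caller, "target-cpu") == getAttr(Callee, "target-cpu") &&
         getAttr(Caller, "target-features") ==
             getAttr(Callee, "target-features");
}
```

Under this rule, a caller carrying `"+ptx42,+sm_35"` and a callee from older bitcode carrying only `"+ptx42"` compare unequal, so the callee is not inlined even though both target sm_35.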

Event Timeline

Hahnfeld created this revision. May 18 2018, 8:01 AM

Herald added a subscriber: cfe-commits. May 18 2018, 8:01 AM

Hahnfeld mentioned this in D46842: [OpenMP][libomptarget] Make bitcode library building depend on clang and llvm-linker being available. May 18 2018, 8:28 AM

I defer to Art on this one.

This was not intended. :-( I was unaware that GetCPUAndFeaturesAttributes() would add any feature that looks like a valid CPU name to the target-cpu attribute.
All I needed is to make builtins available or not. Setting them as function attributes is not what we need here.

I'm not sure what's the best way to deal with this. On one hand I do need to make some builtins available depending on combination of GPU arch and PTX version. The only way to do it is via the features. On the other hand, the features appear to propagate to LLVM IR, which is something I don't need or want.

One way would be to introduce some sort of feature blacklist which would prevent them from being converted to function attributes.
Or, perhaps, we can change TARGET_BUILTIN or create something similar which would allow availability of builtins w/o relying on features.

As a short-term fix we can disable feature-to-function attribute propagation for NVPTX until we fix it.

@echristo -- any other suggestions?

I think that's intended because the generated code might use instructions based on that feature. If we want to ignore that, we could override TargetTransformInfo::areInlineCompatible for NVPTX to only compare target-cpu - but I'm not sure if that is wise...

Looks like this was added as a "temporary solution" in D8984. Meanwhile the attribute whitelist was merged half a year later (D7802), so maybe we can just get rid of comparing target-cpu and target-features entirely?

In D47070#1105533, @Hahnfeld wrote:

> Looks like this was added as a "temporary solution" in D8984. Meanwhile the attribute whitelist was merged half a year later (D7802), so maybe we can just get rid of comparing target-cpu and target-features entirely?

You don't want to get rid of the comparison because you might have specific code that can't be inlined from one into another.

In D47070#1104846, @tra wrote:

> This was not intended. :-( I was unaware that GetCPUAndFeaturesAttributes() would add any feature that looks like a valid CPU name to the target-cpu attribute.
> All I needed is to make builtins available or not. Setting them as function attributes is not what we need here.
>
> I'm not sure what's the best way to deal with this. On one hand I do need to make some builtins available depending on combination of GPU arch and PTX version. The only way to do it is via the features. On the other hand, the features appear to propagate to LLVM IR, which is something I don't need or want.
>
> One way would be to introduce some sort of feature blacklist which would prevent them from being converted to function attributes.
> Or, perhaps, we can change TARGET_BUILTIN or create something similar which would allow availability of builtins w/o relying on features.
>
> As a short-term fix we can disable feature-to-function attribute propagation for NVPTX until we fix it.
>
> @echristo -- any other suggestions?

This is some of what I was talking about when I was mentioning how function attributes and the targets work. Ideally you'll have a compatible set of features and it won't really cause an issue. The idea is that if you're compiling for a minimum ptx feature of X, then any "compatible" set of ptx should be able to inline into your code. I think you do want the features to propagate in general, just specific use cases may not care one way or another - that said, for those use cases you're probably just compiling everything with the same feature anyhow.

I guess, ultimately, I'm not seeing what the concern here is for how features are working or not working for the target so it's harder to help. What is the problem you're running into, or can you try a different way of explaining it to me? :)
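
One way to read the "compatible set of features" idea is as a subset rule: a callee may be inlined when every feature it enables is also enabled in the caller. The sketch below is only an illustration under that assumption. `enabledFeatures` and `calleeFeaturesAreSubset` are hypothetical helpers, and the sketch does not model any ordering between PTX versions (for example, it does not know that a newer PTX version can accept older code).

```cpp
#include <set>
#include <sstream>
#include <string>

// Splits a comma-separated feature string such as "+ptx42,+sm_35" into the
// set of enabled feature names. Entries prefixed with '-' are ignored.
std::set<std::string> enabledFeatures(const std::string &FeatureList) {
  std::set<std::string> Result;
  std::stringstream Stream(FeatureList);
  std::string Entry;
  while (std::getline(Stream, Entry, ',')) {
    if (Entry.size() > 1 && Entry[0] == '+')
      Result.insert(Entry.substr(1));
  }
  return Result;
}

// Subset rule (assumed policy): the callee is inline-compatible if every
// feature it enables is also enabled in the caller.
bool calleeFeaturesAreSubset(const std::string &CallerFeatures,
                             const std::string &CalleeFeatures) {
  const std::set<std::string> Caller = enabledFeatures(CallerFeatures);
  for (const std::string &Feature : enabledFeatures(CalleeFeatures)) {
    if (Caller.count(Feature) == 0)
      return false;
  }
  return true;
}
```

With this policy a callee that has no features, or only `"+ptx42"`, could be inlined into a caller with `"+ptx42,+sm_35"`, while the reverse direction would be rejected.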

In D47070#1106018, @echristo wrote:

>> As a short-term fix we can disable feature-to-function attribute propagation for NVPTX until we fix it.
>>
>> @echristo -- any other suggestions?
>
> This is some of what I was talking about when I was mentioning how function attributes and the targets work. Ideally you'll have a compatible set of features and it won't really cause an issue. The idea is that if you're compiling for a minimum ptx feature of X, then any "compatible" set of ptx should be able to inline into your code. I think you do want the features to propagate in general, just specific use cases may not care one way or another - that said, for those use cases you're probably just compiling everything with the same feature anyhow.

The thing is that with NVPTX you cannot have incompatible functions in the PTX, period. PTXAS will just throw syntax errors at you. In that regard PTX is very different from Intel, where the same binary can contain different functions with code for different x86 variants. For PTX, sm_50 and sm_60 mean entirely different GPUs with entirely different instruction sets/encoding. The PTX version is closer to a different language dialect. You cannot use anything from PTX 4.0 if your file says it's PTX 3.0. It's sort of like you can't use C++17 features when you're compiling in C++98 mode. Bottom line is that features and target-cpu do not make much sense for NVPTX. Everything we generate in a TU must satisfy the minimum PTX version and the minimum GPU variant, and it all will be compiled for and run on only one specific GPU. There's no mixing and matching.

The question is -- what's the best way to make things work as they were before I broke them?
@Hahnfeld's idea of ignoring features and target-cpu would get us there, but that may be a never-ending source of surprises if/when something else decides to pay attention to those attributes.
I think a better way to deal with the problem would be to
a) figure out how to make builtins available/or not on clang side, and
b) make target-cpu and target-features attributes explicitly unsupported on NVPTX as we can not provide the functionality those attributes imply.

> I guess, ultimately, I'm not seeing what the concern here is for how features are working or not working for the target so it's harder to help. What is the problem you're running into, or can you try a different way of explaining it to me? :)

Here's my understanding of what happens:
We've started adding target-features and target-cpu to everything clang generates.
We also need to link with libdevice (or IR generated by clang) which has functions w/o those attributes. Or we need to link with IR produced by clang which used a different CUDA SDK and thus a different PTX version in target-features.
Due to the attribute mismatch we are failing to inline some of the functions and that hurts performance.

In D47070#1108803, @tra wrote:

> Here's my understanding of what happens:
> We've started adding target-features and target-cpu to everything clang generates.
> We also need to link with libdevice (or IR generated by clang) which has functions w/o those attributes. Or we need to link with IR produced by clang which used a different CUDA SDK and thus a different PTX version in target-features.
> Due to the attribute mismatch we are failing to inline some of the functions and that hurts performance.

In the case of OpenMP we are linking runtime functions from a bitcode library so that Clang can inline them. This dramatically improves performance, so I'm really interested in making this work again with libraries compiled by older versions of Clang.

Is there a viable path forward? Should I put up a patch that just ignores all target-features in LLVM?

IMO overriding TargetTransformInfo::areInlineCompatible to always return true on NVPTX is what we want to do instead of upgrading everything else.
AFAICT, on NVPTX there's no reason to prevent inlining due to those attributes -- we'll never generate code for, nor execute it on, any GPU other than the one we're currently compiling for.

This should get you going until I figure out how to have target-specific builtins without sticking target-cpu and target-features attributes on everything.
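
The shape of that suggestion can be sketched without LLVM as a virtual hook with a target-specific override. Everything here is a simplified model: `FunctionInfo`, `GenericTargetInfo` and `NVPTXLikeTargetInfo` are hypothetical types, and only the hook name `areInlineCompatible` is taken from the discussion above. It is not the code that landed in D47691.

```cpp
#include <string>

// Hypothetical, simplified stand-in for a function carrying target attributes.
struct FunctionInfo {
  std::string TargetCpu;
  std::string TargetFeatures;
};

// Models a generic hook that requires both attributes to match exactly.
struct GenericTargetInfo {
  virtual ~GenericTargetInfo() = default;
  virtual bool areInlineCompatible(const FunctionInfo &Caller,
                                   const FunctionInfo &Callee) const {
    return Caller.TargetCpu == Callee.TargetCpu &&
           Caller.TargetFeatures == Callee.TargetFeatures;
  }
};

// Models the proposed NVPTX behaviour: everything in a module is compiled for
// one GPU, so the attributes are never a reason to refuse inlining.
struct NVPTXLikeTargetInfo : GenericTargetInfo {
  bool areInlineCompatible(const FunctionInfo &,
                           const FunctionInfo &) const override {
    return true;
  }
};
```

With the override in place, a kernel carrying `"+ptx42,+sm_35"` and a libdevice-style function with no attributes at all are treated as compatible, which is the behaviour the OpenMP bitcode library relies on.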

Superseded by D47691

Revision Contents

Path                                        Size
lib/CodeGen/CGCall.cpp                      33 lines
test/CodeGenCUDA/Inputs/device-code-2.ll    5 lines
test/CodeGenCUDA/Inputs/device-code.ll      4 lines
test/CodeGenCUDA/link-device-bitcode.cu     9 lines

Diff 147510

lib/CodeGen/CGCall.cpp

[1,784 lines not shown]
   if (getLangOpts().CUDA && getLangOpts().CUDAIsDevice) {
     FuncAttrs.addAttribute(llvm::Attribute::NoUnwind);

     // Respect -fcuda-flush-denormals-to-zero.
     if (getLangOpts().CUDADeviceFlushDenormalsToZero)
       FuncAttrs.addAttribute("nvptx-f32ftz", "true");
   }
 }

+static bool hasTargetFeature(llvm::StringRef FeatureList,
+                             llvm::StringRef Feature) {
+  StringRef Rest = FeatureList;
+  while (!Rest.empty()) {
+    auto Split = Rest.split(',');
+    if (Split.first == Feature)
+      return true;
+    Rest = Split.second;
+  }
+
+  return false;
+}
+
 void CodeGenModule::AddDefaultFnAttrs(llvm::Function &F) {
   llvm::AttrBuilder FuncAttrs;
   ConstructDefaultFnAttrList(F.getName(),
                              F.hasFnAttribute(llvm::Attribute::OptimizeNone),
                              /* AttrOnCallsite = */ false, FuncAttrs);
   F.addAttributes(llvm::AttributeList::FunctionIndex, FuncAttrs);
+
+  if (getTriple().isNVPTX()) {
+    // Revision 329829 added the architecture as a "target-feature". Duplicate
+    // this information from "target-cpu" to maintain the ability to inline
+    // functions from bitcode files compiled with older versions of LLVM/Clang.
+    auto TargetCpu = F.getFnAttribute("target-cpu");
+    if (TargetCpu.isStringAttribute()) {
+      llvm::StringRef CpuAttr = TargetCpu.getValueAsString();
+
+      auto TargetFeatures = F.getFnAttribute("target-features");
+      if (TargetFeatures.isStringAttribute()) {
+        llvm::StringRef FeatureList = TargetFeatures.getValueAsString();
+        if (!hasTargetFeature(FeatureList, CpuAttr.str())) {
+          F.addFnAttr("target-features", (FeatureList + ",+" + CpuAttr).str());
+        }
+      } else {
+        F.addFnAttr("target-features", llvm::Twine("+", CpuAttr).str());
+      }
+    }
+  }
 }

 void CodeGenModule::ConstructAttributeList(
     StringRef Name, const CGFunctionInfo &FI, CGCalleeInfo CalleeInfo,
     llvm::AttributeList &AttrList, unsigned &CallingConv, bool AttrOnCallSite) {
   llvm::AttrBuilder FuncAttrs;
   llvm::AttrBuilder RetAttrs;

[2,696 lines not shown]
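
For readers who want to experiment with the upgrade logic outside of Clang, here is a standalone sketch that mirrors it with `std::string`. It is not the patch itself: `upgradeTargetFeatures` is a hypothetical helper, and, as an assumption on my part, the lookup compares against the `+`-prefixed entry, because entries in a `"target-features"` string carry a `+` or `-` prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if Feature appears as a complete comma-separated entry of
// FeatureList (for example "+sm_35" in "+ptx42,+sm_35").
bool hasTargetFeature(const std::string &FeatureList,
                      const std::string &Feature) {
  assert(!Feature.empty() && "feature name must not be empty");
  std::size_t Begin = 0;
  while (Begin <= FeatureList.size()) {
    std::size_t Comma = FeatureList.find(',', Begin);
    std::size_t End = (Comma == std::string::npos) ? FeatureList.size() : Comma;
    if (FeatureList.compare(Begin, End - Begin, Feature) == 0)
      return true;
    if (Comma == std::string::npos)
      break;
    Begin = Comma + 1;
  }
  return false;
}

// Appends "+<cpu>" to the feature list unless that entry is already present.
std::string upgradeTargetFeatures(const std::string &FeatureList,
                                  const std::string &Cpu) {
  const std::string Prefixed = "+" + Cpu;
  if (hasTargetFeature(FeatureList, Prefixed))
    return FeatureList;
  if (FeatureList.empty())
    return Prefixed;
  return FeatureList + "," + Prefixed;
}
```

For the two test inputs in this revision, the sketch yields `"+ptx42,+sm_35"` for a function that already has `"+ptx42"` and `"+sm_35"` for a function that only has `"target-cpu"="sm_35"`, matching the strings the updated CHECK lines expect.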

test/CodeGenCUDA/Inputs/device-code-2.ll

 ; Simple bit of IR to mimic CUDA's libdevice.

 target triple = "nvptx-unknown-cuda"

-define double @__nv_sin(double %a) {
+define double @__nv_sin(double %a) #0 {
   ret double 1.0
 }

-define double @__nv_exp(double %a) {
+define double @__nv_exp(double %a) #0 {
   ret double 3.0
 }

 define double @__unused(double %a) {
   ret double 2.0
 }
+
+attributes #0 = { "target-cpu"="sm_35" }

test/CodeGenCUDA/Inputs/device-code.ll

[10 lines not shown]
 define void @unused_subfunc(float %a) {
   ret void
 }

 define void @used_subfunc(float %a) {
   ret void
 }

-define float @_Z17device_mul_or_addff(float %a, float %b) {
+define float @_Z17device_mul_or_addff(float %a, float %b) #0 {
   %reflect = call i32 @__nvvm_reflect(i8* addrspacecast (i8 addrspace(1)* getelementptr inbounds ([8 x i8], [8 x i8] addrspace(1)* @"$str", i32 0, i32 0) to i8*))
   %cmp = icmp ne i32 %reflect, 0
   br i1 %cmp, label %use_mul, label %use_add

 use_mul:
   %ret1 = fmul float %a, %b
   br label %exit

 use_add:
   %ret2 = fadd float %a, %b
   br label %exit

 exit:
   %ret = phi float [%ret1, %use_mul], [%ret2, %use_add]

   call void @used_subfunc(float %ret)

   ret float %ret
 }
+
+attributes #0 = { "target-cpu"="sm_35" "target-features"="+ptx42" }

test/CodeGenCUDA/link-device-bitcode.cu

[50 lines not shown]
 __global__ __attribute__((used)) void kernel(float *out, float *in) {
   *out = device_mul_or_add(in[0], in[1]);
   *out += __nv_exp(__nv_sin(*out));
   should_not_be_internalized(out);
 }

 // Make sure device_mul_or_add() is present in IR, is internal and
 // calls __nvvm_reflect().
 // CHECK-IR-LABEL: define internal float @_Z17device_mul_or_addff(
+// CHECK-IR-SAME: [[MUL_OR_ADD:#[0-9]+]] {
 // CHECK-IR-NLD-LABEL: define float @_Z17device_mul_or_addff(
 // CHECK-IR: call i32 @__nvvm_reflect
 // CHECK-IR: ret float

 // Make sure we've linked in and internalized only needed functions
 // from the second bitcode file.
 // CHECK-IR-2-LABEL: define internal double @__nv_sin
+// CHECK-IR-2-SAME: [[IR2ATTR:#[0-9]+]] {
 // CHECK-IR-2-LABEL: define internal double @__nv_exp
+// CHECK-IR-2-SAME: [[IR2ATTR]] {
 // CHECK-IR-2-NOT: double @__unused

+// CHECK-IR: attributes [[MUL_OR_ADD]] = {
+// CHECK-IR-SAME: "target-features"="+ptx42,+sm_35"
+
+// CHECK-IR-2: attributes [[IR2ATTR]] = {
+// CHECK-IR-2-SAME: "target-features"="+sm_35"
+
 // Verify that NVVMReflect pass is among the passes run by NVPTX back-end.
 // CHECK-REFLECT: Replace occurrences of __nvvm_reflect() calls with 0/1