This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Unforce minimum alignment of 4 for byval arguments of device-side functions.
Accepted · Public

Authored by pavelkopyl on Jan 13 2023, 4:48 PM.

Details

Reviewers

Summary

A minimum alignment of 4 for byval arguments was forced to work around
a bug in old versions of ptxas. Details: https://reviews.llvm.org/D22428.
Recent ptxas versions (> 9.0) do not seem to have this bug, so the alignment
requirement is relaxed. To force a minimum alignment of 4 again, use the
'-nvptx-force-min-byval-param-align' option.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,210 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vloxseg.c
	60,230 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/non-overloaded::vluxseg.c
	60,240 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vloxseg.c
	60,260 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-autogenerated/policy/overloaded::vluxseg.c

Event Timeline

pavelkopyl created this revision. Jan 13 2023, 4:48 PM

Herald added a project: Restricted Project. Jan 13 2023, 4:48 PM

Herald added subscribers: mattd, gchakrabarti, asavonic, hiraditya.

pavelkopyl requested review of this revision. Jan 13 2023, 4:48 PM

Herald added a project: Restricted Project. Jan 13 2023, 4:48 PM

Herald added subscribers: llvm-commits, jholewinski.

pavelkopyl added a reviewer: tra. Jan 13 2023, 4:49 PM

Do we need to do it? I mean -- are there any observable differences between enforced alignment of 4 and not? For register-passed parameters it will make no difference, and for the parameters passed via constant/local memory, it may or may not make a difference.

I guess at the extreme it would impact how many 1-byte parameters we may have, as the total size of param address space has an upper bound (for kernels it's 4K: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#parameter-buffer-layout). If that's the case, then it probably does not matter whether we can pass 4K arguments or only 1K.

In other words, I do not see a practical need to introduce an option which may not have any practical use.
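The 4K versus 1K estimate above can be checked with a small calculation. The sketch below is a simplified model, not NVPTX or ptxas code: it assumes parameters are laid out consecutively, each at the next offset that satisfies its alignment, inside a buffer of the 4K size quoted for kernels. The names `alignTo` and `maxParams` are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Round Offset up to the next multiple of Align.
// Align is assumed to be a nonzero power of two.
inline std::size_t alignTo(std::size_t Offset, std::size_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "Align must be a power of two");
  return (Offset + Align - 1) & ~(Align - 1);
}

// Count how many parameters of ParamSize bytes fit into a buffer of
// BufferSize bytes when each one starts at a multiple of ParamAlign.
// This is an assumed packing model, used only to reproduce the estimate.
inline std::size_t maxParams(std::size_t BufferSize, std::size_t ParamSize,
                             std::size_t ParamAlign) {
  std::size_t Offset = 0;
  std::size_t Count = 0;
  while (alignTo(Offset, ParamAlign) + ParamSize <= BufferSize) {
    Offset = alignTo(Offset, ParamAlign) + ParamSize;
    ++Count;
  }
  return Count;
}
```

Under this model, `maxParams(4096, 1, 1)` gives 4096 one-byte parameters and `maxParams(4096, 1, 4)` gives 1024, which matches the "4K arguments or only 1K" remark.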

Harbormaster completed remote builds in B207751: Diff 489153. Jan 13 2023, 5:48 PM

pavelkopyl edited the summary of this revision. Jan 25 2023, 4:56 PM

In D141737#4053125, @tra wrote:

Do we need to do it? I mean -- are there any observable differences between enforced alignment of 4 and not? For register-passed parameters it will make no difference, and for the parameters passed via constant/local memory, it may or may not make a difference.

I guess at the extreme it would impact how many 1-byte parameters we may have, as the total size of param address space has an upper bound (for kernels it's 4K: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#parameter-buffer-layout). If that's the case, then it probably does not matter whether we can pass 4K arguments or only 1K.

In other words, I do not see a practical need to introduce an option which may not have any practical use.

Sorry for the late response.
OK, I'll remove the option. I just wanted to clarify: do you agree in general with removing this workaround?

I guess there is one more reason to go back to the default alignment. It seems we break the ABI with the alignment forced to 4. This may be an issue for mixed clang + nvcc builds.
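The ABI concern can be illustrated with a host-side sketch. It is not taken from the patch: `Half2` is a hypothetical stand-in for the `{ half, half }` aggregate used in the tests, with `std::uint16_t` substituting for a 16-bit half, and `declaredParamAlign` and `abiCompatible` are invented names. The assumption is that one side of a call declares the parameter with its natural alignment while the other side applies the forced 4-byte minimum.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a { half, half } aggregate.
struct Half2 {
  std::uint16_t X;
  std::uint16_t Y;
};

// Alignment a producer would declare for a byval parameter of type T:
// the natural alignment, optionally raised to the 4-byte minimum that
// the old workaround enforced.
template <typename T>
constexpr std::size_t declaredParamAlign(bool ForceMin4) {
  return (ForceMin4 && alignof(T) < 4) ? std::size_t{4} : alignof(T);
}

// True when caller and callee declare the same alignment for the parameter.
template <typename T>
constexpr bool abiCompatible(bool CallerForcesMin4, bool CalleeForcesMin4) {
  return declaredParamAlign<T>(CallerForcesMin4) ==
         declaredParamAlign<T>(CalleeForcesMin4);
}

static_assert(sizeof(Half2) == 4, "two 16-bit fields, no padding expected");
```

With these definitions, `abiCompatible<Half2>(true, false)` is false (4 versus 2), while `abiCompatible<float>(true, false)` is true, since types whose natural alignment is already 4 or more are unaffected by the workaround.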

Hi @tra, does the ABI-related reason make sense to you, or is it more reasonable to just drop this change?

-Rebase

Harbormaster completed remote builds in B227692: Diff 516356. Apr 24 2023, 6:03 AM

In D141737#4292014, @pavelkopyl wrote:

Hi @tra, does the ABI-related reason make sense to you, or is it more reasonable to just drop this change?

Yes. Removing this workaround makes sense.

This revision is now accepted and ready to land. Apr 24 2023, 11:22 AM

Revision Contents

	Path	Size
	llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp	19 lines
	llvm/test/CodeGen/NVPTX/call_bitcast_byval.ll	14 lines
	llvm/test/CodeGen/NVPTX/param-align.ll	21 lines

Diff 489153

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

[83 lines not shown]
     cl::desc("NVPTX Specifies: 0 use div.approx, 1 use div.full, 2 use"
              " IEEE Compliant F32 div.rnd if available."),
     cl::init(2));

 static cl::opt<bool> UsePrecSqrtF32(
     "nvptx-prec-sqrtf32", cl::Hidden,
     cl::desc("NVPTX Specific: 0 use sqrt.approx, 1 use sqrt.rn."),
     cl::init(true));

+static cl::opt<bool> ForceMinByValParamAlign(
+    "nvptx-force-min-byval-param-align", cl::Hidden,
+    cl::desc("NVPTX Specific: force 4-byte minimal alignment for byval"
+             " params of device functions."),
+    cl::init(false));
+
 int NVPTXTargetLowering::getDivF32Level() const {
   if (UsePrecDivF32.getNumOccurrences() > 0) {
     // If nvptx-prec-div32=N is used on the command-line, always honor it
     return UsePrecDivF32;
   } else {
     // Otherwise, use div.approx if fast math is enabled
     if (getTargetMachine().Options.UnsafeFPMath)
       return 0;
[4,403 lines not shown]
 Align NVPTXTargetLowering::getFunctionByValParamAlign(
     const Function *F, Type *ArgTy, Align InitialAlign,
     const DataLayout &DL) const {
   Align ArgAlign = InitialAlign;
   // Try to increase alignment to enhance vectorization options.
   if (F)
     ArgAlign = std::max(ArgAlign, getFunctionParamOptimizedAlign(F, ArgTy, DL));

-  // Work around a bug in ptxas. When PTX code takes address of
+  // Old ptx versions have a bug. When PTX code takes address of
   // byval parameter with alignment < 4, ptxas generates code to
   // spill argument into memory. Alas on sm_50+ ptxas generates
   // SASS code that fails with misaligned access. To work around
   // the problem, make sure that we align byval parameters by at
-  // least 4.
-  // TODO: this will need to be undone when we get to support multi-TU
-  // device-side compilation as it breaks ABI compatibility with nvcc.
-  // Hopefully ptxas bug is fixed by then.
-  ArgAlign = std::max(ArgAlign, Align(4));
+  // least 4. This bug seems to be fixed at least starting from
+  // ptxas > 9.0.
+  // TODO: remove this after verifying the bug is not reproduced
+  // on non-deprecated ptxas versions.
+  if (ForceMinByValParamAlign)
+    ArgAlign = std::max(ArgAlign, Align(4));

   return ArgAlign;
 }

 /// isLegalAddressingMode - Return true if the addressing mode represented
 /// by AM is legal for this target, for a load/store of the specified type.
 /// Used to guide target specific optimizations, like loop strength reduction
 /// (LoopStrengthReduce.cpp) and memory optimization for address mode
[917 lines not shown]
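The selection logic changed in getFunctionByValParamAlign can be modelled outside LLVM as a small function. The sketch below is a standalone approximation, not the patched code: plain integers stand in for llvm::Align, the `OptimizedAlign` argument stands in for whatever getFunctionParamOptimizedAlign would return (pass 1 when the function is unknown), and `ForceMin4` models the `nvptx-force-min-byval-param-align` flag from this diff. The name `byValParamAlign` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Returns true if A is a nonzero power of two, as alignments are expected to be.
inline bool isPowerOfTwo(std::uint64_t A) { return A != 0 && (A & (A - 1)) == 0; }

// Standalone model of the alignment choice for a byval parameter.
//   InitialAlign   - alignment requested by the IR / data layout
//   OptimizedAlign - stand-in for the result of getFunctionParamOptimizedAlign
//   ForceMin4      - models the -nvptx-force-min-byval-param-align flag
inline std::uint64_t byValParamAlign(std::uint64_t InitialAlign,
                                     std::uint64_t OptimizedAlign,
                                     bool ForceMin4) {
  assert(isPowerOfTwo(InitialAlign) && isPowerOfTwo(OptimizedAlign));
  // Take the larger of the requested and the optimized alignment.
  std::uint64_t ArgAlign = std::max(InitialAlign, OptimizedAlign);
  // Apply the old ptxas workaround only when explicitly requested.
  if (ForceMin4)
    ArgAlign = std::max<std::uint64_t>(ArgAlign, 4);
  return ArgAlign;
}
```

For a byval(i8) parameter this model yields 1 without the flag and 4 with it, and an 8-byte-aligned parameter stays at 8 either way, which is consistent with the NOALIGN4 and ALIGN4 expectations in param-align.ll.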

llvm/test/CodeGen/NVPTX/call_bitcast_byval.ll

 ; RUN: llc < %s -march=nvptx -mcpu=sm_50 -verify-machineinstrs | FileCheck %s
 ; RUN: %if ptxas %{ llc < %s -march=nvptx -mcpu=sm_50 -verify-machineinstrs | %ptxas-verify %}

 ; calls with a bitcasted function symbol should be fine, but in combination with
 ; a byval attribute were causing a segfault during isel. This testcase was
 ; reduced from a SYCL kernel using aggregate types which ended up being passed
 ; `byval`

 target datalayout = "e-i64:64-i128:128-v16:16-v32:32-n16:32:64"
 target triple = "nvptx64-nvidia-cuda"

 %"class.complex" = type { %"class.sycl::_V1::detail::half_impl::half", %"class.sycl::_V1::detail::half_impl::half" }
 %"class.sycl::_V1::detail::half_impl::half" = type { half }
 %complex_half = type { half, half }

-; CHECK: .param .align 4 .b8 param2[4];
-; CHECK: st.param.v2.b16 [param2+0], {%h2, %h1};
+; CHECK: .param .align 2 .b8 param2[4];
+; CHECK: st.param.b16 [param2+0], %h1;
+; CHECK: st.param.b16 [param2+2], %h2;
 ; CHECK: .param .align 2 .b8 retval0[4];
 ; CHECK: call.uni (retval0),
 ; CHECK-NEXT: _Z20__spirv_GroupCMulKHRjjN5__spv12complex_halfE,
 define weak_odr void @foo() {
 entry:
   %call.i.i.i = tail call %"class.complex" @_Z20__spirv_GroupCMulKHRjjN5__spv12complex_halfE(i32 0, i32 0, ptr byval(%"class.complex") null)
   ret void
 }

 ;; Function pointers can escape, so we have to use a conservative
 ;; alignment for a function that has address taken.
 ;;
 declare ptr @usefp(ptr %fp)
 ; CHECK: .func callee(
-; CHECK-NEXT: .param .align 4 .b8 callee_param_0[4]
+; CHECK-NEXT: .param .align 2 .b8 callee_param_0[4]
 define internal void @callee(ptr byval(%"class.complex") %byval_arg) {
   ret void
 }
 define void @boom() {
   %fp = call ptr @usefp(ptr @callee)
-; CHECK: .param .align 4 .b8 param0[4];
-; CHECK: st.param.v2.b16 [param0+0]
-; CHECK: .callprototype ()_ (.param .align 4 .b8 _[4]);
+; CHECK: .param .align 2 .b8 param0[4];
+; CHECK: st.param.b16 [param0+0], %h1;
+; CHECK: st.param.b16 [param0+2], %h2;
+; CHECK: .callprototype ()_ (.param .align 2 .b8 _[4]);
   call void %fp(ptr byval(%"class.complex") null)
   ret void
 }

 declare %complex_half @_Z20__spirv_GroupCMulKHRjjN5__spv12complex_halfE(i32, i32, ptr byval(%"class.complex"))

llvm/test/CodeGen/NVPTX/param-align.ll

-; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s
+; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s --check-prefixes=CHECK,NOALIGN4
+; RUN: llc < %s -march=nvptx -mcpu=sm_20 -nvptx-force-min-byval-param-align | FileCheck %s --check-prefixes=CHECK,ALIGN4
 ; RUN: %if ptxas %{ llc < %s -march=nvptx -mcpu=sm_20 | %ptxas-verify %}
+; RUN: %if ptxas %{ llc < %s -march=nvptx -mcpu=sm_20 -nvptx-force-min-byval-param-align | %ptxas-verify %}

 ;;; Need 4-byte alignment on ptr passed byval
 define ptx_device void @t1(ptr byval(float) %x) {
 ; CHECK: .func t1
 ; CHECK: .param .align 4 .b8 t1_param_0[4]
   ret void
 }

[9 lines not shown]
 ;;; Need 4-byte alignment on float2* passed byval
 %struct.float2 = type { float, float }
 define ptx_device void @t3(ptr byval(%struct.float2) %x) {
 ; CHECK: .func t3
 ; CHECK: .param .align 4 .b8 t3_param_0[8]
   ret void
 }

-;;; Need at least 4-byte alignment in order to avoid miscompilation by
-;;; ptxas for sm_50+
 define ptx_device void @t4(ptr byval(i8) %x) {
 ; CHECK: .func t4
-; CHECK: .param .align 4 .b8 t4_param_0[1]
+; NOALIGN4: .param .align 1 .b8 t4_param_0[1]
+; ALIGN4: .param .align 4 .b8 t4_param_0[1]
   ret void
 }

 ;;; Make sure we adjust alignment at the call site as well.
 define ptx_device void @t5(ptr align 2 byval(i8) %x) {
 ; CHECK: .func t5
-; CHECK: .param .align 4 .b8 t5_param_0[1]
+; NOALIGN4: .param .align 2 .b8 t5_param_0[1]
+; ALIGN4: .param .align 4 .b8 t5_param_0[1]
 ; CHECK: {
-; CHECK: .param .align 4 .b8 param0[1];
+; NOALIGN4: .param .align 1 .b8 param0[1];
+; ALIGN4: .param .align 4 .b8 param0[1];
 ; CHECK: call.uni
   call void @t4(ptr byval(i8) %x)
   ret void
 }

 ;;; Make sure we adjust alignment for a function prototype
 ;;; in case of an inderect call.

 declare ptr @getfp(i32 %n)
 %struct.half2 = type { half, half }
 define ptx_device void @t6() {
 ; CHECK: .func t6
   %fp = call ptr @getfp(i32 0)
 ; CHECK: prototype_2 : .callprototype ()_ (.param .align 8 .b8 _[8]);
   call void %fp(ptr byval(double) null);

   %fp2 = call ptr @getfp(i32 1)
-; CHECK: prototype_4 : .callprototype ()_ (.param .align 4 .b8 _[4]);
+; NOALIGN4: prototype_4 : .callprototype ()_ (.param .align 2 .b8 _[4]);
+; ALIGN4: prototype_4 : .callprototype ()_ (.param .align 4 .b8 _[4]);
   call void %fp(ptr byval(%struct.half2) null);

   %fp3 = call ptr @getfp(i32 2)
-; CHECK: prototype_6 : .callprototype ()_ (.param .align 4 .b8 _[1]);
+; NOALIGN4: prototype_6 : .callprototype ()_ (.param .align 1 .b8 _[1]);
+; ALIGN4: prototype_6 : .callprototype ()_ (.param .align 4 .b8 _[1]);
   call void %fp(ptr byval(i8) null);
   ret void
 }