This is an archive of the discontinued LLVM Phabricator instance.

Differential D95241

AMDGPU: Fix redundant FP spilling/assert in some functions
Closed, Public

Authored by arsenm on Jan 22 2021, 8:21 AM.

Details

Reviewers

scott.linder
RamNalamothu
kerbowa
rampitec

Summary

If a function has stack objects, and a call, we require an FP. If we
did not initially have any stack objects, and only introduced them
during PrologEpilogInserter for CSR VGPR spills, SILowerSGPRSpills
would end up spilling the FP register as if it were a normal
register. This would result in an assert in a debug build, or
redundant handling of the FP register in a release build.

Try to predict that we will have an FP later, although this is ugly.
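
To make the prediction concrete, here is a minimal standalone sketch of the idea, not the actual patch. It models the saved-register set with `std::bitset` instead of LLVM's `BitVector`; the register numbering and the names `FPReg`, `vgprMask`, and `sgprsToSpill` are hypothetical and exist only for illustration.

```cpp
#include <bitset>
#include <cassert>

// Toy register file, purely illustrative (not the real AMDGPU encoding).
constexpr unsigned NumRegs = 8;
constexpr unsigned FPReg = 0; // stands in for the frame pointer SGPR
using RegMask = std::bitset<NumRegs>;

// In this model, registers 4..7 play the role of VGPRs.
inline RegMask vgprMask() { return RegMask(0b11110000); }

// Returns the SGPRs that should still get an ordinary callee-saved spill.
// HasFPNow models the answer hasFP() would give at this point in compilation.
inline RegMask sgprsToSpill(RegMask SavedRegs, bool HasCalls, bool HasFPNow) {
  const RegMask AllSavedRegs = SavedRegs;
  SavedRegs &= ~vgprMask();

  // If dropping the VGPRs changed the set, a CSR VGPR spill (and therefore a
  // stack object) will be introduced later.
  const bool HaveAnyCSRVGPR = SavedRegs != AllSavedRegs;

  // A call plus a future stack object means an FP will be required.
  const bool WillHaveFP = HasCalls && HaveAnyCSRVGPR;

  // The FP is managed specially, so it must not be spilled like a normal SGPR.
  if (WillHaveFP || HasFPNow)
    SavedRegs.reset(FPReg);
  return SavedRegs;
}
```

In this model, a function with a call and one clobbered callee-saved VGPR drops the FP from the ordinary spill set even though it has no stack objects yet, which is the case the patch targets.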

Event Timeline

arsenm created this revision. Jan 22 2021, 8:21 AM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others. Jan 22 2021, 8:21 AM

arsenm requested review of this revision. Jan 22 2021, 8:21 AM

Herald added a project: Restricted Project. Jan 22 2021, 8:21 AM

Herald added a subscriber: wdng.

scott.linder added inline comments. Jan 22 2021, 10:18 AM

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1349	Is there a reason to have `hasFP` as distinct from `WillHaveFP`? Is it not possible to have a single predicate which encompasses both? Is it an issue of needing to consider different context for each callsite? It seems like the only question any part of the code should be asking is: "Can we prove we do not and will not need a frame pointer?" I would expect it to be a "monotone" predicate, in that it can only ever "increase" in strength from "we may need an FP, so you must assume we do/will" to "we absolutely do not need an FP, and will never accidentally end up in a situation later in compilation where we actually do need an FP", and for debug builds this should be enforced via an assertion. It would be safe to implement the predicate to always return "we may need it", but we would want to have it detect as many cases where we could promise no need for an FP starting as early in compilation as possible. If I understand the exact issue being addressed here, this patch makes sense, but it feels like we will be hitting other similar issues until we do something more fundamental.

arsenm added inline comments. Jan 22 2021, 12:20 PM

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1349	Ultimately this is because our hasFP condition is weird. Unlike every other target, the stack addressing and stack growth are in the same direction so we end up requiring an FP to access stack objects if there's a call. We then don't know the set of stack objects because the callee saved spills are inserted later. Once we start making decisions that assume hasFP, we're stuck with the FP. We already do the same prediction in determineCalleeSaves, this just repeats it in determineCalleeSavesSGPR. This doesn't have to do with the callsites, it's a function of the function with a call. I think I tried before to make hasFP conservative, but I don't remember what the problem was

LGTM, this materially improves things, and we can always try to unify all of this in the future if we have the time

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1349	I still think it would be easier to understand if we could find a way to move this all into `hasFP` and make it conservative. It seems like other targets get this for free by virtue of being able to address negative offsets. "Once we start making decisions that assume hasFP, we're stuck with the FP." If we actually ensure this, I'm happy, but I just have a hard time reasoning about it with what we currently have. Ideally I'd like an `#ifndef NDEBUG` `bool` that gets set when we decide we can elide the FP in `hasFP`. Then we can `assert(!ElidedFP)` if we later determine we need an FP. It sounds like the case you are fixing here already hit an assert in debug builds, though, so maybe this isn't necessary.
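
The debug-only guard suggested above could look roughly like the following. This is a sketch under stated assumptions, not code from the patch or from LLVM: `FrameState` is a hypothetical stand-in for the per-function frame information, and the `hasFP` here is a simplified free function that only models the "call plus stack objects" condition discussed in this review.

```cpp
#include <cassert>

// Hypothetical per-function state; not the real SIMachineFunctionInfo.
struct FrameState {
  bool HasCalls = false;
  bool HasStackObjects = false;
#ifndef NDEBUG
  // Set once any query has answered "no FP needed", so that later code may
  // have been emitted under that assumption.
  mutable bool ElidedFP = false;
#endif
};

// Simplified model of the predicate: an FP is needed when the function has
// both a call and stack objects.
inline bool hasFP(const FrameState &FS) {
  const bool Need = FS.HasCalls && FS.HasStackObjects;
#ifndef NDEBUG
  if (Need)
    // Catch the answer flipping from "no FP" to "FP" during compilation.
    assert(!FS.ElidedFP && "FP needed after an earlier query elided it");
  else
    FS.ElidedFP = true;
#endif
  return Need;
}
```

With this guard, a debug build would assert if a stack object is added to a function with a call after `hasFP` has already returned false for it; a release build pays no cost because the flag is compiled out.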

This revision is now accepted and ready to land. Jan 25 2021, 12:18 PM

5f9707b7960e9a86330dacb814b2ec55207e6f87

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp	14 lines
llvm/test/CodeGen/AMDGPU/gfx-callable-preserved-registers.ll	28 lines
llvm/test/CodeGen/AMDGPU/need-fp-from-csr-vgpr-spill.ll	112 lines

Diff 318531

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp

@@ void SIFrameLowering::determineCalleeSavesSGPR(MachineFunction &MF, @@
   if (MFI->isEntryFunction())
     return;

   const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
   const SIRegisterInfo *TRI = ST.getRegisterInfo();

   // The SP is specifically managed and we don't want extra spills of it.
   SavedRegs.reset(MFI->getStackPtrOffsetReg());
+
+  const BitVector AllSavedRegs = SavedRegs;
   SavedRegs.clearBitsInMask(TRI->getAllVGPRRegMask());
+
+  // If clearing VGPRs changed the mask, we will have some CSR VGPR spills.
+  const bool HaveAnyCSRVGPR = SavedRegs != AllSavedRegs;
+
+  // We have to anticipate introducing CSR VGPR spills if we don't have any
+  // stack objects already, since we require an FP if there is a call and stack.
+  MachineFrameInfo &FrameInfo = MF.getFrameInfo();
+  const bool WillHaveFP = FrameInfo.hasCalls() && HaveAnyCSRVGPR;
+
+  // FP will be specially managed like SP.
+  if (WillHaveFP || hasFP(MF))
+    SavedRegs.reset(MFI->getFrameOffsetReg());
 }

 bool SIFrameLowering::assignCalleeSavedSpillSlots(
     MachineFunction &MF, const TargetRegisterInfo *TRI,
     std::vector<CalleeSavedInfo> &CSI) const {
   if (CSI.empty())
     return true; // Early exit if no callee saved registers are modified!

llvm/test/CodeGen/AMDGPU/gfx-callable-preserved-registers.ll

 define amdgpu_gfx void @test_call_void_func_void_preserves_s33(i32 addrspace(1)* %out) #0 {
 ; GFX9-LABEL: test_call_void_func_void_preserves_s33:
 ; GFX9: ; %bb.0:
 ; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX9-NEXT: s_or_saveexec_b64 s[4:5], -1
 ; GFX9-NEXT: buffer_store_dword v40, off, s[0:3], s32 ; 4-byte Folded Spill
 ; GFX9-NEXT: s_mov_b64 exec, s[4:5]
-; GFX9-NEXT: v_writelane_b32 v40, s33, 3
+; GFX9-NEXT: v_writelane_b32 v40, s33, 2
+; GFX9-NEXT: v_writelane_b32 v40, s30, 0
 ; GFX9-NEXT: s_mov_b32 s33, s32
-; GFX9-NEXT: v_writelane_b32 v40, s33, 0
-; GFX9-NEXT: v_writelane_b32 v40, s30, 1
 ; GFX9-NEXT: s_add_u32 s32, s32, 0x400
 ; GFX9-NEXT: s_getpc_b64 s[4:5]
 ; GFX9-NEXT: s_add_u32 s4, s4, external_void_func_void@rel32@lo+4
 ; GFX9-NEXT: s_addc_u32 s5, s5, external_void_func_void@rel32@hi+12
-; GFX9-NEXT: v_writelane_b32 v40, s31, 2
+; GFX9-NEXT: v_writelane_b32 v40, s31, 1
 ; GFX9-NEXT: ;;#ASMSTART
 ; GFX9-NEXT: ; def s33
 ; GFX9-NEXT: ;;#ASMEND
 ; GFX9-NEXT: s_swappc_b64 s[30:31], s[4:5]
+; GFX9-NEXT: v_readlane_b32 s4, v40, 0
 ; GFX9-NEXT: ;;#ASMSTART
 ; GFX9-NEXT: ; use s33
 ; GFX9-NEXT: ;;#ASMEND
-; GFX9-NEXT: v_readlane_b32 s4, v40, 1
-; GFX9-NEXT: v_readlane_b32 s33, v40, 0
-; GFX9-NEXT: v_readlane_b32 s5, v40, 2
+; GFX9-NEXT: v_readlane_b32 s5, v40, 1
 ; GFX9-NEXT: s_sub_u32 s32, s32, 0x400
-; GFX9-NEXT: v_readlane_b32 s33, v40, 3
+; GFX9-NEXT: v_readlane_b32 s33, v40, 2
 ; GFX9-NEXT: s_or_saveexec_b64 s[6:7], -1
 ; GFX9-NEXT: buffer_load_dword v40, off, s[0:3], s32 ; 4-byte Folded Reload
 ; GFX9-NEXT: s_mov_b64 exec, s[6:7]
 ; GFX9-NEXT: s_waitcnt vmcnt(0)
 ; GFX9-NEXT: s_setpc_b64 s[4:5]
 ;
 ; GFX10-LABEL: test_call_void_func_void_preserves_s33:
 ; GFX10: ; %bb.0:
 ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
 ; GFX10-NEXT: s_or_saveexec_b32 s4, -1
 ; GFX10-NEXT: buffer_store_dword v40, off, s[0:3], s32 ; 4-byte Folded Spill
 ; GFX10-NEXT: s_waitcnt_depctr 0xffe3
 ; GFX10-NEXT: s_mov_b32 exec_lo, s4
-; GFX10-NEXT: v_writelane_b32 v40, s33, 3
+; GFX10-NEXT: v_writelane_b32 v40, s33, 2
 ; GFX10-NEXT: s_mov_b32 s33, s32
 ; GFX10-NEXT: s_add_u32 s32, s32, 0x200
 ; GFX10-NEXT: s_getpc_b64 s[4:5]
 ; GFX10-NEXT: s_add_u32 s4, s4, external_void_func_void@rel32@lo+4
 ; GFX10-NEXT: s_addc_u32 s5, s5, external_void_func_void@rel32@hi+12
-; GFX10-NEXT: v_writelane_b32 v40, s33, 0
 ; GFX10-NEXT: ;;#ASMSTART
 ; GFX10-NEXT: ; def s33
 ; GFX10-NEXT: ;;#ASMEND
-; GFX10-NEXT: v_writelane_b32 v40, s30, 1
-; GFX10-NEXT: v_writelane_b32 v40, s31, 2
+; GFX10-NEXT: v_writelane_b32 v40, s30, 0
+; GFX10-NEXT: v_writelane_b32 v40, s31, 1
 ; GFX10-NEXT: s_swappc_b64 s[30:31], s[4:5]
+; GFX10-NEXT: v_readlane_b32 s4, v40, 0
 ; GFX10-NEXT: ;;#ASMSTART
 ; GFX10-NEXT: ; use s33
 ; GFX10-NEXT: ;;#ASMEND
-; GFX10-NEXT: v_readlane_b32 s4, v40, 1
-; GFX10-NEXT: v_readlane_b32 s33, v40, 0
-; GFX10-NEXT: v_readlane_b32 s5, v40, 2
+; GFX10-NEXT: v_readlane_b32 s5, v40, 1
 ; GFX10-NEXT: s_sub_u32 s32, s32, 0x200
-; GFX10-NEXT: v_readlane_b32 s33, v40, 3
+; GFX10-NEXT: v_readlane_b32 s33, v40, 2
 ; GFX10-NEXT: s_or_saveexec_b32 s6, -1
 ; GFX10-NEXT: buffer_load_dword v40, off, s[0:3], s32 ; 4-byte Folded Reload
 ; GFX10-NEXT: s_waitcnt_depctr 0xffe3
 ; GFX10-NEXT: s_mov_b32 exec_lo, s6
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: s_setpc_b64 s[4:5]
   %s33 = call i32 asm sideeffect "; def $0", "={s33}"()
   call amdgpu_gfx void @external_void_func_void()

llvm/test/CodeGen/AMDGPU/need-fp-from-csr-vgpr-spill.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck %s

				; FP is in CSR range, modified.
				define hidden fastcc void @callee_has_fp() #1 {
				; CHECK-LABEL: callee_has_fp:
				; CHECK: ; %bb.0:
				; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; CHECK-NEXT: s_mov_b32 s4, s33
				; CHECK-NEXT: s_mov_b32 s33, s32
				; CHECK-NEXT: s_add_u32 s32, s32, 0x200
				; CHECK-NEXT: v_mov_b32_e32 v0, 1
				; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s33 offset:4
				; CHECK-NEXT: s_waitcnt vmcnt(0)
				; CHECK-NEXT: s_sub_u32 s32, s32, 0x200
				; CHECK-NEXT: s_mov_b32 s33, s4
				; CHECK-NEXT: s_setpc_b64 s[30:31]
				%alloca = alloca i32, addrspace(5)
				store volatile i32 1, i32 addrspace(5)* %alloca
				ret void
				}

				; Has no stack objects, but introduces them due to the CSR spill. We
				; see the FP modified in the callee with IPRA. We should not have
				; redundant spills of s33 or assert.
				define internal fastcc void @csr_vgpr_spill_fp_callee() #0 {
				; CHECK-LABEL: csr_vgpr_spill_fp_callee:
				; CHECK: ; %bb.0: ; %bb
				; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; CHECK-NEXT: s_mov_b32 s8, s33
				; CHECK-NEXT: s_mov_b32 s33, s32
				; CHECK-NEXT: s_add_u32 s32, s32, 0x400
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, callee_has_fp@rel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, callee_has_fp@rel32@hi+12
				; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s33 ; 4-byte Folded Spill
				; CHECK-NEXT: s_mov_b64 s[6:7], s[30:31]
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; CHECK-NEXT: ;;#ASMSTART
				; CHECK-NEXT: ; clobber csr v40
				; CHECK-NEXT: ;;#ASMEND
				; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s33 ; 4-byte Folded Reload
				; CHECK-NEXT: s_sub_u32 s32, s32, 0x400
				; CHECK-NEXT: s_mov_b32 s33, s8
				; CHECK-NEXT: s_waitcnt vmcnt(0)
				; CHECK-NEXT: s_setpc_b64 s[6:7]
				bb:
				call fastcc void @callee_has_fp()
				call void asm sideeffect "; clobber csr v40", "~{v40}"()
				ret void
				}

				define amdgpu_kernel void @kernel_call() {
				; CHECK-LABEL: kernel_call:
				; CHECK: ; %bb.0: ; %bb
				; CHECK-NEXT: s_add_u32 flat_scratch_lo, s4, s7
				; CHECK-NEXT: s_addc_u32 flat_scratch_hi, s5, 0
				; CHECK-NEXT: s_add_u32 s0, s0, s7
				; CHECK-NEXT: s_addc_u32 s1, s1, 0
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, csr_vgpr_spill_fp_callee@rel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, csr_vgpr_spill_fp_callee@rel32@hi+12
				; CHECK-NEXT: s_mov_b32 s32, 0
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; CHECK-NEXT: s_endpgm
				bb:
				tail call fastcc void @csr_vgpr_spill_fp_callee()
				ret void
				}

				; Same, except with a tail call.
				define internal fastcc void @csr_vgpr_spill_fp_tailcall_callee() #0 {
				; CHECK-LABEL: csr_vgpr_spill_fp_tailcall_callee:
				; CHECK: ; %bb.0: ; %bb
				; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; CHECK-NEXT: buffer_store_dword v40, off, s[0:3], s32 ; 4-byte Folded Spill
				; CHECK-NEXT: ;;#ASMSTART
				; CHECK-NEXT: ; clobber csr v40
				; CHECK-NEXT: ;;#ASMEND
				; CHECK-NEXT: buffer_load_dword v40, off, s[0:3], s32 ; 4-byte Folded Reload
				; CHECK-NEXT: v_writelane_b32 v1, s33, 0
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, callee_has_fp@rel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, callee_has_fp@rel32@hi+12
				; CHECK-NEXT: v_readlane_b32 s33, v1, 0
				; CHECK-NEXT: s_setpc_b64 s[4:5]
				bb:
				call void asm sideeffect "; clobber csr v40", "~{v40}"()
				tail call fastcc void @callee_has_fp()
				ret void
				}

				define amdgpu_kernel void @kernel_tailcall() {
				; CHECK-LABEL: kernel_tailcall:
				; CHECK: ; %bb.0: ; %bb
				; CHECK-NEXT: s_add_u32 flat_scratch_lo, s4, s7
				; CHECK-NEXT: s_addc_u32 flat_scratch_hi, s5, 0
				; CHECK-NEXT: s_add_u32 s0, s0, s7
				; CHECK-NEXT: s_addc_u32 s1, s1, 0
				; CHECK-NEXT: s_getpc_b64 s[4:5]
				; CHECK-NEXT: s_add_u32 s4, s4, csr_vgpr_spill_fp_tailcall_callee@rel32@lo+4
				; CHECK-NEXT: s_addc_u32 s5, s5, csr_vgpr_spill_fp_tailcall_callee@rel32@hi+12
				; CHECK-NEXT: s_mov_b32 s32, 0
				; CHECK-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; CHECK-NEXT: s_endpgm
				bb:
				tail call fastcc void @csr_vgpr_spill_fp_tailcall_callee()
				ret void
				}

				attributes #0 = { "frame-pointer"="none" noinline }
				attributes #1 = { "frame-pointer"="all" noinline }