This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.
ClosedPublic

Authored by hliao on Sep 11 2020, 10:05 PM.

Details

Summary
  • COPY from SGPR to VGPR needs to be lowered to a real instruction. The generic COPY is meant for cases where source and destination are in the same register bank, so that the two registers can potentially be coalesced and the COPY removed; consequently, backend optimizations such as CSE do not handle COPYs. A copy from SGPR to VGPR, however, must always be materialized as a native instruction, so it should be lowered into a real one before other backend optimizations run.
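
    A minimal sketch of the idea (an illustration under assumed helper and variable names, not the patch itself): rewrite a 32-bit COPY whose source is an SGPR and whose destination is a VGPR into V_MOV_B32_e32 in place, so later passes see a real, foldable instruction.

      #include "SIInstrInfo.h"
      #include "SIRegisterInfo.h"
      #include "llvm/CodeGen/MachineInstr.h"
      #include "llvm/CodeGen/MachineRegisterInfo.h"

      using namespace llvm;

      // Hypothetical helper: turn `%vdst:vgpr_32 = COPY %ssrc:sgpr_32` into
      // `%vdst:vgpr_32 = V_MOV_B32_e32 %ssrc:sgpr_32` in place.
      static void lowerSGPRToVGPRCopy(MachineInstr &MI, MachineRegisterInfo &MRI,
                                      const SIRegisterInfo &TRI,
                                      const SIInstrInfo &TII) {
        if (!MI.isCopy())
          return;
        Register Dst = MI.getOperand(0).getReg();
        Register Src = MI.getOperand(1).getReg();
        // Only rewrite SGPR -> VGPR moves; same-bank copies stay as COPY so
        // the register coalescer can still eliminate them.
        if (!TRI.isVGPR(MRI, Dst) || !TRI.isSGPRReg(MRI, Src))
          return;
        MI.setDesc(TII.get(AMDGPU::V_MOV_B32_e32));
      }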

Diff Detail

Event Timeline

hliao created this revision.Sep 11 2020, 10:05 PM
hliao requested review of this revision.Sep 11 2020, 10:06 PM

With this patch, register pressure is reduced on certain tests, improving them by up to 10%. As the generated code visibly differs from before, regression tests are updated as well.

hliao added inline comments.Sep 11 2020, 10:37 PM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1247–1251

This change is needed because CSE runs before operand folding. After replacing COPY with V_MOV, we have only one V_MOV loading that literal value, and its result is reused multiple times by the following REG_SEQUENCE. This change fixes the wqm.ll regression test.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
105–108

Add an option to enable or disable this lowering. Because the lowering may lengthen register live ranges and increase register pressure, some tests may regress in performance; having this option eases performance evaluation on suspect benchmarks.
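
A minimal sketch of such an option, with a hypothetical flag name and default (the actual name and default are in the patch):

    #include "llvm/Support/CommandLine.h"

    using namespace llvm;

    // Hypothetical flag name for illustration only.
    static cl::opt<bool> EnableSGPRToVGPRCopyLowering(
        "amdgpu-enable-lower-sgpr-to-vgpr-copy",
        cl::desc("Lower SGPR-to-VGPR copies to V_MOV during instruction "
                 "selection"),
        cl::init(true), cl::Hidden);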

11513–11514

Only SGPR_(32|64) is handled so far, as instruction selection assigns SReg_(32|64) by default. Operand folding currently proceeds in pre-order, so the chain of MOVs from an immediate value into an SGPR and then into a VGPR is not folded efficiently: we fold the first MOV (immediate to SGPR) into the second MOV (SGPR to VGPR), as in the sketch below. IMHO, we should change the folding order from pre-order to post-order.
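
For illustration, a hedged sketch of that fold under assumed naming (not the actual SIFoldOperands logic): if the source of a lowered V_MOV_B32_e32 is defined by an S_MOV_B32 of an immediate, fold the immediate directly into the V_MOV.

    #include "SIInstrInfo.h"
    #include "llvm/CodeGen/MachineInstr.h"
    #include "llvm/CodeGen/MachineRegistryInfo.h"

    using namespace llvm;

    // Hypothetical sketch: fold the immediate of a defining S_MOV_B32 into a
    // lowered V_MOV_B32_e32, i.e.
    //   %s = S_MOV_B32 imm
    //   %v = V_MOV_B32_e32 %s
    // becomes
    //   %v = V_MOV_B32_e32 imm
    static bool foldImmThroughMovChain(MachineInstr &VMov,
                                       MachineRegisterInfo &MRI) {
      if (VMov.getOpcode() != AMDGPU::V_MOV_B32_e32)
        return false;
      MachineOperand &Src = VMov.getOperand(1);
      if (!Src.isReg() || !Src.getReg().isVirtual())
        return false;
      MachineInstr *Def = MRI.getUniqueVRegDef(Src.getReg());
      if (!Def || Def->getOpcode() != AMDGPU::S_MOV_B32 ||
          !Def->getOperand(1).isImm())
        return false;
      // Rewrite `V_MOV_B32_e32 %v, %s` as `V_MOV_B32_e32 %v, imm`.
      Src.ChangeToImmediate(Def->getOperand(1).getImm());
      return true;
    }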

llvm/test/CodeGen/AMDGPU/fabs.ll
14

The additional optimizations remove redundant instructions so that the shrink pass can discover such patterns.

llvm/test/CodeGen/AMDGPU/sgpr-copy-cse.ll
11

This is the test extracted from that benchmark.

llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.ll
156–159

As the instruction count is reduced, the schedule is slightly different from the previous one, so those two s_waitcnt instructions can no longer be merged into a single one.

llvm/test/CodeGen/AMDGPU/wqm.ll
653

Here's the result of pre-order operand folding. Previously, we had one S_MOV loading an immediate value followed by a COPY from SGPR to VGPR; since COPY does not allow an immediate operand, the fold could only happen at the user of that COPY. Now that the COPY is lowered to V_MOV, the S_MOV is folded into that V_MOV first. That is not a good fold, as a VGPR is used instead of an SGPR. IMHO, the proper fix is to change pre-order operand folding to post-order, i.e. always fold from the user instruction; that site is a better place to decide how to fold, especially when more than one operand could be folded.

hliao updated this revision to Diff 291582.Sep 14 2020, 8:34 AM

Follow suggestions from clang-tidy.

foad added a subscriber: foad.Sep 14 2020, 9:48 AM
rampitec accepted this revision.Sep 14 2020, 11:28 AM

LGTM, but please also wait for Matt's review.

This revision is now accepted and ready to land.Sep 14 2020, 11:28 AM
arsenm added inline comments.Sep 17 2020, 8:17 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1246–1251

This is a separate change.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11497

I think it's unreasonable to have another surprise pass here.

I don't think this transform necessarily makes sense. Copies should always be preferable to moves.