Details

Reviewers

tstellarAMD
vpykhtin
alex-t

Commits

rGa4e63ead4b43: [AMDGPU] Do not allow register coalescer to create big superregs
rL292413: [AMDGPU] Do not allow register coalescer to create big superregs

Summary

Limit register coalescer by not allowing it to artificially increase
size of registers beyond dword. Such super-registers are in fact
register sequences and not distinct HW registers.

With more super-regs we would need to allocate adjacent registers
and constrain regalloc more than needed. Moreover, our super
registers are overlapping. For instance we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4 etc, which complicates register
allocation even more, resulting in excessive spilling.
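
The overlap described in the summary can be illustrated with a small standalone model. This is only a sketch and not LLVM code: `Tuple`, `enumerateTuples`, `overlaps`, `countBlocked`, and `checkOverlapExample` are hypothetical names, and the register file size of 256 is an assumption chosen for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a register tuple: Width consecutive registers
// starting at index First (e.g. {2, 3} stands for VGPR2_VGPR3_VGPR4).
struct Tuple {
  unsigned First;
  unsigned Width;
};

// Enumerate every tuple of Width consecutive registers in a file of
// NumRegs registers. Tuples start at every index, so they overlap.
std::vector<Tuple> enumerateTuples(unsigned NumRegs, unsigned Width) {
  std::vector<Tuple> Result;
  for (unsigned First = 0; First + Width <= NumRegs; ++First)
    Result.push_back({First, Width});
  return Result;
}

// Two tuples overlap if their half-open index ranges intersect.
bool overlaps(Tuple A, Tuple B) {
  return A.First < B.First + B.Width && B.First < A.First + A.Width;
}

// Count how many candidate tuples become unavailable once Allocated is taken
// (the count includes Allocated itself).
unsigned countBlocked(const std::vector<Tuple> &All, Tuple Allocated) {
  unsigned Blocked = 0;
  for (const Tuple &T : All)
    if (overlaps(T, Allocated))
      ++Blocked;
  return Blocked;
}

// Small self-check under the assumed file size of 256 registers.
void checkOverlapExample() {
  const std::vector<Tuple> Triples = enumerateTuples(256, 3);
  assert(Triples.size() == 254);
  // Taking VGPR2_VGPR3_VGPR4 blocks the triples starting at 0, 1, 2, 3 and 4.
  assert(countBlocked(Triples, Tuple{2, 3}) == 5);
  // A triple at the edge of the file blocks fewer candidates.
  assert(countBlocked(Triples, Tuple{0, 3}) == 3);
}
```

In this model a single dword allocation removes one candidate, while a 3-register tuple in the middle of the file removes five, which is one way to see why wider tuples constrain the allocator more.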

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Jan 16 2017, 2:51 PM

Herald added a reviewer: tstellarAMD. Jan 16 2017, 2:51 PM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 3 others.

arsenm added inline comments. Jan 16 2017, 2:55 PM

test/CodeGen/AMDGPU/limit-coalesce.mir
3–4 ↗	(On Diff #84600)	This should use positive checks

Updated test for positive checks.

rampitec marked an inline comment as done. Jan 16 2017, 3:17 PM

Pre-checkin passed.

I think a little more experimentation here might be worthwhile. It's not obvious to me that this is the right heuristic. Allowing 8 or wider might be beneficial. With subregister liveness tracking I would hope that there wouldn't be much difference for 2-4 register tuples. For the larger registers I could see there being more issues.

I see a very small improvement in shader-db with this as is:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494573 (-0.00 %)
VGPRS: 941553 -> 941353 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40835224 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298015 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 19168 -> 19152 (-0.08 %)
VGPRS: 15952 -> 15752 (-1.25 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 890656 -> 894328 (0.41 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 2197 -> 2230 (1.50 %)
Wait states: 0 -> 0 (0.00 %)

If I increase the threshold to 8 I see slightly better improvements:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494549 (-0.00 %)
VGPRS: 941553 -> 941377 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40834176 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298014 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 10664 -> 10624 (-0.38 %)
VGPRS: 10624 -> 10448 (-1.66 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 627904 -> 630528 (0.42 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 1111 -> 1143 (2.88 %)
Wait states: 0 -> 0 (0.00 %)
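
The percentages in the reports above appear to be the relative change from the first value to the second, rounded to two decimals. The sketch below reproduces that arithmetic under this assumption; `percentChange`, `roundToHundredths`, and `checkPercentExamples` are hypothetical helpers and are not part of shader-db.

```cpp
#include <cassert>
#include <cmath>

// Relative change from Before to After, in percent.
// Assumption: equal values (including 0 -> 0) are reported as 0 %.
double percentChange(double Before, double After) {
  if (Before == After)
    return 0.0;
  return (After - Before) / Before * 100.0;
}

// Round to two decimal places, matching the precision shown in the report.
double roundToHundredths(double Value) {
  return std::round(Value * 100.0) / 100.0;
}

// Checks against rows quoted in the review ("Totals from affected shaders").
void checkPercentExamples() {
  // VGPRS: 15952 -> 15752 (-1.25 %)
  assert(std::fabs(roundToHundredths(percentChange(15952, 15752)) + 1.25) < 1e-9);
  // Max Waves: 2197 -> 2230 (1.50 %)
  assert(std::fabs(roundToHundredths(percentChange(2197, 2230)) - 1.50) < 1e-9);
  // Spilled VGPRs: 0 -> 0 (0.00 %)
  assert(percentChange(0, 0) == 0.0);
}
```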

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗	(On Diff #84603)	This isn't being used for the spill size, so this is supposed to use getRegBitWidth
1491–1493 ↗	(On Diff #84603)	We don't have sub-dword registers, so the < and comment are misleading
test/CodeGen/AMDGPU/limit-coalesce.mir
53 ↗	(On Diff #84603)	Can you add more tests for more register sizes?

This is a very conservative limitation to fix bloat in clFFT, where it saves ~600 bytes of scratch per kernel by creating vreg_96 from vreg_64. I have no doubt this place will be revisited many more times to improve the heuristics as more code is analyzed. This is really just the start of it.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗	(On Diff #84603)	What do you mean? /// getSize - Return the size of the register in bytes, which is also the size /// of a stack slot allocated to hold a spilled copy of this register.
1491–1493 ↗	(On Diff #84603)	This is for packed f16, we do not want to revisit this.

Modified test to add more sizes.

rampitec marked an inline comment as done. Jan 16 2017, 7:51 PM

arsenm added inline comments. Jan 16 2017, 8:36 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗	(On Diff #84603)	Since https://reviews.llvm.org/D24631 the TargetRegisterClass size is supposed to be considered only the spill size, which may be different from the register bit width
1491–1493 ↗	(On Diff #84603)	Even with packed f16 we don't have smaller sub registers than 4

Updated comment concerning sub-dwords.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗	(On Diff #84603)	It is not yet committed, and AMDGPU::getRegBitWidth() is not even close to handling all register classes. When it is submitted this can be reconsidered, but I guess it should auto-generate defaults to prevent the "Unexpected register class" assertion.
1491–1493 ↗	(On Diff #84603)	I'm not sure it will stay this way in the future. Anyway, comment is updated, but I do not want to remove "<".

LGTM.

This revision is now accepted and ready to land. Jan 18 2017, 5:56 AM

Closed by commit rL292413: [AMDGPU] Do not allow register coalescer to create big superregs (authored by rampitec). Jan 18 2017, 9:41 AM

This revision was automatically updated to reflect the committed changes.

Diff 84851

llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.h

 public:
 [...]
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
   unsigned getMaxNumVGPRs(const MachineFunction &MF) const;
 
   ArrayRef<int16_t> getRegSplitParts(const TargetRegisterClass *RC,
                                      unsigned EltSize) const;
 
+  bool shouldCoalesce(MachineInstr *MI,
+                      const TargetRegisterClass *SrcRC,
+                      unsigned SubReg,
+                      const TargetRegisterClass *DstRC,
+                      unsigned DstSubReg,
+                      const TargetRegisterClass *NewRC) const override;
+
 private:
   void buildSpillLoadStore(MachineBasicBlock::iterator MI,
                            unsigned LoadStoreOp,
                            int Index,
                            unsigned ValueReg,
                            bool ValueIsKill,
                            unsigned ScratchRsrcReg,
                            unsigned ScratchOffsetReg,
                            int64_t InstrOffset,
                            MachineMemOperand *MMO,
                            RegScavenger *RS) const;
 };
 
 } // End namespace llvm
 
 #endif

llvm/trunk/lib/Target/AMDGPU/SIRegisterInfo.cpp

 SIRegisterInfo::getRegClassForReg(const MachineRegisterInfo &MRI,
 [...]
   return getPhysRegClass(Reg);
 }
 
 bool SIRegisterInfo::isVGPR(const MachineRegisterInfo &MRI,
                             unsigned Reg) const {
   return hasVGPRs(getRegClassForReg(MRI, Reg));
 }
+
+bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,
+                                    const TargetRegisterClass *SrcRC,
+                                    unsigned SubReg,
+                                    const TargetRegisterClass *DstRC,
+                                    unsigned DstSubReg,
+                                    const TargetRegisterClass *NewRC) const {
+  unsigned SrcSize = SrcRC->getSize();
+  unsigned DstSize = DstRC->getSize();
+  unsigned NewSize = NewRC->getSize();
+
+  // Do not increase size of registers beyond dword, we would need to allocate
+  // adjacent registers and constraint regalloc more than needed.
+
+  // Always allow dword coalescing.
+  if (SrcSize <= 4 || DstSize <= 4)
+    return true;
+
+  return NewSize <= DstSize || NewSize <= SrcSize;
+}
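
The decision made by `shouldCoalesce` depends only on three sizes in bytes, so it can be tried out in isolation. The following is a minimal sketch rather than the committed code: `RegClassModel`, `shouldCoalesceModel`, the `*Model` constants, and `checkCoalesceExamples` are hypothetical stand-ins, and the unused `MI`, `SubReg`, and `DstSubReg` parameters are dropped.

```cpp
#include <cassert>

// Hypothetical stand-in for TargetRegisterClass that keeps only the size in
// bytes, i.e. the value the patch reads through getSize().
struct RegClassModel {
  unsigned SizeInBytes;
};

// Same comparison logic as the shouldCoalesce hook in this revision.
bool shouldCoalesceModel(RegClassModel Src, RegClassModel Dst,
                         RegClassModel New) {
  // Always allow coalescing when either side is a dword (4 bytes) or smaller.
  if (Src.SizeInBytes <= 4 || Dst.SizeInBytes <= 4)
    return true;
  // Otherwise the coalesced class must not be wider than both inputs.
  return New.SizeInBytes <= Dst.SizeInBytes ||
         New.SizeInBytes <= Src.SizeInBytes;
}

// Sizes for the classes mentioned in the review: 1, 2, 3 and 4 dwords.
const RegClassModel VGPR32Model{4};
const RegClassModel VReg64Model{8};
const RegClassModel VReg96Model{12};
const RegClassModel VReg128Model{16};

void checkCoalesceExamples() {
  // A dword on either side is always accepted, even if the result is wider.
  assert(shouldCoalesceModel(VGPR32Model, VReg64Model, VReg96Model));
  // Two 64-bit registers may merge into a 64-bit register.
  assert(shouldCoalesceModel(VReg64Model, VReg64Model, VReg64Model));
  // Growing two 64-bit registers into a 96-bit tuple is rejected.
  assert(!shouldCoalesceModel(VReg64Model, VReg64Model, VReg96Model));
  // A 128-bit result from 64-bit and 96-bit inputs is rejected as well.
  assert(!shouldCoalesceModel(VReg64Model, VReg96Model, VReg128Model));
}
```

The third check corresponds to the vreg_64 to vreg_96 case mentioned in the discussion; the other cases are illustrative inputs, not taken from the test file.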

llvm/trunk/test/CodeGen/AMDGPU/half.ll

 [...]
 ; XVI: buffer_load_dwordx2 [[LOAD:v\[[0-9]+:[0-9]+\]]]
 ; XVI-DAG: v_lshrrev_b32_e32 {{v[0-9]+}}, 16, {{v[0-9]+}}
 ; XVI: v_cvt_f32_f16_e32
 ; XVI: v_cvt_f32_f16_e32
 ; XVI: v_cvt_f32_f16_e32
 ; XVI-NOT: v_cvt_f32_f16
 
 ; GCN: buffer_load_dwordx2 v{{\[}}[[IN_LO:[0-9]+]]:[[IN_HI:[0-9]+]]
-; VI: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
-; GCN: v_cvt_f32_f16_e32 [[Z32:v[0-9]+]], v[[IN_HI]]
-; GCN: v_cvt_f32_f16_e32 [[X32:v[0-9]+]], v[[IN_LO]]
+; VI-DAG: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[Z32:v[0-9]+]], v[[IN_HI]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[X32:v[0-9]+]], v[[IN_LO]]
 ; SI: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
-; GCN: v_cvt_f32_f16_e32 [[Y32:v[0-9]+]], [[Y16]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[Y32:v[0-9]+]], [[Y16]]
 
-; GCN: v_cvt_f64_f32_e32 [[Z:v\[[0-9]+:[0-9]+\]]], [[Z32]]
-; GCN: v_cvt_f64_f32_e32 v{{\[}}[[XLO:[0-9]+]]:{{[0-9]+}}], [[X32]]
-; GCN: v_cvt_f64_f32_e32 v[{{[0-9]+}}:[[YHI:[0-9]+]]{{\]}}, [[Y32]]
+; GCN-DAG: v_cvt_f64_f32_e32 [[Z:v\[[0-9]+:[0-9]+\]]], [[Z32]]
+; GCN-DAG: v_cvt_f64_f32_e32 v{{\[}}[[XLO:[0-9]+]]:{{[0-9]+}}], [[X32]]
+; GCN-DAG: v_cvt_f64_f32_e32 v[{{[0-9]+}}:[[YHI:[0-9]+]]{{\]}}, [[Y32]]
 ; GCN-NOT: v_cvt_f64_f32_e32
 
 ; GCN-DAG: buffer_store_dwordx4 v{{\[}}[[XLO]]:[[YHI]]{{\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
 ; GCN-DAG: buffer_store_dwordx2 [[Z]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:16
 ; GCN: s_endpgm
 define void @global_extload_v3f16_to_v3f64(<3 x double> addrspace(1)* %out, <3 x half> addrspace(1)* %in) #0 {
   %val = load <3 x half>, <3 x half> addrspace(1)* %in
   %cvt = fpext <3 x half> %val to <3 x double>
 [...]

llvm/trunk/test/CodeGen/AMDGPU/limit-coalesce.mir

# RUN: llc -march=amdgcn -run-pass simple-register-coalescing -o - %s | FileCheck %s

# Check that coalescer does not create wider register tuple than in source

# CHECK: - { id: 2, class: vreg_64 }
# CHECK: - { id: 3, class: vreg_64 }
# CHECK: - { id: 4, class: vreg_64 }
# CHECK: - { id: 5, class: vreg_96 }
# CHECK: - { id: 6, class: vreg_96 }
# CHECK: - { id: 7, class: vreg_128 }
# CHECK: - { id: 8, class: vreg_128 }
# No more registers shall be defined
# CHECK-NEXT: liveins:
# CHECK: FLAT_STORE_DWORDX2 %vgpr0_vgpr1, %4,
# CHECK: FLAT_STORE_DWORDX3 %vgpr0_vgpr1, %6,

---
name: main
alignment: 0
exposesReturnsTwice: false
legalized: false
regBankSelected: false
selected: false
tracksRegLiveness: true
registers:
  - { id: 1, class: sreg_32_xm0, preferred-register: '%1' }
  - { id: 2, class: vreg_64, preferred-register: '%2' }
  - { id: 3, class: vreg_64 }
  - { id: 4, class: vreg_64 }
  - { id: 5, class: vreg_64 }
  - { id: 6, class: vreg_96 }
  - { id: 7, class: vreg_96 }
  - { id: 8, class: vreg_128 }
  - { id: 9, class: vreg_128 }
liveins:
  - { reg: '%sgpr6', virtual-reg: '%1' }
frameInfo:
  isFrameAddressTaken: false
  isReturnAddressTaken: false
  hasStackMap: false
  hasPatchPoint: false
  stackSize: 0
  offsetAdjustment: 0
  maxAlignment: 0
  adjustsStack: false
  hasCalls: false
  maxCallFrameSize: 0
  hasOpaqueSPAdjustment: false
  hasVAStart: false
  hasMustTailInVarArgFunc: false
body: |
  bb.0.entry:
    liveins: %sgpr0, %vgpr0_vgpr1

    %3 = IMPLICIT_DEF
    undef %4.sub0 = COPY %sgpr0
    %4.sub1 = COPY %3.sub0
    undef %5.sub0 = COPY %4.sub1
    %5.sub1 = COPY %4.sub0
    FLAT_STORE_DWORDX2 %vgpr0_vgpr1, killed %5, 0, 0, 0, implicit %exec, implicit %flat_scr

    %6 = IMPLICIT_DEF
    undef %7.sub0_sub1 = COPY %6
    %7.sub2 = COPY %3.sub0
    FLAT_STORE_DWORDX3 %vgpr0_vgpr1, killed %7, 0, 0, 0, implicit %exec, implicit %flat_scr

    %8 = IMPLICIT_DEF
    undef %9.sub0_sub1_sub2 = COPY %8
    %9.sub3 = COPY %3.sub0
    FLAT_STORE_DWORDX4 %vgpr0_vgpr1, killed %9, 0, 0, 0, implicit %exec, implicit %flat_scr
...

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not allow register coalescer to create big superregs
ClosedPublic
