Details

Reviewers

tstellarAMD
vpykhtin
alex-t

Commits

rGa4e63ead4b43: [AMDGPU] Do not allow register coalescer to create big superregs
rL292413: [AMDGPU] Do not allow register coalescer to create big superregs

Summary

Limit the register coalescer by not allowing it to artificially increase
the size of registers beyond a dword. Such super-registers are in fact
register sequences and not distinct HW registers.

With more super-regs we would need to allocate adjacent registers
and constrain regalloc more than needed. Moreover, our super-registers
overlap. For instance, we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4, etc., which complicates register
allocation even more, resulting in excessive spilling.
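
To make the overlap concrete, here is a minimal sketch. It models a register tuple as a run of consecutive VGPR indices. The `Tuple` struct and the `overlaps` function are hypothetical names used only for illustration; they are not part of LLVM.

```cpp
// Hypothetical model: a tuple covers VGPR indices [First, First + Width).
struct Tuple {
  unsigned First;
  unsigned Width;
};

// Two tuples overlap when their index ranges intersect.
constexpr bool overlaps(Tuple A, Tuple B) {
  return A.First < B.First + B.Width && B.First < A.First + A.Width;
}

// VGPR0_VGPR1_VGPR2 and VGPR2_VGPR3_VGPR4 share VGPR2.
static_assert(overlaps(Tuple{0, 3}, Tuple{2, 3}), "tuples share VGPR2");
// VGPR0_VGPR1_VGPR2 and VGPR3_VGPR4_VGPR5 are disjoint.
static_assert(!overlaps(Tuple{0, 3}, Tuple{3, 3}), "tuples are disjoint");
```

Assigning one tuple makes every tuple that intersects it unavailable, so a wide tuple takes more candidates away from the allocator than the same number of independent dwords would.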

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision. Jan 16 2017, 2:51 PM

Herald added a reviewer: tstellarAMD. Jan 16 2017, 2:51 PM

Herald added subscribers: tony-tye, yaxunl, nhaehnle and 3 others.

arsenm added inline comments. Jan 16 2017, 2:55 PM

test/CodeGen/AMDGPU/limit-coalesce.mir
4–5	This should use positive checks

Updated test for positive checks.

rampitec marked an inline comment as done. Jan 16 2017, 3:17 PM

Pre-checkin passed.

I think a little more experimentation here might be worthwhile. It's not obvious to me that this is the right heuristic. Allowing 8 or wider might be beneficial. With subregister liveness tracking I would hope that there wouldn't be much difference for 2-4 register tuples. For the larger registers I could see there being more issues.

I see a very small improvement in shader-db with this as is:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494573 (-0.00 %)
VGPRS: 941553 -> 941353 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40835224 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298015 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 19168 -> 19152 (-0.08 %)
VGPRS: 15952 -> 15752 (-1.25 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 890656 -> 894328 (0.41 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 2197 -> 2230 (1.50 %)
Wait states: 0 -> 0 (0.00 %)
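
For readers checking the figures, the percentages in parentheses appear to be the plain relative change between the two values. The sketch below assumes that formula (it is not taken from the shader-db scripts), and `percentChange` is a hypothetical helper name.

```cpp
// Relative change in percent; assumed to match the parenthesized figures.
constexpr double percentChange(double Before, double After) {
  return (After - Before) / Before * 100.0;
}

// VGPRS in affected shaders: 15952 -> 15752, reported as -1.25 %.
static_assert(percentChange(15952, 15752) < -1.25 &&
              percentChange(15952, 15752) > -1.26, "about -1.25 %");
// Max Waves in affected shaders: 2197 -> 2230, reported as 1.50 %.
static_assert(percentChange(2197, 2230) > 1.50 &&
              percentChange(2197, 2230) < 1.51, "about 1.50 %");
```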

If I increase the threshold to 8 I see slightly better improvements:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494549 (-0.00 %)
VGPRS: 941553 -> 941377 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40834176 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298014 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 10664 -> 10624 (-0.38 %)
VGPRS: 10624 -> 10448 (-1.66 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 627904 -> 630528 (0.42 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 1111 -> 1143 (2.88 %)
Wait states: 0 -> 0 (0.00 %)
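
The comment does not say exactly what was changed to "increase the threshold to 8". One plausible reading, and it is only an assumption, is that the constant in the `<= 4` test of the patch was raised to 8 bytes. The sketch below parameterizes that constant so the two settings can be compared; `shouldCoalesceWithLimit` and `LimitBytes` are hypothetical names.

```cpp
// Hypothetical variant of the heuristic with the always-allowed size as a
// parameter. All sizes, including LimitBytes, are in bytes (a dword is 4).
constexpr bool shouldCoalesceWithLimit(unsigned SrcSize, unsigned DstSize,
                                       unsigned NewSize, unsigned LimitBytes) {
  return SrcSize <= LimitBytes || DstSize <= LimitBytes ||
         NewSize <= DstSize || NewSize <= SrcSize;
}

// Merging two 64-bit registers into a 96-bit one is rejected with a limit of 4
static_assert(!shouldCoalesceWithLimit(8, 8, 12, 4), "rejected at 4");
// but accepted once the limit is raised to 8.
static_assert(shouldCoalesceWithLimit(8, 8, 12, 8), "accepted at 8");
// Growing a 96-bit register to 128 bits is rejected under both limits.
static_assert(!shouldCoalesceWithLimit(12, 12, 16, 4), "rejected at 4");
static_assert(!shouldCoalesceWithLimit(12, 12, 16, 8), "rejected at 8");
```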

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486	This isn't being used for the spill size, so this is supposed to use getRegBitWidth
1491–1493	We don't have sub-dword registers, so the < and comment are misleading
test/CodeGen/AMDGPU/limit-coalesce.mir
54	Can you add more tests for more register sizes?

This is a very conservative limitation to fix bloat in clFFT, where it saves ~600 bytes of scratch per kernel by creating vreg_96 from vreg_64. I have no doubt this place will be revisited many more times to improve the heuristics as more code is analyzed. This is really just the start.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486	What do you mean? The doc comment reads: "getSize - Return the size of the register in bytes, which is also the size of a stack slot allocated to hold a spilled copy of this register."
1491–1493	This is for packed f16, we do not want to revisit this.

Modified test to add more sizes.

rampitec marked an inline comment as done. Jan 16 2017, 7:51 PM

arsenm added inline comments. Jan 16 2017, 8:36 PM

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486	Since https://reviews.llvm.org/D24631 the TargetRegisterClass size is supposed to be treated as the spill size only, which may be different from the register bit width
1491–1493	Even with packed f16 we don't have smaller sub registers than 4

Updated comment concerning sub-dwords.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486	It is not yet committed, and AMDGPU::getRegBitWidth() is not even close to handling all register classes. When it is submitted this can be reconsidered, but I guess it should auto-generate defaults to prevent the "Unexpected register class" assertion.
1491–1493	I'm not sure it will stay this way in the future. Anyway, comment is updated, but I do not want to remove "<".

LGTM.

This revision is now accepted and ready to land. Jan 18 2017, 5:56 AM

Closed by commit rL292413: [AMDGPU] Do not allow register coalescer to create big superregs (authored by rampitec). Jan 18 2017, 9:41 AM

This revision was automatically updated to reflect the committed changes.

Diff 84630

lib/Target/AMDGPU/SIRegisterInfo.h

@@ public:
   /// if explicitly requested value cannot be converted to integer, violates
   /// subtarget's specifications, or does not meet number of waves per execution
   /// unit requirement.
   unsigned getMaxNumVGPRs(const MachineFunction &MF) const;
 
   ArrayRef<int16_t> getRegSplitParts(const TargetRegisterClass *RC,
                                      unsigned EltSize) const;
 
+  bool shouldCoalesce(MachineInstr *MI,
+                      const TargetRegisterClass *SrcRC,
+                      unsigned SubReg,
+                      const TargetRegisterClass *DstRC,
+                      unsigned DstSubReg,
+                      const TargetRegisterClass *NewRC) const override;
+
 private:
   void buildSpillLoadStore(MachineBasicBlock::iterator MI,
                            unsigned LoadStoreOp,
                            int Index,
                            unsigned ValueReg,
                            bool ValueIsKill,
                            unsigned ScratchRsrcReg,
                            unsigned ScratchOffsetReg,
                            int64_t InstrOffset,
                            MachineMemOperand *MMO,
                            RegScavenger *RS) const;
 };
 
 } // End namespace llvm
 
 #endif

lib/Target/AMDGPU/SIRegisterInfo.cpp

@@ SIRegisterInfo::getRegClassForReg(const MachineRegisterInfo &MRI,
 
   return getPhysRegClass(Reg);
 }
 
 bool SIRegisterInfo::isVGPR(const MachineRegisterInfo &MRI,
                             unsigned Reg) const {
   return hasVGPRs(getRegClassForReg(MRI, Reg));
 }
+
+bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,
+                                    const TargetRegisterClass *SrcRC,
+                                    unsigned SubReg,
+                                    const TargetRegisterClass *DstRC,
+                                    unsigned DstSubReg,
+                                    const TargetRegisterClass *NewRC) const {
+  unsigned SrcSize = SrcRC->getSize();
+  unsigned DstSize = DstRC->getSize();
+  unsigned NewSize = NewRC->getSize();
+
+  // Do not increase size of registers beyond dword, we would need to allocate
+  // adjacent registers and constraint regalloc more than needed.
+
+  // Always allow dword coalescing.
+  if (SrcSize <= 4 || DstSize <= 4)
+    return true;
+
+  return NewSize <= DstSize || NewSize <= SrcSize;
+}
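
The decision above depends only on three byte sizes, so it can be tried in isolation. The following is a minimal standalone restatement, not LLVM code: `shouldCoalesceBySize` is a hypothetical name, and the sizes passed in stand for what `getSize()` returns (8 for a 64-bit class such as vreg_64, 12 for vreg_96, 16 for vreg_128).

```cpp
// Standalone restatement of the size check; all sizes are in bytes.
constexpr bool shouldCoalesceBySize(unsigned SrcSize, unsigned DstSize,
                                    unsigned NewSize) {
  // Always allow coalescing when either side is a dword (4 bytes) or smaller.
  // Otherwise the result must be no wider than the wider of the two inputs.
  return (SrcSize <= 4 || DstSize <= 4) ||
         (NewSize <= DstSize || NewSize <= SrcSize);
}

// A dword copied into a 128-bit register: allowed.
static_assert(shouldCoalesceBySize(4, 16, 16), "dword coalescing");
// A 64-bit value merged into an existing 96-bit register: allowed.
static_assert(shouldCoalesceBySize(8, 12, 12), "result no wider than dst");
// Two 64-bit registers merged into a new 96-bit register: rejected.
static_assert(!shouldCoalesceBySize(8, 8, 12), "would create a wider tuple");
```

The last case corresponds to the situation mentioned in the discussion, where a vreg_96 would be created from vreg_64 operands.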

test/CodeGen/AMDGPU/half.ll

 ; XVI: buffer_load_dwordx2 [[LOAD:v\[[0-9]+:[0-9]+\]]]
 ; XVI-DAG: v_lshrrev_b32_e32 {{v[0-9]+}}, 16, {{v[0-9]+}}
 ; XVI: v_cvt_f32_f16_e32
 ; XVI: v_cvt_f32_f16_e32
 ; XVI: v_cvt_f32_f16_e32
 ; XVI-NOT: v_cvt_f32_f16
 
 ; GCN: buffer_load_dwordx2 v{{\[}}[[IN_LO:[0-9]+]]:[[IN_HI:[0-9]+]]
-; VI: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
-; GCN: v_cvt_f32_f16_e32 [[Z32:v[0-9]+]], v[[IN_HI]]
-; GCN: v_cvt_f32_f16_e32 [[X32:v[0-9]+]], v[[IN_LO]]
+; VI-DAG: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[Z32:v[0-9]+]], v[[IN_HI]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[X32:v[0-9]+]], v[[IN_LO]]
 ; SI: v_lshrrev_b32_e32 [[Y16:v[0-9]+]], 16, v[[IN_LO]]
-; GCN: v_cvt_f32_f16_e32 [[Y32:v[0-9]+]], [[Y16]]
+; GCN-DAG: v_cvt_f32_f16_e32 [[Y32:v[0-9]+]], [[Y16]]
 
-; GCN: v_cvt_f64_f32_e32 [[Z:v\[[0-9]+:[0-9]+\]]], [[Z32]]
-; GCN: v_cvt_f64_f32_e32 v{{\[}}[[XLO:[0-9]+]]:{{[0-9]+}}], [[X32]]
-; GCN: v_cvt_f64_f32_e32 v[{{[0-9]+}}:[[YHI:[0-9]+]]{{\]}}, [[Y32]]
+; GCN-DAG: v_cvt_f64_f32_e32 [[Z:v\[[0-9]+:[0-9]+\]]], [[Z32]]
+; GCN-DAG: v_cvt_f64_f32_e32 v{{\[}}[[XLO:[0-9]+]]:{{[0-9]+}}], [[X32]]
+; GCN-DAG: v_cvt_f64_f32_e32 v[{{[0-9]+}}:[[YHI:[0-9]+]]{{\]}}, [[Y32]]
 ; GCN-NOT: v_cvt_f64_f32_e32
 
 ; GCN-DAG: buffer_store_dwordx4 v{{\[}}[[XLO]]:[[YHI]]{{\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, 0{{$}}
 ; GCN-DAG: buffer_store_dwordx2 [[Z]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:16
 ; GCN: s_endpgm
 define void @global_extload_v3f16_to_v3f64(<3 x double> addrspace(1)* %out, <3 x half> addrspace(1)* %in) #0 {
   %val = load <3 x half>, <3 x half> addrspace(1)* %in
   %cvt = fpext <3 x half> %val to <3 x double>

test/CodeGen/AMDGPU/limit-coalesce.mir

This file was added.

# RUN: llc -march=amdgcn -run-pass simple-register-coalescing -o - %s | FileCheck %s

# Check that coalescer does not create wider register tuple than in source

# CHECK: - { id: 2, class: vreg_64 }
# CHECK: - { id: 3, class: vreg_64 }
# CHECK: - { id: 4, class: vreg_64 }
# CHECK: - { id: 5, class: vreg_96 }
# CHECK: - { id: 6, class: vreg_96 }
# CHECK: - { id: 7, class: vreg_128 }
# CHECK: - { id: 8, class: vreg_128 }
# No more registers shall be defined
# CHECK-NEXT: liveins:
# CHECK: FLAT_STORE_DWORDX2 %vgpr0_vgpr1, %4,
# CHECK: FLAT_STORE_DWORDX3 %vgpr0_vgpr1, %6,

---
name: main
alignment: 0
exposesReturnsTwice: false
legalized: false
regBankSelected: false
selected: false
tracksRegLiveness: true
registers:
  - { id: 1, class: sreg_32_xm0, preferred-register: '%1' }
  - { id: 2, class: vreg_64, preferred-register: '%2' }
  - { id: 3, class: vreg_64 }
  - { id: 4, class: vreg_64 }
  - { id: 5, class: vreg_64 }
  - { id: 6, class: vreg_96 }
  - { id: 7, class: vreg_96 }
  - { id: 8, class: vreg_128 }
  - { id: 9, class: vreg_128 }
liveins:
  - { reg: '%sgpr6', virtual-reg: '%1' }
frameInfo:
  isFrameAddressTaken: false
  isReturnAddressTaken: false
  hasStackMap: false
  hasPatchPoint: false
  stackSize: 0
  offsetAdjustment: 0
  maxAlignment: 0
  adjustsStack: false
  hasCalls: false
  maxCallFrameSize: 0
  hasOpaqueSPAdjustment: false
  hasVAStart: false
  hasMustTailInVarArgFunc: false
body: |
  bb.0.entry:
    liveins: %sgpr0, %vgpr0_vgpr1

    %3 = IMPLICIT_DEF
    undef %4.sub0 = COPY %sgpr0
    %4.sub1 = COPY %3.sub0
    undef %5.sub0 = COPY %4.sub1
    %5.sub1 = COPY %4.sub0
    FLAT_STORE_DWORDX2 %vgpr0_vgpr1, killed %5, 0, 0, 0, implicit %exec, implicit %flat_scr

    %6 = IMPLICIT_DEF
    undef %7.sub0_sub1 = COPY %6
    %7.sub2 = COPY %3.sub0
    FLAT_STORE_DWORDX3 %vgpr0_vgpr1, killed %7, 0, 0, 0, implicit %exec, implicit %flat_scr

    %8 = IMPLICIT_DEF
    undef %9.sub0_sub1_sub2 = COPY %8
    %9.sub3 = COPY %3.sub0
    FLAT_STORE_DWORDX4 %vgpr0_vgpr1, killed %9, 0, 0, 0, implicit %exec, implicit %flat_scr
...

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not allow register coalescer to create big superregs
ClosedPublic
