This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not allow register coalescer to create big superregs
ClosedPublic

Authored by rampitec on Jan 16 2017, 2:51 PM.

Details

Summary

Limit register coalescer by not allowing it to artificially increase
size of registers beyond dword. Such super-registers are in fact
register sequences and not distinct HW registers.

With more super-regs we would need to allocate adjacent registers
and constraint regalloc more than needed. Moreover, our super
registers are overlapping. For instance we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4 etc, which complicates registers
allocation even more, resulting in excessive spilling.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 16 2017, 2:51 PM
arsenm added inline comments.Jan 16 2017, 2:55 PM
test/CodeGen/AMDGPU/limit-coalesce.mir
3–4 ↗(On Diff #84600)

This should use positive checks

rampitec updated this revision to Diff 84603.Jan 16 2017, 3:17 PM

Updated test for positive checks.

rampitec marked an inline comment as done.Jan 16 2017, 3:17 PM

Pre-checkin passed.

I think a little more experimentation here might be worthwhile. It's not obvious to me that this is the right heuristic. Allowing 8 or wider might be beneficial. With subregister liveness tracking I would hope that there wouldn't be much difference for 2-4 register tuples. For the larger registers I could see there being more issues.

I see a very small improvement in shader-db with this as is:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494573 (-0.00 %)
VGPRS: 941553 -> 941353 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40835224 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298015 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 19168 -> 19152 (-0.08 %)
VGPRS: 15952 -> 15752 (-1.25 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 890656 -> 894328 (0.41 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 2197 -> 2230 (1.50 %)
Wait states: 0 -> 0 (0.00 %)

If I increase the threshold to 8 I see slightly better improvements:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494549 (-0.00 %)
VGPRS: 941553 -> 941377 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40834176 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298014 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 10664 -> 10624 (-0.38 %)
VGPRS: 10624 -> 10448 (-1.66 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 627904 -> 630528 (0.42 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 1111 -> 1143 (2.88 %)
Wait states: 0 -> 0 (0.00 %)

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗(On Diff #84603)

This isn't being used for the spill size, so this is supposed to use getRegBitWidth

1491–1493 ↗(On Diff #84603)

We don't have sub-dword registers, so the < and comment are misleading

test/CodeGen/AMDGPU/limit-coalesce.mir
53 ↗(On Diff #84603)

Can you add more tests for more register sizes?

This is a very conservative limitation to fix bloat in clFFT, where it saves ~600 bytes of scratch per kernel by creating vreg_96 from vreg_64. I have no doubt this place will be revisited much more times to improve heuristics as more codes are analyzed. This is really just a start of it.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗(On Diff #84603)

What do you mean?

/// getSize - Return the size of the register in bytes, which is also the size
/// of a stack slot allocated to hold a spilled copy of this register.
1491–1493 ↗(On Diff #84603)

This is for packed f16, we do not want to revisit this.

rampitec updated this revision to Diff 84624.Jan 16 2017, 7:51 PM

Modified test to add more sizes.

rampitec marked an inline comment as done.Jan 16 2017, 7:51 PM
arsenm added inline comments.Jan 16 2017, 8:36 PM
lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗(On Diff #84603)

Since https://reviews.llvm.org/D24631 the TargetRegisterClass is supposed to be considered only the spill size, which may be different from the register bit width

1491–1493 ↗(On Diff #84603)

Even with packed f16 we don't have smaller sub registers than 4

rampitec updated this revision to Diff 84630.Jan 16 2017, 9:36 PM

Updated comment concerning sub-dwords.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486 ↗(On Diff #84603)

It is not yet committed, and AMDGPU::getRegBitWidth() is not even close to handle all register classes. When it is submitted it can be reconsidered, but I guess it should auto generate defaults to prevent "Unexpected register class" assertion.

1491–1493 ↗(On Diff #84603)

I'm not sure it will stay this way in the future. Anyway, comment is updated, but I do not want to remove "<".

vpykhtin accepted this revision.Jan 18 2017, 5:56 AM

LGTM.

This revision is now accepted and ready to land.Jan 18 2017, 5:56 AM
This revision was automatically updated to reflect the committed changes.