This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not allow register coalescer to create big superregs
ClosedPublic

Authored by rampitec on Jan 16 2017, 2:51 PM.

Details

Summary

Limit register coalescer by not allowing it to artificially increase
size of registers beyond dword. Such super-registers are in fact
register sequences and not distinct HW registers.

With more super-regs we would need to allocate adjacent registers
and constraint regalloc more than needed. Moreover, our super
registers are overlapping. For instance we have VGPR0_VGPR1_VGPR2,
VGPR1_VGPR2_VGPR3, VGPR2_VGPR3_VGPR4 etc, which complicates registers
allocation even more, resulting in excessive spilling.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 16 2017, 2:51 PM
arsenm added inline comments.Jan 16 2017, 2:55 PM
test/CodeGen/AMDGPU/limit-coalesce.mir
4–5

This should use positive checks

rampitec updated this revision to Diff 84603.Jan 16 2017, 3:17 PM

Updated test for positive checks.

rampitec marked an inline comment as done.Jan 16 2017, 3:17 PM

Pre-checkin passed.

I think a little more experimentation here might be worthwhile. It's not obvious to me that this is the right heuristic. Allowing 8 or wider might be beneficial. With subregister liveness tracking I would hope that there wouldn't be much difference for 2-4 register tuples. For the larger registers I could see there being more issues.

I see a very small improvement in shader-db with this as is:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494573 (-0.00 %)
VGPRS: 941553 -> 941353 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40835224 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298015 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 19168 -> 19152 (-0.08 %)
VGPRS: 15952 -> 15752 (-1.25 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 890656 -> 894328 (0.41 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 2197 -> 2230 (1.50 %)
Wait states: 0 -> 0 (0.00 %)

If I increase the threshold to 8 I see slightly better improvements:

34622 shaders in 21459 tests
Totals:
SGPRS: 1494589 -> 1494549 (-0.00 %)
VGPRS: 941553 -> 941377 (-0.02 %)
Spilled SGPRs: 1348 -> 1348 (0.00 %)
Spilled VGPRs: 109 -> 109 (0.00 %)
Private memory VGPRs: 1644 -> 1644 (0.00 %)
Scratch size: 3320 -> 3320 (0.00 %) dwords per thread
Code Size: 40831552 -> 40834176 (0.01 %) bytes
LDS: 3021 -> 3021 (0.00 %) blocks
Max Waves: 297982 -> 298014 (0.01 %)
Wait states: 0 -> 0 (0.00 %)

Totals from affected shaders:
SGPRS: 10664 -> 10624 (-0.38 %)
VGPRS: 10624 -> 10448 (-1.66 %)
Spilled SGPRs: 0 -> 0 (0.00 %)
Spilled VGPRs: 0 -> 0 (0.00 %)
Private memory VGPRs: 0 -> 0 (0.00 %)
Scratch size: 0 -> 0 (0.00 %) dwords per thread
Code Size: 627904 -> 630528 (0.42 %) bytes
LDS: 0 -> 0 (0.00 %) blocks
Max Waves: 1111 -> 1143 (2.88 %)
Wait states: 0 -> 0 (0.00 %)

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486

This isn't being used for the spill size, so this is supposed to use getRegBitWidth

1491–1493

We don't have sub-dword registers, so the < and comment are misleading

test/CodeGen/AMDGPU/limit-coalesce.mir
54

Can you add more tests for more register sizes?

This is a very conservative limitation to fix bloat in clFFT, where it saves ~600 bytes of scratch per kernel by creating vreg_96 from vreg_64. I have no doubt this place will be revisited much more times to improve heuristics as more codes are analyzed. This is really just a start of it.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486

What do you mean?

/// getSize - Return the size of the register in bytes, which is also the size
/// of a stack slot allocated to hold a spilled copy of this register.
1491–1493

This is for packed f16, we do not want to revisit this.

rampitec updated this revision to Diff 84624.Jan 16 2017, 7:51 PM

Modified test to add more sizes.

rampitec marked an inline comment as done.Jan 16 2017, 7:51 PM
arsenm added inline comments.Jan 16 2017, 8:36 PM
lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486

Since https://reviews.llvm.org/D24631 the TargetRegisterClass is supposed to be considered only the spill size, which may be different from the register bit width

1491–1493

Even with packed f16 we don't have smaller sub registers than 4

rampitec updated this revision to Diff 84630.Jan 16 2017, 9:36 PM

Updated comment concerning sub-dwords.

lib/Target/AMDGPU/SIRegisterInfo.cpp
1484–1486

It is not yet committed, and AMDGPU::getRegBitWidth() is not even close to handle all register classes. When it is submitted it can be reconsidered, but I guess it should auto generate defaults to prevent "Unexpected register class" assertion.

1491–1493

I'm not sure it will stay this way in the future. Anyway, comment is updated, but I do not want to remove "<".

vpykhtin accepted this revision.Jan 18 2017, 5:56 AM

LGTM.

This revision is now accepted and ready to land.Jan 18 2017, 5:56 AM
This revision was automatically updated to reflect the committed changes.