This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Optimize SGPR to scratch spilling
AbandonedPublic

Authored by sebastian-ne on Feb 11 2021, 8:33 AM.

Details

Summary

Refactor and optimize spilling SGPRs to scratch. Remove unnecessary
EXEC copies, search for a free VGPR with the RegScavenger, and save
inactive lanes only if one is found.
Some parts of saving and restoring SGPRs are commoned up in
SGPRSpillBuilder.

Follow-up to D96336.

Diff Detail

Event Timeline

sebastian-ne created this revision. Feb 11 2021, 8:33 AM
sebastian-ne requested review of this revision. Feb 11 2021, 8:33 AM
Herald added a project: Restricted Project. Feb 11 2021, 8:33 AM
arsenm added inline comments. Feb 16 2021, 3:18 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1187

I don't think limiting the lanes helps any. Are you 100% sure this is worth the extra complexity? Does it actually reduce bandwidth?

1188

Specific feature checks are better than generation version checks

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll
764–765

I thought you renamed this?

sebastian-ne marked 2 inline comments as done.

Rebase and add ST.hasSaveexecB32() instead of checking architecture.

arsenm accepted this revision. Feb 18 2021, 5:21 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/GCNSubtarget.h
927

Capitalize Exec?

This revision is now accepted and ready to land. Feb 18 2021, 5:21 AM
hliao added a subscriber: hliao. Edited Feb 18 2021, 4:25 PM

Shall we optimize the cases where only 1 or 2 SGPRs are to be spilled or reloaded when there's a VGPR scavenged? In this case, we only need one or two loads/stores to spill/reload that SGPR. In terms of the number of loads/stores, the original approach (based on broadcast and v_readfirstlane) is still OK, but needs less code. Also, there is no latency from restoring that temporary VGPR.

hliao requested changes to this revision. Feb 18 2021, 4:30 PM
This revision now requires changes to proceed. Feb 18 2021, 4:30 PM

> Shall we optimize the cases where only 1 or 2 SGPRs are to be spilled or reloaded when there's a VGPR scavenged? In this case, we only need one or two loads/stores to spill/reload that SGPR.

The “v_mov and readfirstlane” approach doesn’t work when exec = 0.

However, you sparked an idea:
If we can scavenge an SGPR, we can use that to save the VGPR lanes that we clobber.

For example, we want to spill s0 to scratch and s5 is currently unused:

v_readlane_b32 s5, v0, 0     ; Save v0
v_writelane_b32 v0, s0, 0    ; Save s0 to v0 and to memory
s_mov_b32 s0, exec
s_mov_b32 exec, 1
buffer_store_dword_offset v0, …
s_mov_b32 exec, s0
v_writelane_b32 v0, s5, 0    ; Restore v0
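As a sanity check on the sequence above, here is a toy Python model of the lane bookkeeping (purely illustrative, not real ISA semantics; a small 4-lane wave stands in for wave32, and a `saved_exec` variable plays the role of s0 holding exec):

```python
WAVE = 4  # toy wave size for illustration

def readlane(vgpr, lane):             # v_readlane ignores exec
    return vgpr[lane]

def writelane(vgpr, val, lane):       # v_writelane ignores exec
    vgpr[lane] = val

def buffer_store(mem, vgpr, exec_mask):  # stores only active lanes
    for lane in range(WAVE):
        if exec_mask >> lane & 1:
            mem[lane] = vgpr[lane]

v0 = [10, 11, 12, 13]   # live VGPR we must not clobber
s0, s5 = 42, None       # s0 is the SGPR being spilled, s5 is scavenged
exec_mask, mem = 0b1111, {}

s5 = readlane(v0, 0)        # save v0 lane 0
writelane(v0, s0, 0)        # put s0 into lane 0
saved_exec = exec_mask      # s_mov_b32 s0, exec in the real sequence
exec_mask = 1               # only lane 0 stores
buffer_store(mem, v0, exec_mask)
exec_mask = saved_exec      # restore exec
writelane(v0, s5, 0)        # restore v0 lane 0

assert mem == {0: 42}               # s0 reached memory
assert v0 == [10, 11, 12, 13]       # v0 is unchanged
```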

Restoring:

v_readlane_b32 s5, v0, 0     ; Save v0
v_writelane_b32 v0, s0, 0
s_mov_b32 s0, exec
s_mov_b32 exec, 1
buffer_load_dword_offset v0, …  ; Read v0 from memory (moved into s0 below)
s_mov_b32 exec, s0
v_readlane_b32 s0, v0, 0
v_writelane_b32 v0, s5, 0    ; Restore v0
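The restore sequence can be modeled the same way (again a toy sketch, not real ISA semantics; the spilled value 42 is assumed to already sit in lane 0's scratch slot):

```python
WAVE = 4  # toy wave size for illustration

def readlane(vgpr, lane):             # v_readlane ignores exec
    return vgpr[lane]

def writelane(vgpr, val, lane):       # v_writelane ignores exec
    vgpr[lane] = val

def buffer_load(mem, vgpr, exec_mask):  # loads only active lanes
    for lane in range(WAVE):
        if exec_mask >> lane & 1:
            vgpr[lane] = mem[lane]

v0 = [10, 11, 12, 13]   # live VGPR we must not clobber
mem = {0: 42}           # scratch slot holding the spilled SGPR value
s0, s5 = 7, None        # s0 is the reload destination
exec_mask = 0b1111

s5 = readlane(v0, 0)        # save v0 lane 0
writelane(v0, s0, 0)        # park old s0 in lane 0 (s0 is about to hold exec)
s0 = exec_mask              # s_mov_b32 s0, exec
exec_mask = 1               # only lane 0 loads
buffer_load(mem, v0, exec_mask)
exec_mask = s0              # restore exec
s0 = readlane(v0, 0)        # s0 now holds the reloaded value
writelane(v0, s5, 0)        # restore v0 lane 0

assert s0 == 42                     # the spilled value came back
assert v0 == [10, 11, 12, 13]       # v0 is unchanged
```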

The downside is that it makes the code even more complicated, especially restoring, as we need to ensure that exec is exactly 1 so we do not clobber other lanes. The above code would therefore only work in wave32 mode, not in wave64 mode. The exception is when v0 is a scavenged register, i.e. it is unused in the currently active lanes; in that case we are allowed to clobber the active lanes of v0, so the above code would also work in wave64 mode.
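To illustrate the clobbering concern, here is a toy model (a 4-lane wave stands in for the real wave size; not real ISA semantics): if any exec bit beyond lane 0 stays set during the masked load, live lanes of v0 are overwritten with whatever is in their scratch slots.

```python
WAVE = 4  # toy wave size

def buffer_load(mem, vgpr, exec_mask):  # loads only active lanes
    for lane in range(WAVE):
        if exec_mask >> lane & 1:
            vgpr[lane] = mem.get(lane, 0)

v0 = [10, 11, 12, 13]   # live VGPR
mem = {0: 42, 1: 99}    # lane 1's slot happens to hold stale data

# Intended: exec == 1, only lane 0 is overwritten.
a = list(v0)
buffer_load(mem, a, 0b0001)
assert a == [42, 11, 12, 13]

# Hazard: an extra exec bit stays set, so lane 1 of v0 is clobbered.
b = list(v0)
buffer_load(mem, b, 0b0011)
assert b == [42, 99, 12, 13]
```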

I think once we're spilling SGPRs to memory with these tricks, you've given up on performance. We just need *anything* that works, and put effort into guaranteeing we never need to do this (aside from the degenerate inline-asm case using all VGPRs). I'm not worried about getting the ideal SGPR-to-memory sequence.

hliao added a comment. Feb 22 2021, 7:46 AM

Why is the exec mask = 0 case a valid one? Won't we already have branched away once the exec mask goes to zero?

hliao added a comment. Feb 22 2021, 7:47 AM

> I think once we're spilling SGPRs to memory with these tricks, you've given up on performance. We just need *anything* that works, and put effort into guaranteeing we never need to do this (aside from the degenerate inline-asm case using all VGPRs). I'm not worried about getting the ideal SGPR-to-memory sequence.

Spilling doesn't mean we give up on performance. Especially for large workloads, spilling is inevitable, and spill performance still matters.

> Why is the exec mask = 0 case a valid one? Won't we already have branched away once the exec mask goes to zero?

That is the question it comes down to. If it is guaranteed that exec is never 0, i.e. at least one bit is always set, I’m in favor of your patch.

To have some numbers, I saw some functions spilling 30 SGPRs to scratch, so it can be more than just one or two.

> > Why is the exec mask = 0 case a valid one? Won't we already have branched away once the exec mask goes to zero?

> That is the question it comes down to. If it is guaranteed that exec is never 0, i.e. at least one bit is always set, I’m in favor of your patch.

> To have some numbers, I saw some functions spilling 30 SGPRs to scratch, so it can be more than just one or two.

The question is whether we have 30 separate SGPR spills or one spill of an SGPR register sequence of up to 30 SGPRs. For the former, we cannot coalesce them together, as the spill/reload points would differ quite a lot. Only in the latter case do we benefit from packing that sequence of SGPRs into a VGPR. For that case, I suggest we spill sequences of 1 or 2 SGPRs with this approach and use the packing approach otherwise.
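The packing idea for a contiguous SGPR sequence can be sketched like this (toy model, not real ISA semantics: v_writelane gathers N SGPRs into one VGPR's lanes so that a single exec-masked store spills them all at once):

```python
WAVE = 8  # toy wave size

def writelane(vgpr, val, lane):  # v_writelane ignores exec
    vgpr[lane] = val

sgprs = [100, 101, 102, 103]     # a sequence of 4 SGPRs to spill
v_tmp = [0] * WAVE               # temporary VGPR used for packing

for i, s in enumerate(sgprs):    # pack one SGPR per lane
    writelane(v_tmp, s, i)

# One store with exec covering lanes 0..3 spills all four at once,
# instead of four separate broadcast/readfirstlane round trips.
mem = {}
exec_mask = (1 << len(sgprs)) - 1
for lane in range(WAVE):
    if exec_mask >> lane & 1:
        mem[lane] = v_tmp[lane]

assert mem == {0: 100, 1: 101, 2: 102, 3: 103}
```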

Back to the exec mask 0 concern: my point is that changes to the exec mask follow the user-code semantics. If the exec mask goes to 0, the user code should expect no code in that BB to be executed. If we translate that correctly, we should have no code executing when the exec mask is 0. The case where code is still executed with a mask of 0 is when we optimize away a branch because it is too short. In that case, we should check whether the branch has unwanted side effects, or whether there are instructions that do not honor the exec mask; if we find any, we should not remove the branch. Following that, no code should execute when the exec mask goes to 0.

hliao added a comment. Feb 22 2021, 1:46 PM

> > Why is the exec mask = 0 case a valid one? Won't we already have branched away once the exec mask goes to zero?

> That is the question it comes down to. If it is guaranteed that exec is never 0, i.e. at least one bit is always set, I’m in favor of your patch.

> To have some numbers, I saw some functions spilling 30 SGPRs to scratch, so it can be more than just one or two.

I have updated D96980 to mark SI_SPILL_<n>_RESTORE as having unwanted side effects, so that the optimization won't generate SGPR reloads with a 0 exec mask.

> The question is whether we have 30 separate SGPR spills or one spill of an SGPR register sequence of up to 30 SGPRs.

It is 30 separate SGPR spills that happen directly after each other. They are not currently coalesced, but they could be.

The function looks roughly like this:

100 instructions of saving VGPRs and SGPRs
30 real instructions
100 instructions of restoring VGPRs and SGPRs

Performance is – as one would expect – not exactly thrilling. There are probably several issues there.

> > I think once we're spilling SGPRs to memory with these tricks, you've given up on performance. We just need *anything* that works, and put effort into guaranteeing we never need to do this (aside from the degenerate inline-asm case using all VGPRs). I'm not worried about getting the ideal SGPR-to-memory sequence.

> Spilling doesn't mean we give up on performance. Especially for large workloads, spilling is inevitable, and spill performance still matters.

I don't mean general spilling. I mean spilling SGPRs to memory. This is a path that only exists due to a compiler machinery issue, and should not legitimately happen. SGPRs should only spill to VGPRs.

> > The question is whether we have 30 separate SGPR spills or one spill of an SGPR register sequence of up to 30 SGPRs.

> It is 30 separate SGPR spills that happen directly after each other. They are not currently coalesced, but they could be.

> The function looks roughly like this:

> 100 instructions of saving VGPRs and SGPRs
> 30 real instructions
> 100 instructions of restoring VGPRs and SGPRs

> Performance is – as one would expect – not exactly thrilling. There are probably several issues there.

We see the same problem when we have large frame offsets: we end up separately scavenging a register and doing an add for every spill in the sequence to handle the offset. We would probably benefit from adding some kind of spill coalescing before these are lowered.

hliao resigned from this revision. Apr 5 2021, 8:02 AM
This revision is now accepted and ready to land. Apr 5 2021, 8:02 AM
sebastian-ne abandoned this revision. Apr 7 2021, 7:32 AM

Now merged into D96336.