This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Save VGPR of whole wave when spilling
ClosedPublic

Authored by sebastian-ne on Feb 9 2021, 6:18 AM.

Details

Summary

Spilling SGPRs to scratch uses a temporary VGPR. LLVM currently cannot
determine whether a VGPR is used in other lanes, so we need to save all
lanes of the VGPR, even when it is marked as dead.

The generated code depends on two things:

  • Can we scavenge an SGPR to save EXEC?
  • Can we scavenge a VGPR?

If we can scavenge an SGPR, we (see the sketch after this list)

  • save EXEC into the SGPR
  • set the needed lane mask
  • save the temporary VGPR
  • write the spilled SGPR into VGPR lanes
  • store the VGPR again, this time to the target stack slot
  • restore the VGPR
  • restore EXEC
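
For illustration, a minimal sketch of how this path could be emitted in
SIRegisterInfo.cpp (wave64 assumed; buildVGPRSpillLoadStore is a
hypothetical wrapper around the real buildSpillLoadStore, and the
register and frame-index variables are placeholders from context, not
the patch's exact code):

  // Sketch: an SGPR pair was scavenged, so EXEC can be saved and restored.
  const DebugLoc &DL = MI->getDebugLoc();
  // s_mov_b64 s[N:N+1], exec        ; save EXEC into the scavenged SGPRs
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), SavedExecReg)
      .addReg(AMDGPU::EXEC);
  // s_mov_b64 exec, VGPRLanes       ; enable only the lanes we spill to
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
      .addImm(VGPRLanes);
  // buffer_store_dword vT, ...      ; save the temporary VGPR
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TmpFI, /*IsLoad=*/false);
  // v_writelane_b32 vT, sS, Lane    ; write the spilled SGPR into a lane
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_WRITELANE_B32), TmpVGPR)
      .addReg(SpilledSGPR)
      .addImm(Lane)
      .addReg(TmpVGPR); // tied input; other lanes keep their values
  // buffer_store_dword vT, ...      ; store to the target stack slot
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TargetFI, /*IsLoad=*/false);
  // buffer_load_dword vT, ...       ; restore the temporary VGPR
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TmpFI, /*IsLoad=*/true);
  // s_mov_b64 exec, s[N:N+1]        ; restore EXEC
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
      .addReg(SavedExecReg, RegState::Kill);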

If we were not able to scavenge an SGPR, we do the same operations, but
every time the temporary VGPR is written to memory, we (see the sketch
after this list)

  • write VGPR to memory
  • flip exec (s_not exec, exec)
  • write VGPR again (previously inactive lanes)
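
A matching sketch for the fallback store (same caveats as the sketch
above; both stores use the same frame index, since each lane writes to
its own per-lane slot):

  // Sketch: no SGPR is available, so cover all lanes by flipping EXEC.
  // buffer_store_dword vT, ...      ; store the currently active lanes
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, FI, /*IsLoad=*/false);
  // s_not_b64 exec, exec            ; switch to the inactive lanes
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_NOT_B64), AMDGPU::EXEC)
      .addReg(AMDGPU::EXEC);
  // buffer_store_dword vT, ...      ; store the previously inactive lanes
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, FI, /*IsLoad=*/false);
  // s_not_b64 exec, exec            ; restore the original EXEC
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_NOT_B64), AMDGPU::EXEC)
      .addReg(AMDGPU::EXEC);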

Surprisingly often, we are able to scavenge an SGPR, even though we are
on the brink of running out of SGPRs.
Scavenging a VGPR does not have a great effect (it saves three
instructions if no SGPR was scavenged), but we need to know whether the
VGPR we use was live before; otherwise the machine verifier complains.

Diff Detail

Event Timeline

sebastian-ne created this revision.Feb 9 2021, 6:18 AM
sebastian-ne requested review of this revision.Feb 9 2021, 6:18 AM
Herald added a project: Restricted Project.Feb 9 2021, 6:18 AM
foad added inline comments.Feb 9 2021, 8:23 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

If the scavenger finds a vgpr that it thinks is dead, would that mean we only have to save the inactive lanes?

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

What causes this change?

sebastian-ne added inline comments.Feb 9 2021, 8:36 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

Yes, we can do that.

If you don’t mind, I’ll put that in a later patch. I’m currently preparing a patch on top of this one to get rid of the code duplication here.
That should also allow us to get rid of the temporary RegScavenger below.

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

Above these tested lines, the VGPR gets saved to scratch with a buffer_store_dword.
The same VGPR is the destination of the buffer_load_dword below, so waiting for expcnt(0) makes sure we do not overwrite it before the store has happened (the docs say expcnt waits until writes reach the last-level cache, so I guess the store→load dependency is the reason).

piotr added a subscriber: piotr.Feb 9 2021, 8:52 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1081

Not sure if that's a concern in this context, but doesn't it potentially clobber SCC?

foad added inline comments.Feb 9 2021, 9:05 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

If you don’t mind, I’ll put that in a later patch.

Sure. I was just checking my understanding.

I am a little dubious about the whole approach.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?
If it is truly rare, then I guess it matters a lot less how performant the resulting code is.

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1081

Clobbering SCC is an ongoing concern with spill code; however, buildSpillLoadStore can already generate arithmetic instructions which can clobber SCC, so this is not a new concern.
It's possible that if there is an issue, we are going to run into it faster, as this will clobber SCC every time.
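
For instance (an illustrative case, not code from this patch),
materializing a large frame offset takes a scalar add, and scalar ALU
instructions implicitly define SCC:

  // S_ADD_I32 implicitly defines SCC, clobbering whatever was live in it.
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), SOffsetReg)
      .addReg(FrameReg)
      .addImm(Offset);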

1094

Any reason for XOR rather than NOT?

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll
1109

These two instructions are not doing anything.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

We currently unconditionally reserve one VGPR for SGPR spills. I'm working on changing this so that we have the option of reserving a variable number of VGPRs based on some register pressure threshold. Spilling SGPRs to memory should be a last resort anyway, and I've seen the issue raised in this patch multiple times. It's worth having something less broken when we run out of lanes in reserved VGPRs.

I am a little dubious about the whole approach.

Me too, I’m also not happy about needing inline assembly, so if you have an idea to improve some or all of that, I’m all ears.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

Yes. If we knew that we need to spill an SGPR, we would just reserve a VGPR to spill the SGPR to and not spill to scratch at all. The problem is, we don’t know. (Matt plans to fix that by splitting register allocation into two phases, first allocating SGPRs, then VGPRs.)

I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?

I fear it’s less rare than we would like. We hit this bug in a not-so-big shader that was forced to run at high occupancy and was therefore limited to 64 VGPRs. However, it should become rare once register allocation always spills SGPRs to VGPRs.

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

That would make sense, but it seems the s_scratch instructions were removed in newer hardware.

We currently unconditionally reserve one VGPR for SGPR spills.

Interesting, I missed that. As the VGPR is reserved in SITargetLowering::finalizeLowering, this is currently not done for GlobalISel?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1094

No reason, I’ll change that. Thanks for the note.

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll
1109

Right, I’m working on fixing that in a later patch, same as Jay’s optimization.

Use s_not instead of s_xor.

  • Fake use inline asm should have sideeffects
  • Use SGPRs to save EXEC only if IsKill when restoring VGPR
  • Update spill-special-sgpr.mir test
arsenm added inline comments.Feb 11 2021, 6:29 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1029

Can you spell out the expected instruction sequence in the comment?

1065–1071

Definitely should not be introducing inline asm. Can't you just add an implicit def on the first instruction in the sequence, or introduce a special purpose pseudo?
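
For reference, a minimal sketch of the implicit-operand variant
(FirstMI and LastMI stand for the boundary instructions of the
expansion and are placeholders, not names from the patch):

  MachineFunction &MF = *MBB.getParent();
  // An implicit def on the first instruction tells the verifier that the
  // whole temporary VGPR is (re)defined here, so no stale liveness is read.
  FirstMI.addOperand(MF, MachineOperand::CreateReg(TmpVGPR, /*isDef=*/true,
                                                   /*isImp=*/true));
  // An implicit use on the last instruction keeps the restoring load alive.
  LastMI.addOperand(MF, MachineOperand::CreateReg(TmpVGPR, /*isDef=*/false,
                                                  /*isImp=*/true));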

1087

Do you mean identity copies?

1111

Can't you just add this as an implicit use to the last instruction in the sequence?

sebastian-ne added inline comments.Feb 11 2021, 7:22 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1065–1071

That sounds better, thanks. For stores, the first instruction is a buffer_store of the VGPR, …, so there is no previous instruction I can add a def to.
For a pseudo, what would be the best pass to remove it again? (Maybe SIInsertWaitcntsPass?)

1087

No, that is about the superfluous s_mov instructions that @critson noticed. I fixed that in a patch on top of this one (I’ll put that on Phabricator after removing the inline assembly).

arsenm added inline comments.Feb 11 2021, 7:29 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1065–1071

It doesn't need to be removed; it can just be emitted as a comment.

sebastian-ne marked 2 inline comments as done.

Get rid of inline assembly with implicit use and FAKE_DEF pseudo (yay).

arsenm added inline comments.Feb 11 2021, 7:59 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
117–118

I think a better name would be something like COPY_INACTIVE_LANES?

sebastian-ne marked an inline comment as done.

Rename FAKE_DEF pseudo to COPY_INACTIVE_LANES.

sebastian-ne added inline comments.Feb 11 2021, 8:36 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
117–118

Hm, it doesn’t really copy anything. (Also, the VGPR could be dead in other lanes as well.) How about DEF_INACTIVE_LANES?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1029

I added a comment with generated instructions in D96517 (SIRegisterInfo.cpp:L1449).

Direct link (will cease to work when the review is updated): https://reviews.llvm.org/D96517#C2404245NL1449

arsenm added inline comments.Feb 11 2021, 1:10 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
114

Replace make with mark, or remove as?

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Why doesn't this use the normal emergency stack slot?

sebastian-ne added inline comments.Feb 12 2021, 1:49 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

How do I get the emergency stack slot?

Fix crash when emitting ISA (change COPY_INACTIVE_LANES to a real instruction).

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

We currently unconditionally reserve one VGPR for SGPR spills. I'm working on changing this so that we have the option of reserving a variable number of VGPRs based on some register pressure threshold. Spilling SGPRs to memory should be a last resort anyway, and I've seen the issue raised in this patch multiple times. It's worth having something less broken when we run out of lanes in reserved VGPRs.

We don't need a pressure heuristic to decide to reserve VGPRs ahead of time; we can just split the allocation process as in D55301.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

You don't get it explicitly; it is just what happens automatically when you attempt to use the scavenger and it fails to find a free register. It's possible we would need to add SGPR spills as one of the conditions where it will be necessary.

sebastian-ne marked 2 inline comments as done.Feb 25 2021, 4:49 AM
sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Thanks for the explanation. The RegScavenger won’t spill all lanes though. Also, it won’t spill if the register is dead in the currently active lanes (which we want to fix here).
So, I don’t think using the scavenger works, unless we can tell the RegScavenger to spill the whole wave, and lower that to spill – flip exec – spill – flip exec again.

arsenm added inline comments.Feb 25 2021, 5:25 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Even if you don't use its mechanism, it still has the emergency slot available in the function frame that you can re-use.

sebastian-ne added inline comments.Feb 25 2021, 6:07 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Yes, it would be nice to re-use that, but I don’t see a way to get an emergency slot from the RegScavenger.
We could save the emergency slot in SIFrameLowering when allocating the slot, but we cannot check whether it is unused when we need it to spill an SGPR. Unconditionally using it can overwrite a saved register.

arsenm added inline comments.Feb 25 2021, 6:13 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

RegScavenger::getScavengingFrameIndices reports its emergency slots. It shouldn't be live in any context where you would need it, as that would defeat the point.
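
A sketch of reusing those slots (getScavengingFrameIndices is the
RegScavenger API named above; the fallback and the Size/Alignment
values are assumptions):

  // Reuse the scavenger's emergency slot if one exists; otherwise make one.
  SmallVector<int, 2> ScavengeFIs;
  RS->getScavengingFrameIndices(ScavengeFIs);
  int FI = ScavengeFIs.empty()
               ? MFI.CreateSpillStackObject(Size, Alignment)
               : ScavengeFIs.front();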

Use emergency spill slot to save VGPR if there is one.

If there is none (SILowerSGPRSpills runs before PrologEpilogInserter, which creates the emergency slot), create one.

I think it doesn’t really work, because the PrologEpilogInserter gets a different RegScavenger than the one we have in SILowerSGPRSpills, so the slot will still not be shared.
Maybe save the created slot in SIFrameLowering, so it can be used in SIFrameLowering::processFunctionBeforeFrameFinalized?

arsenm added inline comments.Feb 25 2021, 2:57 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1297–1299 (On Diff #326399)

I don't understand why you would need to check this.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1101

I believe you don't need to split this when using scratch instructions.

1118–1119

I don't follow this UseKillFromMI vs. isKill. Just use the isKill?

1316–1322

I don't really like adding this here on demand. You need to be sure this is called after frame finalization. This should be created up front.

1327

There shouldn't be any temporary reg scavenger created locally. Also, using forward scavenging is deprecated.

1438

Ditto

sebastian-ne marked an inline comment as not done.

Rewrite code around reusing the emergency spill slot. I hope it looks better that way.
It now works like this (step 2 is sketched after the list):

  1. SILowerSGPRSpills calls SIRegisterInfo::spillSGPR
  2. SIRegisterInfo::spillSGPR calls SIMachineFunctionInfo::getScavengeFI, which allocates a stack slot and saves it in SIMachineFunctionInfo::ScavengeFI
  3. Later, in the PrologEpilogInserter, SIFrameLowering::processFunctionBeforeFrameFinalized reuses the stack slot from SIMachineFunctionInfo::ScavengeFI, or creates a new one if there is none (by calling SIMachineFunctionInfo::getScavengeFI)
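
A minimal sketch of step 2's lazily created slot (modeled on the
description above; the patch's exact signature and members may differ):

  // SIMachineFunctionInfo: hand out one shared scavenge slot per function.
  int SIMachineFunctionInfo::getScavengeFI(MachineFrameInfo &MFI,
                                           const SIRegisterInfo &TRI) {
    if (!ScavengeFI) // Optional<int> member, unset until the first request
      ScavengeFI = MFI.CreateStackObject(
          TRI.getSpillSize(AMDGPU::SGPR_32RegClass),
          TRI.getSpillAlign(AMDGPU::SGPR_32RegClass), /*isSpillSlot=*/false);
    return *ScavengeFI;
  }
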
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1101

With scratch instructions, the code looks like this:

scratch_store_dword_saddr v0, s33, …
s_not_b32 exec_lo, exec_lo
scratch_store_dword_saddr v0, s33, …
s_not_b32 exec_lo, exec_lo

scratch instructions obey the exec mask, so I don’t think we can fuse this.

1118–1119

Fixed, should be more obvious now.

1327

I fixed that in D96517. If it should be part of this patch, I can move it.

arsenm added inline comments.Mar 11 2021, 6:10 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1318–1319

Still have a temporary scavenger here. Also, this should use the reverse iteration method.

arsenm requested changes to this revision.Mar 30 2021, 3:31 PM
This revision now requires changes to proceed.Mar 30 2021, 3:31 PM
t-tye added a comment.Apr 4 2021, 10:01 AM

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

That would make sense, but it seems the s_scratch instructions were removed in newer hardware.

The scalar cache is not coherent with the vector cache, so how would s_scratch be used? It seems it would need explicit invalidation of both the vector and scalar caches. The scalar cache is also a writeback cache, so it would need to be explicitly written back to avoid clobbering memory. I also believe the scalar writes were removed in recent hardware.

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

Are you sure exp_cnt does what you describe? In older hardware, exp_cnt was used to ensure input registers had been consumed by an instruction, but that is no longer true, as the hardware now has interlocks that make exp_cnt unnecessary for this purpose (although there are hazards in some multi-dword cases).

The other wait_cnt counters act to indicate whether the memory operation is visible. But the hardware ensures single-location coherence per thread, so why must this be waited on?

sebastian-ne edited the summary of this revision. (Show Details)Apr 7 2021, 6:30 AM
sebastian-ne marked 2 inline comments as done.
sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

The test checks GFX6, does that count as old hardware? :)

Merged with D96517 and rewritten.
I hope the new version is easier to understand and creates better code.

I tried using scavengeRegisterBackwards, but it turned out that the RegScavenger is in forward mode, so we would need to switch back and forth. Also, scavenging backwards does not necessarily coincide with the liveness information, which was the main point of using the scavenger here.

The largest performance hit of this change is the s_waitcnt after restoring the temporary VGPR.
We do need to add a use of the load somewhere; otherwise it can be eliminated. I tried marking the load as volatile, which prevents it from being removed, but that also adds an s_waitcnt straight after the load.

Fix spilling when no SGPR can be scavenged to save exec. Storing the VGPR while it holds the SGPRs needs to be done unconditionally, for both active and inactive lanes.
Also add a test case for this case.

Remove now-unused code from a previous revision.

arsenm accepted this revision.Apr 9 2021, 2:08 PM

LGTM, although we should look into updating the MFI serialization.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
483–485

Should also add this to the serialized MachineFunctionInfo. This may be a separate patch, because I'm not sure we are correctly serializing any frame indexes right now, so some new infrastructure changes may be required.

This revision is now accepted and ready to land.Apr 9 2021, 2:08 PM
This revision was landed with ongoing or failed builds.Apr 12 2021, 2:12 AM
This revision was automatically updated to reflect the committed changes.