This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Save VGPR of whole wave when spilling
ClosedPublic

Authored by sebastian-ne on Feb 9 2021, 6:18 AM.

Details

Summary

Spilling SGPRs to scratch uses a temporary VGPR. LLVM currently cannot
determine whether a VGPR is used in other lanes, so we need to save all
lanes of the VGPR, even when it is marked as dead.

The generated code depends on two things:

  • Can we scavenge an SGPR to save EXEC?
  • Can we scavenge a VGPR?

If we can scavenge an SGPR, we (see the sketch after this list)

  • save EXEC into the SGPR
  • set the needed lane mask
  • save the temporary VGPR
  • write the spilled SGPR into VGPR lanes
  • store the VGPR again, this time to the target stack slot
  • restore the VGPR
  • restore EXEC
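
For illustration, a minimal sketch of how this path could be emitted in
SIRegisterInfo.cpp (wave64 assumed; buildVGPRSpillLoadStore is a
hypothetical wrapper around the real buildSpillLoadStore, and the
register and frame-index variables are placeholders from context, not
the patch's exact code):

  // Sketch: an SGPR pair was scavenged, so EXEC can be saved and restored.
  const DebugLoc &DL = MI->getDebugLoc();
  // s_mov_b64 s[N:N+1], exec        ; save EXEC into the scavenged SGPRs
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), SavedExecReg)
      .addReg(AMDGPU::EXEC);
  // s_mov_b64 exec, VGPRLanes       ; enable only the lanes we spill to
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
      .addImm(VGPRLanes);
  // buffer_store_dword vT, ...      ; save the temporary VGPR
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TmpFI, /*IsLoad=*/false);
  // v_writelane_b32 vT, sS, Lane    ; write the spilled SGPR into a lane
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::V_WRITELANE_B32), TmpVGPR)
      .addReg(SpilledSGPR)
      .addImm(Lane)
      .addReg(TmpVGPR); // tied input; other lanes keep their values
  // buffer_store_dword vT, ...      ; store to the target stack slot
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TargetFI, /*IsLoad=*/false);
  // buffer_load_dword vT, ...       ; restore the temporary VGPR
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, TmpFI, /*IsLoad=*/true);
  // s_mov_b64 exec, s[N:N+1]        ; restore EXEC
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
      .addReg(SavedExecReg, RegState::Kill);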

If we were not able to scavenge an SGPR, we do the same operations, but
every time the temporary VGPR is written to memory, we (see the sketch
after this list)

  • write VGPR to memory
  • flip exec (s_not exec, exec)
  • write VGPR again (previously inactive lanes)
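
A matching sketch for the fallback store (same caveats as the sketch
above; both stores use the same frame index, since each lane writes to
its own per-lane slot):

  // Sketch: no SGPR is available, so cover all lanes by flipping EXEC.
  // buffer_store_dword vT, ...      ; store the currently active lanes
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, FI, /*IsLoad=*/false);
  // s_not_b64 exec, exec            ; switch to the inactive lanes
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_NOT_B64), AMDGPU::EXEC)
      .addReg(AMDGPU::EXEC);
  // buffer_store_dword vT, ...      ; store the previously inactive lanes
  buildVGPRSpillLoadStore(MBB, MI, TmpVGPR, FI, /*IsLoad=*/false);
  // s_not_b64 exec, exec            ; restore the original EXEC
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_NOT_B64), AMDGPU::EXEC)
      .addReg(AMDGPU::EXEC);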

Surprisingly often, we are able to scavenge an SGPR, even though we are
on the brink of running out of SGPRs.
Scavenging a VGPR does not have a great effect (it saves three
instructions if no SGPR was scavenged), but we need to know whether the
VGPR we use was live before; otherwise the machine verifier complains.

Diff Detail

Event Timeline

sebastian-ne created this revision.Feb 9 2021, 6:18 AM
sebastian-ne requested review of this revision.Feb 9 2021, 6:18 AM
Herald added a project: Restricted Project.Feb 9 2021, 6:18 AM
foad added inline comments.Feb 9 2021, 8:23 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

If the scavenger finds a vgpr that it thinks is dead, would that mean we only have to save the inactive lanes?

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

What causes this change?

sebastian-ne added inline comments.Feb 9 2021, 8:36 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

Yes, we can do that.

If you don’t mind, I’ll put that in a later patch. I’m currently preparing a patch on top of this one to get rid of the code duplication here.
That should also allow us to get rid of the temporary RegScavenger below.

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

Above these tested lines, the VGPR gets saved to scratch with a buffer_store_dword.
The same VGPR is the destination of the buffer_load_dword below, so waiting for expcnt(0) makes sure we do not overwrite it before the store has happened (the docs say expcnt waits until writes reach the last-level cache, so I guess the store→load dependency is the reason).

piotr added a subscriber: piotr.Feb 9 2021, 8:52 AM
piotr added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1081

Not sure if that's a concern in this context, but doesn't it potentially clobber SCC?

foad added inline comments.Feb 9 2021, 9:05 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431

If you don’t mind, I’ll put that in a later patch.

Sure. I was just checking my understanding.

I am a little dubious about the whole approach.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?
If it is truly rare, then I guess it matters a lot less how performant the resulting code is.

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1081

Clobbering SCC is an ongoing concern with spill code; however, buildSpillLoadStore can already generate arithmetic instructions which can clobber SCC, so this is not a new concern.
It's possible that if there is an issue, we are going to run into it faster, as this will clobber SCC every time.
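
For instance (an illustrative case, not code from this patch),
materializing a large frame offset takes a scalar add, and scalar ALU
instructions implicitly define SCC:

  // S_ADD_I32 implicitly defines SCC, clobbering whatever was live in it.
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_ADD_I32), SOffsetReg)
      .addReg(FrameReg)
      .addImm(Offset);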

1094

Any reason for XOR rather than NOT?

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll
1109

These two instructions are not doing anything.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

We currently unconditionally reserve one VGPR for SGPR spills. I'm working on changing this so that we have the option of reserving a variable number of VGPRs based on some register pressure threshold. Spilling SGPRs to memory should be a last resort anyway, and I've seen the issue raised in this patch multiple times. It's worth having something less broken when we run out of lanes in reserved VGPRs.

I am a little dubious about the whole approach.

Me too, I’m also not happy about needing inline assembly, so if you have an idea to improve some or all of that, I’m all ears.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

Yes. If we knew that we need to spill an SGPR, we would just reserve a VGPR to spill the SGPR to and not spill to scratch at all. The problem is, we don’t know. (Matt plans to fix that by splitting register allocation into two phases, first allocating SGPRs, then VGPRs.)

I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?

I fear it’s less rare than we would like. We hit this bug in a not-so-big shader that was forced to run at high occupancy and was therefore limited to 64 VGPRs. However, it should become rare once register allocation always spills SGPRs to VGPRs.

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

That would make sense, but it seems the s_scratch instructions were removed in newer hardware.

We currently unconditionally reserve one VGPR for SGPR spills.

Interesting, I missed that. As the VGPR is reserved in SITargetLowering::finalizeLowering, this is currently not done for GlobalISel?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1094

No reason, I’ll change that. Thanks for the note.

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll
1109

Right, I’m working on fixing that in a later patch, same as Jay’s optimization.

Use s_not instead of s_xor.

  • Fake use inline asm should have sideeffects
  • Use SGPRs to save EXEC only if IsKill when restoring VGPR
  • Update spill-special-sgpr.mir test
arsenm added inline comments.Feb 11 2021, 6:29 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1029

Can you spell out the expected instruction sequence in the comment?

1065–1071

Definitely should not be introducing inline asm. Can't you just add an implicit def on the first instruction in the sequence, or introduce a special purpose pseudo?
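
For reference, a minimal sketch of the implicit-operand variant
(FirstMI and LastMI stand for the boundary instructions of the
expansion and are placeholders, not names from the patch):

  MachineFunction &MF = *MBB.getParent();
  // An implicit def on the first instruction tells the verifier that the
  // whole temporary VGPR is (re)defined here, so no stale liveness is read.
  FirstMI.addOperand(MF, MachineOperand::CreateReg(TmpVGPR, /*isDef=*/true,
                                                   /*isImp=*/true));
  // An implicit use on the last instruction keeps the restoring load alive.
  LastMI.addOperand(MF, MachineOperand::CreateReg(TmpVGPR, /*isDef=*/false,
                                                  /*isImp=*/true));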

1087

Do you mean identity copies?

1111

Can't you just add this as an implicit use to the last instruction in the sequence?

sebastian-ne added inline comments.Feb 11 2021, 7:22 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1065–1071

That sounds better, thanks. For stores, the first instruction is a buffer_store of the VGPR, …, so there is no previous instruction I can add a def to.
For a pseudo, what would be the best pass to remove it again? (Maybe SIInsertWaitcntsPass?)

1087

No, that is about the superfluous s_mov instructions that @critson noticed. I fixed that in a patch on top of this one (I’ll put that on Phabricator after removing the inline assembly).

arsenm added inline comments.Feb 11 2021, 7:29 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1065–1071

It doesn't need to be removed; it can just be emitted as a comment.

sebastian-ne marked 2 inline comments as done.

Get rid of inline assembly with implicit use and FAKE_DEF pseudo (yay).

arsenm added inline comments.Feb 11 2021, 7:59 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
117–118

I think a better name would be something like COPY_INACTIVE_LANES?

sebastian-ne marked an inline comment as done.

Rename FAKE_DEF pseudo to COPY_INACTIVE_LANES.

sebastian-ne added inline comments.Feb 11 2021, 8:36 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
117–118

Hm, it doesn’t really copy anything. (Also, the VGPR could be dead in other lanes as well.) How about DEF_INACTIVE_LANES?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1029

I added a comment with generated instructions in D96517 (SIRegisterInfo.cpp:L1449).

Direct link (will cease to work when the review is updated): https://reviews.llvm.org/D96517#C2404245NL1449

arsenm added inline comments.Feb 11 2021, 1:10 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
114

Replace make with mark, or remove as?

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Why doesn't this use the normal emergency stack slot?

sebastian-ne added inline comments.Feb 12 2021, 1:49 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

How do I get the emergency stack slot?

Fix crash when emitting ISA (change COPY_INACTIVE_LANES to a real instruction).

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

We currently unconditionally reserve one VGPR for SGPR spills. I'm working on changing this so that we have the option of reserving a variable number of VGPRs based on some register pressure threshold. Spilling SGPRs to memory should be a last resort anyway, and I've seen the issue raised in this patch multiple times. It's worth having something less broken when we run out of lanes in reserved VGPRs.

We don't need a pressure heuristic to decide to reserve VGPRs ahead of time; we can just split the allocation process as in D55301.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

You don't get it explicitly; it is just what happens automatically when you attempt to use the scavenger and it fails to find a free register. It's possible we would need to add SGPR spills as one of the conditions where it will be necessary.

sebastian-ne marked 2 inline comments as done.Feb 25 2021, 4:49 AM
sebastian-ne added inline comments.
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Thanks for the explanation. The RegScavenger won’t spill all lanes though. Also, it won’t spill if the register is dead in the currently active lanes (which we want to fix here).
So, I don’t think using the scavenger works, unless we can tell the RegScavenger to spill the whole wave, and lower that to spill – flip exec – spill – flip exec again.

arsenm added inline comments.Feb 25 2021, 5:25 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Even if you don't use its mechanism, it still has the emergency slot available in the function frame that you can re-use.

sebastian-ne added inline comments.Feb 25 2021, 6:07 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

Yes, it would be nice to re-use that, but I don’t see a way to get an emergency slot from the RegScavenger.
We could save the emergency slot in SIFrameLowering when allocating the slot, but we cannot check whether it is unused when we need it to spill an SGPR. Unconditionally using it can overwrite a saved register.

arsenm added inline comments.Feb 25 2021, 6:13 AM
llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
494–495

RegScavenger::getScavengingFrameIndices reports its emergency slots. It shouldn't be live in any context where you would need it, as that would defeat the point.
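
A sketch of reusing those slots (getScavengingFrameIndices is the
RegScavenger API named above; the fallback and the Size/Alignment
values are assumptions):

  // Reuse the scavenger's emergency slot if one exists; otherwise make one.
  SmallVector<int, 2> ScavengeFIs;
  RS->getScavengingFrameIndices(ScavengeFIs);
  int FI = ScavengeFIs.empty()
               ? MFI.CreateSpillStackObject(Size, Alignment)
               : ScavengeFIs.front();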

Use emergency spill slot to save VGPR if there is one.

If there is none (SILowerSGPRSpills runs before PrologEpilogInserter, which creates the emergency slot), create one.

I think it doesn’t really work, because the PrologEpilogInserter gets a different RegScavenger than the one we have in SILowerSGPRSpills, so the slot will still not be shared.
Maybe save the created slot in SIFrameLowering, so it can be used in SIFrameLowering::processFunctionBeforeFrameFinalized?

arsenm added inline comments.Feb 25 2021, 2:57 PM
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
1297–1299 (On Diff #326399)

I don't understand why you would need to check this.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1101

I believe you don't need to split this when using scratch instructions.

1118–1119

I don't follow this UseKillFromMI vs. isKill. Just use the isKill?

1316–1322

I don't really like adding this here on demand. You need to be sure this is called after frame finalization. This should be created up front.

1327

There shouldn't be any temporary reg scavenger created locally. Also, using forward scavenging is deprecated.

1438

Ditto

sebastian-ne marked an inline comment as not done.

Rewrite code around reusing the emergency spill slot. I hope it looks better that way.
It now works like this (step 2 is sketched after the list):

  1. SILowerSGPRSpills calls SIRegisterInfo::spillSGPR
  2. SIRegisterInfo::spillSGPR calls SIMachineFunctionInfo::getScavengeFI, which allocates a stack slot and saves it in SIMachineFunctionInfo::ScavengeFI
  3. Later, in the PrologEpilogInserter, SIFrameLowering::processFunctionBeforeFrameFinalized reuses the stack slot from SIMachineFunctionInfo::ScavengeFI, or creates a new one if there is none (by calling SIMachineFunctionInfo::getScavengeFI)
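
A minimal sketch of step 2's lazily created slot (modeled on the
description above; the patch's exact signature and members may differ):

  // SIMachineFunctionInfo: hand out one shared scavenge slot per function.
  int SIMachineFunctionInfo::getScavengeFI(MachineFrameInfo &MFI,
                                           const SIRegisterInfo &TRI) {
    if (!ScavengeFI) // Optional<int> member, unset until the first request
      ScavengeFI = MFI.CreateStackObject(
          TRI.getSpillSize(AMDGPU::SGPR_32RegClass),
          TRI.getSpillAlign(AMDGPU::SGPR_32RegClass), /*isSpillSlot=*/false);
    return *ScavengeFI;
  }
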
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1101

With scratch instructions, the code looks like this:

scratch_store_dword_saddr v0, s33, …
s_not_b32 exec_lo, exec_lo
scratch_store_dword_saddr v0, s33, …
s_not_b32 exec_lo, exec_lo

scratch instructions obey the exec mask, so I don’t think we can fuse this.

1118–1119

Fixed, should be more obvious now.

1327

I fixed that in D96517. If it should be part of this patch, I can move it.

arsenm added inline comments.Mar 11 2021, 6:10 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1318–1319

Still have a temporary scavenger here. Also, this should use the reverse iteration method.

arsenm requested changes to this revision.Mar 30 2021, 3:31 PM
This revision now requires changes to proceed.Mar 30 2021, 3:31 PM
t-tye added a comment.Apr 4 2021, 10:01 AM

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

That would make sense, but it seems the s_scratch instructions were removed in newer hardware.

The scalar cache is not coherent with the vector cache, so how would s_scratch be used? It seems it would need explicit invalidation of both the vector and scalar caches. The scalar cache is also a writeback cache, so it would need to be explicitly written back to avoid clobbering memory. I also believe the scalar writes were removed in recent hardware.

llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

Are you sure exp_cnt does what you describe? In older hardware, exp_cnt was used to ensure input registers had been consumed by an instruction, but that is no longer true, as the hardware now has interlocks that make exp_cnt unnecessary for this purpose (although there are hazards in some multi-dword cases).

The other wait_cnt counters act to indicate whether the memory operation is visible. But the hardware ensures single-location coherence per thread, so why must this be waited on?

sebastian-ne edited the summary of this revision. (Show Details)Apr 7 2021, 6:30 AM
sebastian-ne marked 2 inline comments as done.
sebastian-ne added inline comments.
llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
49

The test checks GFX6, does that count as old hardware? :)

Merged with D96517 and rewritten.
I hope the new version is easier to understand and creates better code.

I tried using scavengeRegisterBackwards, but it turned out that the RegScavenger is in forward mode, so we would need to switch back and forth. Also, scavenging backwards does not necessarily coincide with the liveness information, which was the main point of using the scavenger here.

The largest performance hit of this change is the s_waitcnt after restoring the temporary VGPR.
We do need to add a use of the load somewhere; otherwise it can be eliminated. I tried marking the load as volatile, which prevents it from being removed, but that also adds an s_waitcnt straight after the load.

Fix spilling when no SGPR can be scavenged to save exec. Storing the VGPR while it holds the SGPRs needs to be done unconditionally, for both active and inactive lanes.
Also add a test case for this case.

Remove now-unused code from a previous revision.

arsenm accepted this revision.Apr 9 2021, 2:08 PM

LGTM, although we should look into updating the MFI serialization.

llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.h
483–485

Should also add this to the serialized MachineFunctionInfo. This may be a separate patch, because I'm not sure we are correctly serializing any frame indexes right now, so some new infrastructure changes may be required.

This revision is now accepted and ready to land.Apr 9 2021, 2:08 PM
This revision was landed with ongoing or failed builds.Apr 12 2021, 2:12 AM
This revision was automatically updated to reflect the committed changes.