This is an archive of the discontinued LLVM Phabricator instance.

Differential D96980

[amdgpu] Revert agnostic SGPR spill.
Abandoned · Public

Authored by hliao on Feb 18 2021, 11:04 AM.

Details

Reviewers

critson
rampitec
arsenm
tpr
foad
sameerds
sebastian-ne

Summary

With that explicit exec mask manipulation, we may clobber global VGPRs during SGPR spilling or reloading. For instance, the following pseudocode illustrates such clobbering:

v[62:63] = def(...);
spill(v[62:63], stack.0);
for (loop.cond) {
  ... = use(v[62:63]);
  if (branch.cond) {
    ... reuse of v[62:63];
    ... SGPR reload through v62 or v63;
    v[62:63] = reload(stack.0);
    // At this point, v[62:63] is clobbered if branch.cond doesn't
    // cover lanes 0 and 1.
  }
}
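
The clobbering in the pseudocode above can be replayed with a small lane-level model. This is only a sketch under stated assumptions: the wave has 64 lanes, the VGPR spill before the loop runs with all lanes enabled, and the SGPR reload forces exec to lanes 0 and 1 (mask 0x3). The names `maskedLoad`, `clobberedLanes`, and the slot contents are hypothetical and do not come from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned WaveSize = 64;
using VGPR = std::array<uint32_t, WaveSize>;

// Lane-masked copy: only lanes enabled in Exec are written, which is how a
// per-lane scratch load behaves.
void maskedLoad(VGPR &Dst, const VGPR &Src, uint64_t Exec) {
  for (unsigned Lane = 0; Lane < WaveSize; ++Lane)
    if (Exec & (uint64_t{1} << Lane))
      Dst[Lane] = Src[Lane];
}

// Replays the body of the branch for one VGPR and returns a bitmask of the
// lanes whose value no longer matches the one defined before the loop.
uint64_t clobberedLanes(uint64_t BranchExec) {
  VGPR Original{}, SGPRSlot{};
  for (unsigned Lane = 0; Lane < WaveSize; ++Lane) {
    Original[Lane] = 1000 + Lane;        // v62 as defined before the loop
    SGPRSlot[Lane] = 0xDEAD0000u + Lane; // hypothetical SGPR spill slot data
  }
  VGPR V62 = Original;
  const VGPR Stack0 = Original; // spill(v62, stack.0), assumed with all lanes on

  maskedLoad(V62, SGPRSlot, 0x3);      // SGPR reload: exec forced to lanes 0-1
  maskedLoad(V62, Stack0, BranchExec); // v62 = reload(stack.0) under the branch

  uint64_t Clobbered = 0;
  for (unsigned Lane = 0; Lane < WaveSize; ++Lane)
    if (V62[Lane] != Original[Lane])
      Clobbered |= uint64_t{1} << Lane;
  return Clobbered;
}
```

Calling `clobberedLanes` with a branch mask that excludes lanes 0 and 1 reports exactly those two lanes as damaged, while a mask that covers them reports none.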

For the concerns in the original patch, we should not worry about different exec masks between SGPR spills and reloads. As the IR is deSSA-ed from the original SSA form, we guarantee that a def always dominates all its uses, and thus the point spilling a value always dominates the point where that value is reloaded. The exec mask at the reloading point is guaranteed to be a subset of the exec mask at the spilling point. As long as the SGPR is broadcast to a VGPR at the spilling point and v_readfirstlane is used to load the SGPR from the VGPR at the reloading point, the original SGPR value is always reloaded correctly under that exec mask.
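
The subset argument can be illustrated with a minimal lane model. It is a sketch, not the lowering itself: `broadcast` stands in for a v_mov_b32 from an SGPR, `readFirstLane` for v_readfirstlane_b32, and `Junk` is a hypothetical marker for whatever an unwritten lane happens to hold. The scratch store and load in between are assumed to preserve the written lanes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned WaveSize = 64;
constexpr uint32_t Junk = 0xBAD0BAD0u; // stand-in for an unwritten lane
using VGPR = std::array<uint32_t, WaveSize>;

// Model of v_mov_b32 with an SGPR source: writes Value to every lane
// enabled in Exec and leaves the other lanes untouched.
void broadcast(VGPR &Dst, uint32_t Value, uint64_t Exec) {
  for (unsigned Lane = 0; Lane < WaveSize; ++Lane)
    if (Exec & (uint64_t{1} << Lane))
      Dst[Lane] = Value;
}

// Model of v_readfirstlane_b32: reads the lowest lane enabled in Exec.
uint32_t readFirstLane(const VGPR &Src, uint64_t Exec) {
  for (unsigned Lane = 0; Lane < WaveSize; ++Lane)
    if (Exec & (uint64_t{1} << Lane))
      return Src[Lane];
  return Src[0]; // exec == 0: lane 0 is read
}

// Spill under SpillExec, reload under ReloadExec.
uint32_t spillThenReload(uint32_t SGPRValue, uint64_t SpillExec,
                         uint64_t ReloadExec) {
  VGPR Tmp;
  Tmp.fill(Junk);
  broadcast(Tmp, SGPRValue, SpillExec);
  return readFirstLane(Tmp, ReloadExec);
}
```

When `ReloadExec` is a non-empty subset of `SpillExec`, the first active lane at the reload was written at the spill, so the value comes back intact. If the reload mask contains a lower lane that was inactive at the spill, the model returns `Junk`.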

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	1,290 ms	x64 debian > UBSan-AddressSanitizer-lld-x86_64.TestCases/Misc::coverage-levels.cpp
	2,400 ms	x64 debian > UBSan-AddressSanitizer-x86_64.TestCases/Misc::coverage-levels.cpp
	1,050 ms	x64 debian > UBSan-MemorySanitizer-lld-x86_64.TestCases/Misc::coverage-levels.cpp
	2,250 ms	x64 debian > UBSan-MemorySanitizer-x86_64.TestCases/Misc::coverage-levels.cpp
	970 ms	x64 debian > UBSan-Standalone-lld-x86_64.TestCases/Misc::coverage-levels.cpp
		View Full Test Results (6 Failed)

Event Timeline

hliao created this revision. Feb 18 2021, 11:04 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. Feb 18 2021, 11:04 AM

hliao requested review of this revision. Feb 18 2021, 11:04 AM

Herald added a project: Restricted Project. Feb 18 2021, 11:04 AM

Herald added subscribers: llvm-commits, wdng.

hliao edited the summary of this revision. Feb 18 2021, 11:11 AM

Harbormaster completed remote builds in B89761: Diff 324709. Feb 18 2021, 12:07 PM

Conflicts with D96336/D96517

In D96980#2572918, @arsenm wrote:

Conflicts with D96336/D96517

Just had a brief look at those two changes; they seem to always manipulate the exec mask explicitly and should have the same issue of clobbering global VGPRs. We should not change the current exec mask if we need a temp VGPR during SGPR spilling. It is quite risky, since a global VGPR may be clobbered.

In D96980#2572918, @arsenm wrote:

Conflicts with D96336/D96517

Just had a detailed look at those two. Shall we optimize the case where 1 or 2 SGPRs need spilling/reloading and there's a scavenged register available? For that case, we could just use the original implementation (based on broadcast and v_readfirstlane). In terms of LD/ST, the original is comparable to the proposed one, but we have much less code and also remove the dependency on the restore of that store of the whole tmp VGPR. HPC workloads mostly spill 1 or 2 SGPRs.

HPC workloads mostly spill 1 or 2 SGPRs.

Can you explain this a bit more? Spilling SGPRs to memory is supposed to be a very rare case. Why is it common for HPC workloads? Is there a better way to fix it?

Is it at all possible to encapsulate your failure case in a lit test?

Is the problem not with the exec manipulation, but really that the register scavenger choosing an inappropriate register, w.r.t. wave level CFG?

In D96980#2574001, @critson wrote:

Is it at all possible to encapsulate your failure case in a lit test?

Is the problem not with the exec manipulation, but really that the register scavenger choosing an inappropriate register, w.r.t. wave level CFG?

The register scavenger may be improved to pick a better candidate without live values in the inactive lanes. But we should also handle the cases where such a candidate is not available at all. The issue comes exactly from the explicit exec mask manipulation, which doesn't honor the current exec mask and tries to access the inactive lanes. Without the inactive lanes being protected, we would always have the issue.

In D96980#2573982, @foad wrote:

HPC workloads mostly spill 1 or 2 SGPRs.

Can you explain this a bit more? Spilling SGPRs to memory is supposed to be a very rare case. Why is it common for HPC workloads? Is there a better way to fix it?

Pointers are extensively used in most HPC workloads, where double is also frequently used. There are lots of uniform 64-bit values, either pointers or doubles.

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.
exec=0 is a corner case, but I don’t think we can build on that.

Other than that, it surely is more efficient than the (worst case) 4 memory operations introduced by saving all lanes of the used VGPR :)
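
The exec = 0 corner case comes down to which lane v_readfirstlane selects. The sketch below models only that selection; `firstLaneRead` and `reloadSeesSpilledValue` are hypothetical helper names, and the only hardware behaviour assumed is the one described in this thread: with an empty exec mask, lane 0 is read.

```cpp
#include <cassert>
#include <cstdint>

// Lane selected by v_readfirstlane_b32: the lowest set bit of Exec, or
// lane 0 when Exec is zero.
unsigned firstLaneRead(uint64_t Exec) {
  if (Exec == 0)
    return 0;
  unsigned Lane = 0;
  while (!(Exec & (uint64_t{1} << Lane)))
    ++Lane;
  return Lane;
}

// True if the lane read at reload time was written by the broadcast at
// spill time, i.e. if the reload observes the spilled SGPR value.
bool reloadSeesSpilledValue(uint64_t SpillExec, uint64_t ReloadExec) {
  return (SpillExec >> firstLaneRead(ReloadExec)) & 1;
}
```

With both masks at zero the broadcast writes nothing and the reload reads lane 0, so the check fails. It also fails when exec becomes zero only after the spill, unless lane 0 happened to be active at the spill.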

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.
exec=0 is a corner case, but I don’t think we can build on that.

Other than that, it surely is more efficient than the (worst case) 4 memory operations introduced by saving all lanes of the used VGPR :)

That's the part I don't understand. Why is the code path still executed when the exec mask is 0? For regular code generated by the compiler, an exec mask of 0 always results in branching away from that code path. There's no chance to execute it at all. Could you elaborate on how a code path could be executed with exec mask 0?

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

In D96980#2575164, @hliao wrote:

That's the part I don't understand. Why is the code path still executed when the exec mask is 0? For regular code generated by the compiler, an exec mask of 0 always results in branching away from that code path. There's no chance to execute it at all. Could you elaborate on how a code path could be executed with exec mask 0?

SIRemoveShortExecBranches.cpp is one source of executing instructions when exec == 0. I am not sure if there are any others.

In D96980#2575930, @ruiling wrote:

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

In D96980#2576062, @arsenm wrote:

In D96980#2575930, @ruiling wrote:

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I realized SIRemoveShortExecBranches.cpp has done correctness checks. So removing SIInsertSkips should work if we make sure there will always be a branching instruction for each control flow change.

In D96980#2576128, @ruiling wrote:

In D96980#2576062, @arsenm wrote:

In D96980#2575930, @ruiling wrote:

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I realized SIRemoveShortExecBranches.cpp has done correctness checks. So removing SIInsertSkips should work if we make sure there will always be a branching instruction for each control flow change.

When the exec mask goes to zero, an SGPR spill in the corresponding code path has no effect at all. Since our IR is deSSAed from SSA form, a def always dominates all its uses, so an SGPR spill always dominates its reloads. If that SGPR spill has exec mask 0, all its reloads have exec mask 0 as well. It's simply OK to ignore that spill since, semantically, there is no change in the program state.

In D96980#2576062, @arsenm wrote:

In D96980#2575930, @ruiling wrote:

In D96980#2574913, @sebastian-ne wrote:

I think this approach fails when exec is zero.
The v_mov for the save will be a noop, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

For that elimination, we could choose to skip it if the body of the branch has unwanted effects.

Mark SI_SPILL_<n>_RESTORE having unwanted effect so that they would be executed under exec mask 0.

hliao mentioned this in D96517: [AMDGPU] Optimize SGPR to scratch spilling. Feb 22 2021, 1:46 PM

Harbormaster completed remote builds in B90277: Diff 325557. Feb 22 2021, 3:00 PM

In D96980#2579955, @hliao wrote:

Mark SI_SPILL_<n>_RESTORE having unwanted effect so that they would be executed under exec mask 0.

I assume you mean "[...] so they would not be executed under exec mask 0"?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3316 ↗	(On Diff #325557)	This seems reasonable. It avoids the case where an EXEC=0 region attempts to restore an SGPR for immediate use in the same region. I guess we are assuring ourselves that semantically an EXEC=0 region never restores an SGPR expecting it to be live beyond that region? (And that EXEC=0 regions will never save SGPRs expecting them to have meaningful effects elsewhere.)

ruiling added inline comments. Feb 22 2021, 5:18 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3302–3315 ↗	(On Diff #325557)	I don't think we need this, these instructions have already been lowered to v_readfirstlane when we try to optimize off the skip-jump.

critson added inline comments. Feb 22 2021, 6:58 PM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3302–3315 ↗	(On Diff #325557)	True if these have already been lowered then this code will have no additional effect.

hliao added inline comments. Feb 23 2021, 9:33 AM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3316 ↗	(On Diff #325557)	This seems reasonable. It avoids the case where an EXEC=0 region attempts to restore an SGPR for immediate use in the same region. I guess we are assuring ourselves that semantically an EXEC=0 region never restores an SGPR expecting it to be live beyond that region? (And that EXEC=0 regions will never save SGPRs expecting them to have meaningful effects elsewhere.)

So technically, that means we won't have the EXEC = 0 case at all, right?

Remove that unnecessary change and add a rationale for why that's safe with respect to the original concerns.

In addition, we already heavily use v_readfirstlane in our codegen, since some patterns benefit from using vector instructions when no corresponding scalar instruction is available. I believe it's quite safe: it's guaranteed that v_readfirstlane won't be executed when the exec mask goes to 0.

hliao added a reviewer: sebastian-ne. Feb 24 2021, 7:04 AM

Harbormaster completed remote builds in B90600: Diff 326077. Feb 24 2021, 7:36 AM

Fine with me. It would be nice if someone knowledgeable about LLVM could say whether relying on exec != 0 in MIR works or not.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1233–1234	Shouldn’t this still define SuperReg on the first v_readfirstlane, like before?

LGTM - but please address the SuperReg definition issue raised by Sebastian.

I will test this for graphics and we can address any issues arising in a follow up patch.
(I do not expect any.)

This revision is now accepted and ready to land. Feb 24 2021, 5:04 PM

In D96980#2585083, @sebastian-ne wrote:

Fine with me. It would be nice if someone knowledgeable about LLVM could say whether relying on exec != 0 in MIR works or not.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

Unfortunately I realized and have verified at least one problem with this and WQM.
With WQM the assumption that the EXEC mask for restore of the SGPR is a subset of the spill EXEC mask is not true.
Specifically an SGPR can be saved before entering WQM, then restored in WQM (so the readfirstlane will return junk).
This could potentially be addressed by ensuring the exec mask mode matches between spills and restores in the WQM pass.
(I can add this to WQM pass if needed, but will think a bit more on it first.)

Note: this issue is not necessarily an immediate problem for graphics, because we try to avoid all spills (particularly SGPR spills to memory).
I am explicitly testing with SGPR to VGPR spills disabled and lowered SGPR counts to perturb the problem.
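
The WQM problem follows from how whole quad mode widens the exec mask. The sketch below assumes the usual definition, where every group of four consecutive lanes with at least one active lane becomes fully active; `wholeQuadMask` and `unwrittenLanesInWQM` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Whole quad mode mask: every quad (4 consecutive lanes) that has at least
// one active lane in Exec becomes fully active.
uint64_t wholeQuadMask(uint64_t Exec) {
  uint64_t Result = 0;
  for (unsigned Quad = 0; Quad < 16; ++Quad) {
    const uint64_t QuadBits = uint64_t{0xF} << (Quad * 4);
    if (Exec & QuadBits)
      Result |= QuadBits;
  }
  return Result;
}

// Lanes that are enabled for a reload inside WQM but were not written by a
// spill performed earlier under the exact exec mask.
uint64_t unwrittenLanesInWQM(uint64_t ExactExec) {
  return wholeQuadMask(ExactExec) & ~ExactExec;
}
```

For an exact mask of 0x8 the WQM mask is 0xF, so a v_readfirstlane inside WQM reads lane 0, which the spill never wrote. The reload mask is a superset of the spill mask here, which is the reverse of what the subset argument requires.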

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

In D96980#2587370, @critson wrote:

Unfortunately I realized and have verified at least one problem with this and WQM.
With WQM the assumption that the EXEC mask for restore of the SGPR is a subset of the spill EXEC mask is not true.
Specifically an SGPR can be saved before entering WQM, then restored in WQM (so the readfirstlane will return junk).

Yes, that sounds like a serious problem. But the issue may not be specific to this scenario. We sometimes have to broadcast a uniform value into a VGPR because we only have the V_xxx instruction instead of an S_xxx instruction, and later we may use v_readfirstlane to read the value back into an SGPR for later scalar operations. If one operation happens before WQM while the other happens in WQM, things may also go wrong. I am not quite sure whether this would happen in a real-world case, but it sounds possible to me.

This could potentially be addressed by ensuring the exec mask mode matches between spills and restores in the WQM pass.
(I can add this to WQM pass if needed, but will think a bit more on it first.)

Note: this issue is not necessarily an immediate problem for graphics, because we try to avoid all spills (particularly SGPR spills to memory).

I want to say that the issue this patch tries to solve is a blocking issue for us. That's why @sebastian-ne is also actively working to solve it.

In D96980#2587642, @ruiling wrote:

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

Could you explain a little bit more? Do you mean a transformation that may merge/duplicate blocks so that the predecessors change? What kind of verification do you have in mind to catch the problem?

In D96980#2587870, @arsenm wrote:

In D96980#2587642, @ruiling wrote:

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

I think the EXEC = 0 problem can only happen when there is a conditional branch that can reduce the set of active lanes. An unconditional branch will never be a point that introduces the EXEC = 0 issue. I think MIR transforms should ensure the branch is always kept for conditional branching; transformations are only allowed to delete unconditional branches while keeping the semantics of the program unchanged. I don't see which step may possibly go wrong.

I also want to say that the possible EXEC = 0 problem (if any) is not specific to this spill solution. If there is any such situation, we definitely should fix it elsewhere. A normal program that includes scalar instructions suffering from an "exec=0 side effect" would also have such a problem.

sebastian-ne mentioned this in D96869: [AMDGPU] Fix saving fp and bp. Mar 2 2021, 12:01 AM

As far as I can see, the current state of this is

Carl found a bug when WQM is entered between save and restore
Matt and Stas have concerns about the correctness and want more verification

David claimed that we can handle the first issue (WQM) downstream (although I don’t know how he intends to do this, just hoping this never happens in practice?).
The second issue (more verification) seems to be stuck over the last weeks.

As a faster solution, can we submit D96336 to fix the correctness issue until the remaining issues here are solved?
I can revert D96336 once this patch is ready to go in.

I think I found a way how exec can get zero before a restore (disclaimer: I’m not sure if that is valid code in any of the graphics APIs or if something similar can happen in compute. It should be valid LLVM IR, however).
Imagine the following pseudocode:

function main() {
  // exec = 0xff
  if (<divergent condition>) {
    // exec = 0xf0
    foo();
  }
  // continue doing things
}

function foo() {
  // exec = 0xf0
  <spill s0>

  // Kills all currently active lanes
  // However, more lanes are active outside the call, so we can’t s_endpgm
  llvm.amdgcn.kill(false);

  // exec = 0x00
  // We still need to restore s0 (if it is a callee-save register)
  <restore s0>
}

In D96980#2602610, @sebastian-ne wrote:

I think I found a way how exec can get zero before a restore (disclaimer: I’m not sure if that is valid code in any of the graphics APIs or if something similar can happen in compute. It should be valid LLVM IR, however).
... code example omitted...

I agree this could happen if we use function calls in pixel shaders (PS); however, I don't think we have any plans to do this yet?
(Currently the only expected place for a kill is in PS, although we do support it elsewhere.)

In D96980#2602610, @sebastian-ne wrote:
I think I found a way how exec can get zero before a restore (disclaimer: I’m not sure if that is valid code in any of the graphics APIs or if something similar can happen in compute. It should be valid LLVM IR, however).
Imagine the following pseudocode:
function main() {
  // exec = 0xff
  if (<divergent condition>) {
    // exec = 0xf0
    foo();
  }
  // continue doing things
}

function foo() {
  // exec = 0xf0
  <spill s0>

  // Kills all currently active lanes
  // However, more lanes are active outside the call, so we can’t s_endpgm
  llvm.amdgcn.kill(false);

  // exec = 0x00
  // We still need to restore s0 (if it is a callee-save register)
  <restore s0>
}

We don't use kills on the compute side, and I personally don't believe it can be legal under any high-level language model to do so. There are things like throw and even abort, but those lead to immediate control transfer and are backed by the HW.

hliao mentioned this in D99507: [amdgpu] Add a pass to avoid jump into blocks with 0 exec mask. Mar 29 2021, 7:52 AM

hliao abandoned this revision. May 25 2021, 11:36 AM

Revision Contents

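Path	Size
llvm/lib/Target/AMDGPU/SIRegisterInfo.h	5 lines
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp	231 lines
llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll	62 lines
llvm/test/CodeGen/AMDGPU/frame-setup-without-sgpr-to-vgpr-spills.ll	6 lines
llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll	17 lines
llvm/test/CodeGen/AMDGPU/sgpr-spill.mir
llvm/test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll	8 lines
llvm/test/CodeGen/AMDGPU/spill-m0.ll	7 lines
llvm/test/CodeGen/AMDGPU/spill-special-sgpr.mir	56 lines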

Diff 326077

llvm/lib/Target/AMDGPU/SIRegisterInfo.h

void resolveFrameIndex(MachineInstr &MI, Register BaseReg,
int64_t Offset) const override;		int64_t Offset) const override;

bool isFrameOffsetLegal(const MachineInstr *MI, Register BaseReg,		bool isFrameOffsetLegal(const MachineInstr *MI, Register BaseReg,
int64_t Offset) const override;		int64_t Offset) const override;

const TargetRegisterClass *getPointerRegClass(		const TargetRegisterClass *getPointerRegClass(
const MachineFunction &MF, unsigned Kind = 0) const override;		const MachineFunction &MF, unsigned Kind = 0) const override;

void buildSGPRSpillLoadStore(MachineBasicBlock::iterator MI, int Index,
int Offset, unsigned EltSize, Register VGPR,
int64_t VGPRLanes, RegScavenger *RS,
bool IsLoad) const;

/// If \p OnlyToVGPR is true, this will only succeed if this		/// If \p OnlyToVGPR is true, this will only succeed if this
bool spillSGPR(MachineBasicBlock::iterator MI,		bool spillSGPR(MachineBasicBlock::iterator MI,
int FI, RegScavenger *RS,		int FI, RegScavenger *RS,
bool OnlyToVGPR = false) const;		bool OnlyToVGPR = false) const;

bool restoreSGPR(MachineBasicBlock::iterator MI,		bool restoreSGPR(MachineBasicBlock::iterator MI,
int FI, RegScavenger *RS,		int FI, RegScavenger *RS,
bool OnlyToVGPR = false) const;		bool OnlyToVGPR = false) const;

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

void SIRegisterInfo::buildSpillLoadStore(MachineBasicBlock::iterator MI,
if (ScratchOffsetRegDelta != 0) {		if (ScratchOffsetRegDelta != 0) {
// Subtract the offset we added to the ScratchOffset register.		// Subtract the offset we added to the ScratchOffset register.
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_SUB_U32), SOffset)		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_SUB_U32), SOffset)
.addReg(SOffset)		.addReg(SOffset)
.addImm(ScratchOffsetRegDelta);		.addImm(ScratchOffsetRegDelta);
}		}
}		}

// Generate a VMEM access which loads or stores the VGPR containing an SGPR
// spill such that all the lanes set in VGPRLanes are loaded or stored.
// This generates exec mask manipulation and will use SGPRs available in MI
// or VGPR lanes in the VGPR to save and restore the exec mask.
void SIRegisterInfo::buildSGPRSpillLoadStore(MachineBasicBlock::iterator MI,
int Index, int Offset,
unsigned EltSize, Register VGPR,
int64_t VGPRLanes,
RegScavenger *RS,
bool IsLoad) const {
MachineBasicBlock *MBB = MI->getParent();
MachineFunction *MF = MBB->getParent();
SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
const SIInstrInfo *TII = ST.getInstrInfo();

Register SuperReg = MI->getOperand(0).getReg();
const TargetRegisterClass *RC = getPhysRegClass(SuperReg);
ArrayRef<int16_t> SplitParts = getRegSplitParts(RC, EltSize);
unsigned NumSubRegs = SplitParts.empty() ? 1 : SplitParts.size();
unsigned FirstPart = Offset * 32;
unsigned ExecLane = 0;

bool IsKill = MI->getOperand(0).isKill();
const DebugLoc &DL = MI->getDebugLoc();

// Cannot handle load/store to EXEC
assert(SuperReg != AMDGPU::EXEC_LO && SuperReg != AMDGPU::EXEC_HI &&
SuperReg != AMDGPU::EXEC && "exec should never spill");

// On Wave32 only handle EXEC_LO.
// On Wave64 only update EXEC_HI if there is sufficent space for a copy.
bool OnlyExecLo = isWave32 || NumSubRegs == 1 || SuperReg == AMDGPU::EXEC_HI;

unsigned ExecMovOpc = OnlyExecLo ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
Register ExecReg = OnlyExecLo ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
Register SavedExecReg;

// Backup EXEC
if (OnlyExecLo) {
SavedExecReg =
NumSubRegs == 1
? SuperReg
: Register(getSubReg(SuperReg, SplitParts[FirstPart + ExecLane]));
} else {
// If src/dst is an odd size it is possible subreg0 is not aligned.
for (; ExecLane < (NumSubRegs - 1); ++ExecLane) {
SavedExecReg = getMatchingSuperReg(
getSubReg(SuperReg, SplitParts[FirstPart + ExecLane]), AMDGPU::sub0,
&AMDGPU::SReg_64_XEXECRegClass);
if (SavedExecReg)
break;
}
}
assert(SavedExecReg);
BuildMI(*MBB, MI, DL, TII->get(ExecMovOpc), SavedExecReg).addReg(ExecReg);

// Setup EXEC
BuildMI(*MBB, MI, DL, TII->get(ExecMovOpc), ExecReg).addImm(VGPRLanes);

// Load/store VGPR
MachineFrameInfo &FrameInfo = MF->getFrameInfo();
assert(FrameInfo.getStackID(Index) != TargetStackID::SGPRSpill);

Register FrameReg = FrameInfo.isFixedObjectIndex(Index) && hasBasePointer(*MF)
? getBaseRegister()
: getFrameRegister(*MF);

Align Alignment = FrameInfo.getObjectAlign(Index);
MachinePointerInfo PtrInfo =
MachinePointerInfo::getFixedStack(*MF, Index);
MachineMemOperand *MMO = MF->getMachineMemOperand(
PtrInfo, IsLoad ? MachineMemOperand::MOLoad : MachineMemOperand::MOStore,
EltSize, Alignment);

if (IsLoad) {
unsigned Opc = ST.enableFlatScratch() ? AMDGPU::SCRATCH_LOAD_DWORD_SADDR
: AMDGPU::BUFFER_LOAD_DWORD_OFFSET;
buildSpillLoadStore(MI, Opc,
Index,
VGPR, false,
FrameReg,
Offset * EltSize, MMO,
RS);
} else {
unsigned Opc = ST.enableFlatScratch() ? AMDGPU::SCRATCH_STORE_DWORD_SADDR
: AMDGPU::BUFFER_STORE_DWORD_OFFSET;
buildSpillLoadStore(MI, Opc, Index, VGPR,
IsKill, FrameReg,
Offset * EltSize, MMO, RS);
// This only ever adds one VGPR spill
MFI->addToSpilledVGPRs(1);
}

// Restore EXEC
BuildMI(*MBB, MI, DL, TII->get(ExecMovOpc), ExecReg)
.addReg(SavedExecReg, getKillRegState(IsLoad || IsKill));

// Restore clobbered SGPRs
if (IsLoad) {
// Nothing to do; register will be overwritten
} else if (!IsKill) {
// Restore SGPRs from appropriate VGPR lanes
if (!OnlyExecLo) {
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32),
getSubReg(SuperReg, SplitParts[FirstPart + ExecLane + 1]))
.addReg(VGPR)
.addImm(ExecLane + 1);
}
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32),
NumSubRegs == 1 ? SavedExecReg
: Register(getSubReg(
SuperReg, SplitParts[FirstPart + ExecLane])))
.addReg(VGPR, RegState::Kill)
.addImm(ExecLane);
}
}

bool SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI,		bool SIRegisterInfo::spillSGPR(MachineBasicBlock::iterator MI,
int Index,		int Index,
RegScavenger *RS,		RegScavenger *RS,
bool OnlyToVGPR) const {		bool OnlyToVGPR) const {
MachineBasicBlock *MBB = MI->getParent();		MachineBasicBlock *MBB = MI->getParent();
MachineFunction *MF = MBB->getParent();		MachineFunction *MF = MBB->getParent();
SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();		SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();

for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
if (NumSubRegs > 1)		if (NumSubRegs > 1)
MIB.addReg(SuperReg, getKillRegState(UseKill) | RegState::Implicit);		MIB.addReg(SuperReg, getKillRegState(UseKill) | RegState::Implicit);

// FIXME: Since this spills to another register instead of an actual		// FIXME: Since this spills to another register instead of an actual
// frame index, we should delete the frame index when all references to		// frame index, we should delete the frame index when all references to
// it are fixed.		// it are fixed.
}		}
} else {		} else {
		MachineFrameInfo &FrameInfo = MF->getFrameInfo();
// Scavenged temporary VGPR to use. It must be scavenged once for any number		// Scavenged temporary VGPR to use. It must be scavenged once for any number
// of spilled subregs.		// of spilled subregs.
Register TmpVGPR = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);		Register TmpVGPR = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);
RS->setRegUsed(TmpVGPR);		RS->setRegUsed(TmpVGPR);

// SubReg carries the "Kill" flag when SubReg == SuperReg.		// SubReg carries the "Kill" flag when SubReg == SuperReg.
unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill);		unsigned SubKillState = getKillRegState((NumSubRegs == 1) && IsKill);

unsigned PerVGPR = 32;		for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
unsigned NumVGPRs = (NumSubRegs + (PerVGPR - 1)) / PerVGPR;
int64_t VGPRLanes = (1LL << std::min(PerVGPR, NumSubRegs)) - 1LL;

for (unsigned Offset = 0; Offset < NumVGPRs; ++Offset) {
unsigned TmpVGPRFlags = RegState::Undef;

// Write sub registers into the VGPR
for (unsigned i = Offset * PerVGPR,
e = std::min((Offset + 1) * PerVGPR, NumSubRegs);
i < e; ++i) {
Register SubReg = NumSubRegs == 1		Register SubReg = NumSubRegs == 1
? SuperReg		? SuperReg
: Register(getSubReg(SuperReg, SplitParts[i]));		: Register(getSubReg(SuperReg, SplitParts[i]));
		MachineInstrBuilder Mov =
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpVGPR)
		.addReg(SubReg, SubKillState);

MachineInstrBuilder WriteLane =
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_WRITELANE_B32), TmpVGPR)
.addReg(SubReg, SubKillState)
.addImm(i % PerVGPR)
.addReg(TmpVGPR, TmpVGPRFlags);
TmpVGPRFlags = 0;

// There could be undef components of a spilled super register.
// TODO: Can we detect this and skip the spill?
if (NumSubRegs > 1) {		if (NumSubRegs > 1) {
// The last implicit use of the SuperReg carries the "Kill" flag.		// The last implicit use of the SuperReg carries the "Kill" flag.
unsigned SuperKillState = 0;		unsigned SuperKillState = 0;
if (i + 1 == NumSubRegs)		if (i + 1 == e)
SuperKillState |= getKillRegState(IsKill);		SuperKillState |= getKillRegState(IsKill);
WriteLane.addReg(SuperReg, RegState::Implicit | SuperKillState);		Mov.addReg(SuperReg, RegState::Implicit | SuperKillState);
}
}		}

// Write out VGPR		Align Alignment = FrameInfo.getObjectAlign(Index);
buildSGPRSpillLoadStore(MI, Index, Offset, EltSize, TmpVGPR, VGPRLanes,		MachinePointerInfo PtrInfo =
RS, false);		MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i);
		MachineMemOperand *MMO =
		MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOStore, EltSize,
		commonAlignment(Alignment, EltSize * i));
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_SAVE))
		.addReg(TmpVGPR, RegState::Kill) // src
		.addFrameIndex(Index) // vaddr
		.addReg(MFI->getStackPtrOffsetReg()) // soffset
		.addImm(i * 4) // offset
		.addMemOperand(MMO);
}		}
}		}

MI->eraseFromParent();		MI->eraseFromParent();
MFI->addToSpilledSGPRs(NumSubRegs);		MFI->addToSpilledSGPRs(NumSubRegs);
return true;		return true;
}		}
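
The two spill layouts contrasted in the hunk above can be worked through numerically. This is a minimal sketch, not LLVM code: `PackedLayout`, `packedLayout`, and `perSubRegOffsets` are hypothetical names introduced here for illustration. Only the arithmetic (`PerVGPR`, `NumVGPRs`, `VGPRLanes`, and the `i * 4` immediate) is taken from the diff.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper type (not part of LLVM): summary of the packed scheme.
struct PackedLayout {
  unsigned NumVGPRs;  // number of temporary-VGPR stores needed
  int64_t VGPRLanes;  // lanes that carry data in each stored VGPR
};

// Left-hand column: up to 32 SGPR subregisters share one VGPR, one lane each.
PackedLayout packedLayout(unsigned NumSubRegs) {
  const unsigned PerVGPR = 32;
  unsigned NumVGPRs = (NumSubRegs + (PerVGPR - 1)) / PerVGPR;  // round up
  int64_t VGPRLanes = (1LL << std::min(PerVGPR, NumSubRegs)) - 1LL;
  return {NumVGPRs, VGPRLanes};
}

// Right-hand column: every 32-bit subregister gets its own store, and the
// immediate passed to the spill pseudo is i * 4.
std::vector<unsigned> perSubRegOffsets(unsigned NumSubRegs) {
  std::vector<unsigned> Offsets;
  for (unsigned I = 0; I < NumSubRegs; ++I)
    Offsets.push_back(I * 4);
  return Offsets;
}
```

For a 64-bit SGPR pair, `packedLayout(2)` gives one store with lane mask 3, which matches the `s_mov_b64 exec, 3` lines in the left column of partial-sgpr-to-vgpr-spills.ll, while `perSubRegOffsets(2)` gives two separate stores.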

Show All 36 Lines	for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
SIMachineFunctionInfo::SpilledReg Spill = VGPRSpills[i];		SIMachineFunctionInfo::SpilledReg Spill = VGPRSpills[i];
auto MIB = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)		auto MIB = BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)
.addReg(Spill.VGPR)		.addReg(Spill.VGPR)
.addImm(Spill.Lane);		.addImm(Spill.Lane);
if (NumSubRegs > 1 && i == 0)		if (NumSubRegs > 1 && i == 0)
MIB.addReg(SuperReg, RegState::ImplicitDefine);		MIB.addReg(SuperReg, RegState::ImplicitDefine);
}		}
} else {		} else {
		MachineFrameInfo &FrameInfo = MF->getFrameInfo();
		// Scavenged temporary VGPR to use. It must be scavenged once for any number
		// of spilled subregs.
Register TmpVGPR = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);		Register TmpVGPR = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);
RS->setRegUsed(TmpVGPR);		RS->setRegUsed(TmpVGPR);

unsigned PerVGPR = 32;		for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variables 'i' and 'e' [readability-identifier-naming]
unsigned NumVGPRs = (NumSubRegs + (PerVGPR - 1)) / PerVGPR;
int64_t VGPRLanes = (1LL << std::min(PerVGPR, NumSubRegs)) - 1LL;

for (unsigned Offset = 0; Offset < NumVGPRs; ++Offset) {
// Load in VGPR data
buildSGPRSpillLoadStore(MI, Index, Offset, EltSize, TmpVGPR, VGPRLanes,
RS, true);

// Unpack lanes
for (unsigned i = Offset * PerVGPR,
e = std::min((Offset + 1) * PerVGPR, NumSubRegs);
i < e; ++i) {
Register SubReg = NumSubRegs == 1		Register SubReg = NumSubRegs == 1
? SuperReg		? SuperReg
: Register(getSubReg(SuperReg, SplitParts[i]));		: Register(getSubReg(SuperReg, SplitParts[i]));
		Align Alignment = FrameInfo.getObjectAlign(Index);

		MachinePointerInfo PtrInfo =
		MachinePointerInfo::getFixedStack(*MF, Index, EltSize * i);

		MachineMemOperand *MMO =
		MF->getMachineMemOperand(PtrInfo, MachineMemOperand::MOLoad, EltSize,
		commonAlignment(Alignment, EltSize * i));

		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::SI_SPILL_V32_RESTORE), TmpVGPR)
		.addFrameIndex(Index) // vaddr
		.addReg(MFI->getStackPtrOffsetReg()) // soffset
		.addImm(i * 4) // offset
		.addMemOperand(MMO);

bool LastSubReg = (i + 1 == e);
auto MIB =		auto MIB =
BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READFIRSTLANE_B32), SubReg)
.addReg(TmpVGPR, getKillRegState(LastSubReg))		.addReg(TmpVGPR, RegState::Kill);
.addImm(i);
if (NumSubRegs > 1 && i == 0)		if (NumSubRegs > 1)
MIB.addReg(SuperReg, RegState::ImplicitDefine);		MIB.addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine);
		sebastian-ne: Shouldn’t this still define SuperReg on the first v_readfirstlane, like before?
}
}		}
}		}

MI->eraseFromParent();		MI->eraseFromParent();
return true;		return true;
}		}
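
The restore path in the right-hand column relies on the argument from the summary: the value is broadcast with `v_mov_b32` at the spill point and read back with `v_readfirstlane_b32`, so it survives as long as the reload exec mask is a non-empty subset of the spill exec mask. The sketch below is a simplified software model of that argument, not LLVM or hardware code. All function names are hypothetical, scratch memory is modeled as one dword per lane, and lanes never written read as zero here, whereas on hardware they would hold unspecified data.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned WaveSize = 64;
using VGPR = std::array<uint32_t, WaveSize>;

// Model of v_mov_b32 vdst, ssrc: writes the scalar to every lane enabled in Exec.
void vMovFromSGPR(VGPR &Dst, uint32_t Src, uint64_t Exec) {
  for (unsigned L = 0; L < WaveSize; ++L)
    if (Exec & (1ULL << L))
      Dst[L] = Src;
}

// Model of a per-lane scratch access: only lanes enabled in Exec are copied.
void scratchCopy(VGPR &Dst, const VGPR &Src, uint64_t Exec) {
  for (unsigned L = 0; L < WaveSize; ++L)
    if (Exec & (1ULL << L))
      Dst[L] = Src[L];
}

// Model of v_readfirstlane_b32: reads the lowest enabled lane, or lane 0
// when no lane is enabled.
uint32_t vReadFirstLane(const VGPR &Src, uint64_t Exec) {
  for (unsigned L = 0; L < WaveSize; ++L)
    if (Exec & (1ULL << L))
      return Src[L];
  return Src[0];
}

// Spill under SpillExec, reload under ReloadExec, return the reloaded SGPR.
uint32_t spillReloadRoundTrip(uint32_t Value, uint64_t SpillExec,
                              uint64_t ReloadExec) {
  VGPR Tmp{}, Slot{}, Reloaded{};
  vMovFromSGPR(Tmp, Value, SpillExec);       // broadcast into active lanes
  scratchCopy(Slot, Tmp, SpillExec);         // store: active lanes only
  scratchCopy(Reloaded, Slot, ReloadExec);   // load: active lanes only
  return vReadFirstLane(Reloaded, ReloadExec);
}
```

In this model, `spillReloadRoundTrip(0xDEADBEEF, 0xFF00, 0x0300)` returns the original value because every reload lane was also written at the spill point. With a reload mask such as `0x1`, which is not a subset of `0xFF00`, the first active lane was never stored and the value is lost.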

/// Special case of eliminateFrameIndex. Returns true if the SGPR was spilled to		/// Special case of eliminateFrameIndex. Returns true if the SGPR was spilled to

llvm/test/CodeGen/AMDGPU/control-flow-fastregalloc.ll

	Show All 17 Lines
	; GCN: ds_read_b32 [[LOAD0:v[0-9]+]]			; GCN: ds_read_b32 [[LOAD0:v[0-9]+]]

	; Spill load			; Spill load
	; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], 0 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], 0 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: v_cmp_eq_u32_e64 [[CMP0:s\[[0-9]+:[0-9]\]]], s{{[0-9]+}}, v0			; GCN: v_cmp_eq_u32_e64 [[CMP0:s\[[0-9]+:[0-9]\]]], s{{[0-9]+}}, v0

	; Spill saved exec			; Spill saved exec
	; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec			; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec

	; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]

	; VMEM: v_writelane_b32 v[[V_SAVEEXEC:[0-9]+]], s[[SAVEEXEC_LO]], 0			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: v_writelane_b32 v[[V_SAVEEXEC]], s[[SAVEEXEC_HI]], 1			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], 0 offset:4 ; 4-byte Folded Spill
	; VMEM: buffer_store_dword v[[V_SAVEEXEC]], off, s[0:3], 0 offset:[[V_EXEC_SPILL_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
				; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], 0 offset:8 ; 4-byte Folded Spill

	; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}, [[CMP0]]			; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}, [[CMP0]]
	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}

	; GCN: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: ; %bb.{{[0-9]+}}: ; %if			; GCN: ; %bb.{{[0-9]+}}: ; %if
	; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword [[RELOAD_LOAD0:v[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: s_mov_b32 m0, -1			; GCN: s_mov_b32 m0, -1
	; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]			; GCN: ds_read_b32 [[LOAD1:v[0-9]+]]
	; GCN: s_waitcnt vmcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) lgkmcnt(0)


	; Spill val register			; Spill val register
	; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]			; GCN: v_add_i32_e32 [[VAL:v[0-9]+]], vcc, [[LOAD1]], [[RELOAD_LOAD0]]
	; GCN: buffer_store_dword [[VAL]], off, s[0:3], 0 offset:[[VAL_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL]], off, s[0:3], 0 offset:[[VAL_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; VMEM: [[ENDIF]]:			; VMEM: [[ENDIF]]:

	; Reload and restore exec mask			; Reload and restore exec mask
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]

	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC:[0-9]+]], off, s[0:3], 0 offset:[[V_EXEC_SPILL_OFFSET]] ; 4-byte Folded Reload			; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], 0 offset:4 ; 4-byte Folded Reload
				; VMEM: s_waitcnt vmcnt(0)
				; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC_LO]]
				; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_HI:[0-9]+]], off, s[0:3], 0 offset:8 ; 4-byte Folded Reload
	; VMEM: s_waitcnt vmcnt(0)			; VMEM: s_waitcnt vmcnt(0)
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 0			; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC_HI]]
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 1

	; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}

	; Restore val			; Restore val
	; GCN: buffer_load_dword [[RELOAD_VAL:v[0-9]+]], off, s[0:3], 0 offset:[[VAL_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword [[RELOAD_VAL:v[0-9]+]], off, s[0:3], 0 offset:[[VAL_OFFSET]] ; 4-byte Folded Reload

	; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, [[RELOAD_VAL]]			; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, [[RELOAD_VAL]]
	define amdgpu_kernel void @divergent_if_endif(i32 addrspace(1)* %out) #0 {			define amdgpu_kernel void @divergent_if_endif(i32 addrspace(1)* %out) #0 {
	Show All 26 Lines
	; Spill load			; Spill load
	; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], 0 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[LOAD0]], off, s[0:3], 0 offset:[[LOAD0_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec			; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec

	; Spill saved exec			; Spill saved exec
	; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]

	; VMEM: v_writelane_b32 v[[V_SAVEEXEC:[0-9]+]], s[[SAVEEXEC_LO]], 0			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: v_writelane_b32 v[[V_SAVEEXEC]], s[[SAVEEXEC_HI]], 1			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], 0 offset:4 ; 4-byte Folded Spill
	; VMEM: buffer_store_dword v[[V_SAVEEXEC]], off, s[0:3], 0 offset:[[V_EXEC_SPILL_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
				; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], 0 offset:8 ; 4-byte Folded Spill


	; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, [[CMP0]]			; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, [[CMP0]]
	; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}			; GCN: s_mov_b64 exec, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}
	; GCN-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[END:BB[0-9]+_[0-9]+]]


	; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP:BB[0-9]+_[0-9]+]]:
	; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[VAL_LOOP_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]			; GCN: v_subrev_i32_e32 [[VAL_LOOP:v[0-9]+]], vcc, v{{[0-9]+}}, v[[VAL_LOOP_RELOAD]]
	; GCN: s_cmp_lg_u32			; GCN: s_cmp_lg_u32
	; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], 0 offset:{{[0-9]+}} ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], 0 offset:{{[0-9]+}} ; 4-byte Folded Spill
	; GCN-NEXT: s_cbranch_scc1 [[LOOP]]			; GCN-NEXT: s_cbranch_scc1 [[LOOP]]

	; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], 0 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[VAL_LOOP]], off, s[0:3], 0 offset:[[VAL_SUB_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: [[END]]:			; GCN: [[END]]:
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]

	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC:[0-9]+]], off, s[0:3], 0 offset:[[V_EXEC_SPILL_OFFSET]] ; 4-byte Folded Reload			; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], 0 offset:4 ; 4-byte Folded Reload
				; VMEM: s_waitcnt vmcnt(0)
				; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC_LO]]
				; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_HI:[0-9]+]], off, s[0:3], 0 offset:8 ; 4-byte Folded Reload
	; VMEM: s_waitcnt vmcnt(0)			; VMEM: s_waitcnt vmcnt(0)
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 0			; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC_HI]]
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 1

	; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}
	; GCN: buffer_load_dword v[[VAL_END:[0-9]+]], off, s[0:3], 0 offset:[[VAL_SUB_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[VAL_END:[0-9]+]], off, s[0:3], 0 offset:[[VAL_SUB_OFFSET]] ; 4-byte Folded Reload

	; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, v[[VAL_END]]			; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, v[[VAL_END]]
	define amdgpu_kernel void @divergent_loop(i32 addrspace(1)* %out) #0 {			define amdgpu_kernel void @divergent_loop(i32 addrspace(1)* %out) #0 {
	entry:			entry:
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	Show All 32 Lines
	; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec			; GCN: s_mov_b64 s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, exec
	; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, [[CMP0]]			; GCN: s_and_b64 s{{\[}}[[ANDEXEC_LO:[0-9]+]]:[[ANDEXEC_HI:[0-9]+]]{{\]}}, s{{\[}}[[SAVEEXEC_LO:[0-9]+]]:[[SAVEEXEC_HI:[0-9]+]]{{\]}}, [[CMP0]]
	; GCN: s_xor_b64 s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}, s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}			; GCN: s_xor_b64 s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}, s{{\[}}[[ANDEXEC_LO]]:[[ANDEXEC_HI]]{{\]}}, s{{\[}}[[SAVEEXEC_LO]]:[[SAVEEXEC_HI]]{{\]}}

	; Spill saved exec			; Spill saved exec
	; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR:v[0-9]+]], s[[SAVEEXEC_LO]], [[SAVEEXEC_LO_LANE:[0-9]+]]
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[SAVEEXEC_HI]], [[SAVEEXEC_HI_LANE:[0-9]+]]

	; VMEM: v_writelane_b32 v[[V_SAVEEXEC:[0-9]+]], s[[SAVEEXEC_LO]], 0			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_LO:[0-9]+]], s[[SAVEEXEC_LO]]
	; VMEM: v_writelane_b32 v[[V_SAVEEXEC]], s[[SAVEEXEC_HI]], 1			; VMEM: buffer_store_dword v[[V_SAVEEXEC_LO]], off, s[0:3], 0 offset:[[SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: buffer_store_dword v[[V_SAVEEXEC]], off, s[0:3], 0 offset:[[SAVEEXEC_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: v_mov_b32_e32 v[[V_SAVEEXEC_HI:[0-9]+]], s[[SAVEEXEC_HI]]
				; VMEM: buffer_store_dword v[[V_SAVEEXEC_HI]], off, s[0:3], 0 offset:[[SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_mov_b64 exec, [[CMP0]]			; GCN: s_mov_b64 exec, [[CMP0]]

	; FIXME: It makes no sense to put this skip here			; FIXME: It makes no sense to put this skip here
	; GCN: s_cbranch_execz [[FLOW:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_execz [[FLOW:BB[0-9]+_[0-9]+]]
	; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ELSE:BB[0-9]+_[0-9]+]]

	; GCN: [[FLOW]]: ; %Flow			; GCN: [[FLOW]]: ; %Flow
	; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[SAVEEXEC_HI_LANE]]

	; VMEM: buffer_load_dword v[[FLOW_V_RELOAD_SAVEEXEC:[0-9]+]], off, s[0:3], 0 offset:[[SAVEEXEC_OFFSET]]			; VMEM: buffer_load_dword v[[FLOW_V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], 0 offset:[[SAVEEXEC_LO_OFFSET]]
	; VMEM: s_waitcnt vmcnt(0)			; VMEM: s_waitcnt vmcnt(0)
	; VMEM: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[FLOW_V_RELOAD_SAVEEXEC]], 0			; VMEM: v_readfirstlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[FLOW_V_RELOAD_SAVEEXEC_LO]]
	; VMEM: v_readlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[FLOW_V_RELOAD_SAVEEXEC]], 1			; VMEM: buffer_load_dword v[[FLOW_V_RELOAD_SAVEEXEC_HI:[0-9]+]], off, s[0:3], 0 offset:[[SAVEEXEC_HI_OFFSET]] ; 4-byte Folded Reload
				; VMEM: s_waitcnt vmcnt(0)
				; VMEM: v_readfirstlane_b32 s[[FLOW_S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[FLOW_V_RELOAD_SAVEEXEC_HI]]

	; GCN: s_or_saveexec_b64 s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO_SAVEEXEC:[0-9]+]]:[[FLOW_S_RELOAD_SAVEEXEC_HI_SAVEEXEC:[0-9]+]]{{\]}}, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_or_saveexec_b64 s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO_SAVEEXEC:[0-9]+]]:[[FLOW_S_RELOAD_SAVEEXEC_HI_SAVEEXEC:[0-9]+]]{{\]}}, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO]]:[[FLOW_S_RELOAD_SAVEEXEC_HI]]{{\]}}

	; Regular spill value restored after exec modification			; Regular spill value restored after exec modification
	; GCN: buffer_load_dword [[FLOW_VAL:v[0-9]+]], off, s[0:3], 0 offset:[[FLOW_VAL_OFFSET:[0-9]+]] ; 4-byte Folded Reload			; GCN: buffer_load_dword [[FLOW_VAL:v[0-9]+]], off, s[0:3], 0 offset:[[FLOW_VAL_OFFSET:[0-9]+]] ; 4-byte Folded Reload
	; Followed by spill			; Followed by spill
	; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], 0 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[FLOW_VAL]], off, s[0:3], 0 offset:[[RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_and_b64 s{{\[}}[[FLOW_AND_EXEC_LO:[0-9]+]]:[[FLOW_AND_EXEC_HI:[0-9]+]]{{\]}}, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO_SAVEEXEC]]:[[FLOW_S_RELOAD_SAVEEXEC_HI_SAVEEXEC]]{{\]}}			; GCN: s_and_b64 s{{\[}}[[FLOW_AND_EXEC_LO:[0-9]+]]:[[FLOW_AND_EXEC_HI:[0-9]+]]{{\]}}, exec, s{{\[}}[[FLOW_S_RELOAD_SAVEEXEC_LO_SAVEEXEC]]:[[FLOW_S_RELOAD_SAVEEXEC_HI_SAVEEXEC]]{{\]}}

	; Spill saved exec			; Spill saved exec
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[FLOW_AND_EXEC_LO]], [[FLOW_SAVEEXEC_LO_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[FLOW_AND_EXEC_LO]], [[FLOW_SAVEEXEC_LO_LANE:[0-9]+]]
	; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[FLOW_AND_EXEC_HI]], [[FLOW_SAVEEXEC_HI_LANE:[0-9]+]]			; VGPR: v_writelane_b32 [[SPILL_VGPR]], s[[FLOW_AND_EXEC_HI]], [[FLOW_SAVEEXEC_HI_LANE:[0-9]+]]

	; VMEM: v_writelane_b32 v[[FLOW_V_SAVEEXEC:[0-9]+]], s[[FLOW_AND_EXEC_LO]], 0			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_LO:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_LO]]
	; VMEM: v_writelane_b32 v[[FLOW_V_SAVEEXEC]], s[[FLOW_AND_EXEC_HI]], 1			; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_LO]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_LO_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; VMEM: v_mov_b32_e32 v[[FLOW_V_SAVEEXEC_HI:[0-9]+]], s[[FLOW_S_RELOAD_SAVEEXEC_HI]]
				; VMEM: buffer_store_dword v[[FLOW_V_SAVEEXEC_HI]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_HI_OFFSET:[0-9]+]] ; 4-byte Folded Spill

	; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_AND_EXEC_LO]]:[[FLOW_AND_EXEC_HI]]{{\]}}			; GCN: s_xor_b64 exec, exec, s{{\[}}[[FLOW_AND_EXEC_LO]]:[[FLOW_AND_EXEC_HI]]{{\]}}
	; GCN-NEXT: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_cbranch_execz [[ENDIF:BB[0-9]+_[0-9]+]]


	; GCN: ; %bb.{{[0-9]+}}: ; %if			; GCN: ; %bb.{{[0-9]+}}: ; %if
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: ds_read_b32			; GCN: ds_read_b32
	; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]			; GCN: v_add_i32_e32 [[ADD:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]
	; GCN: buffer_store_dword [[ADD]], off, s[0:3], 0 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[ADD]], off, s[0:3], 0 offset:[[RESULT_OFFSET]] ; 4-byte Folded Spill
	; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN-NEXT: s_branch [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ELSE]]: ; %else			; GCN: [[ELSE]]: ; %else
	; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[LOAD0_RELOAD:[0-9]+]], off, s[0:3], 0 offset:[[LOAD0_OFFSET]] ; 4-byte Folded Reload
	; GCN: v_subrev_i32_e32 [[SUB:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]			; GCN: v_subrev_i32_e32 [[SUB:v[0-9]+]], vcc, v{{[0-9]+}}, v[[LOAD0_RELOAD]]
	; GCN: buffer_store_dword [[ADD]], off, s[0:3], 0 offset:[[FLOW_RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill			; GCN: buffer_store_dword [[ADD]], off, s[0:3], 0 offset:[[FLOW_RESULT_OFFSET:[0-9]+]] ; 4-byte Folded Spill
	; GCN-NEXT: s_branch [[FLOW]]			; GCN-NEXT: s_branch [[FLOW]]

	; GCN: [[ENDIF]]:			; GCN: [[ENDIF]]:
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_LO_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_LO_LANE]]
	; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_HI_LANE]]			; VGPR: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], [[SPILL_VGPR]], [[FLOW_SAVEEXEC_HI_LANE]]

				; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_LO:[0-9]+]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_LO_OFFSET]] ; 4-byte Folded Reload
	; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC:[0-9]+]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_OFFSET]] ; 4-byte Folded Reload			; VMEM: s_waitcnt vmcnt(0)
				; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC_LO]]
				; VMEM: buffer_load_dword v[[V_RELOAD_SAVEEXEC_HI:[0-9]+]], off, s[0:3], 0 offset:[[FLOW_SAVEEXEC_HI_OFFSET]] ; 4-byte Folded Reload
	; VMEM: s_waitcnt vmcnt(0)			; VMEM: s_waitcnt vmcnt(0)
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_LO:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 0			; VMEM: v_readfirstlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC_HI]]
	; VMEM: v_readlane_b32 s[[S_RELOAD_SAVEEXEC_HI:[0-9]+]], v[[V_RELOAD_SAVEEXEC]], 1

	; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}			; GCN: s_or_b64 exec, exec, s{{\[}}[[S_RELOAD_SAVEEXEC_LO]]:[[S_RELOAD_SAVEEXEC_HI]]{{\]}}

	; GCN: buffer_load_dword v[[RESULT:[0-9]+]], off, s[0:3], 0 offset:[[RESULT_OFFSET]] ; 4-byte Folded Reload			; GCN: buffer_load_dword v[[RESULT:[0-9]+]], off, s[0:3], 0 offset:[[RESULT_OFFSET]] ; 4-byte Folded Reload
	; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, v[[RESULT]]			; GCN: flat_store_dword v{{\[[0-9]+:[0-9]+\]}}, v[[RESULT]]
	define amdgpu_kernel void @divergent_if_else_endif(i32 addrspace(1)* %out) #0 {			define amdgpu_kernel void @divergent_if_else_endif(i32 addrspace(1)* %out) #0 {
	entry:			entry:
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	Show All 24 Lines

llvm/test/CodeGen/AMDGPU/frame-setup-without-sgpr-to-vgpr-spills.ll

	Show All 9 Lines
	; SPILL-TO-VGPR: v_writelane_b32 v40, s33, 2			; SPILL-TO-VGPR: v_writelane_b32 v40, s33, 2
	; NO-SPILL-TO-VGPR: v_mov_b32_e32 v0, s33			; NO-SPILL-TO-VGPR: v_mov_b32_e32 v0, s33
	; NO-SPILL-TO-VGPR: buffer_store_dword v0, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill			; NO-SPILL-TO-VGPR: buffer_store_dword v0, off, s[0:3], s32 offset:12 ; 4-byte Folded Spill

	; GCN: s_swappc_b64 s[30:31], s[4:5]			; GCN: s_swappc_b64 s[30:31], s[4:5]

	; SPILL-TO-VGPR: v_readlane_b32 s4, v40, 0			; SPILL-TO-VGPR: v_readlane_b32 s4, v40, 0
	; SPILL-TO-VGPR: v_readlane_b32 s5, v40, 1			; SPILL-TO-VGPR: v_readlane_b32 s5, v40, 1
	; NO-SPILL-TO-VGPR: v_readlane_b32 s4, v1, 0			; NO-SPILL-TO_VGPR: v_mov_b32_e32 v1, s30
	; NO-SPILL-TO-VGPR: v_readlane_b32 s5, v1, 1			; NO-SPILL-TO-VGPR: buffer_load_dword v1, off, s[0:3], s33 offset:4 ; 4-byte Folded Reload
				; NO-SPILL-TO_VGPR: v_mov_b32_e32 v1, s31
				; NO-SPILL-TO-VGPR: buffer_load_dword v1, off, s[0:3], s33 offset:8 ; 4-byte Folded Reload

	; SPILL-TO-VGPR: v_readlane_b32 s33, v40, 2			; SPILL-TO-VGPR: v_readlane_b32 s33, v40, 2
	; NO-SPILL-TO-VGPR: buffer_load_dword v0, off, s[0:3], s32 offset:12 ; 4-byte Folded Reload			; NO-SPILL-TO-VGPR: buffer_load_dword v0, off, s[0:3], s32 offset:12 ; 4-byte Folded Reload
	; NO-SPILL-TO-VGPR: v_readfirstlane_b32 s33, v0			; NO-SPILL-TO-VGPR: v_readfirstlane_b32 s33, v0
	define void @callee_with_stack_and_call() #0 {			define void @callee_with_stack_and_call() #0 {
	%alloca = alloca i32, addrspace(5)			%alloca = alloca i32, addrspace(5)
	store volatile i32 0, i32 addrspace(5)* %alloca			store volatile i32 0, i32 addrspace(5)* %alloca
	call void @external_void_func_void()			call void @external_void_func_void()
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }

llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll

	; GCN-NEXT: v_writelane_b32 v31, s15, 59			; GCN-NEXT: v_writelane_b32 v31, s15, 59
	; GCN-NEXT: v_writelane_b32 v31, s16, 60			; GCN-NEXT: v_writelane_b32 v31, s16, 60
	; GCN-NEXT: v_writelane_b32 v31, s17, 61			; GCN-NEXT: v_writelane_b32 v31, s17, 61
	; GCN-NEXT: v_writelane_b32 v31, s18, 62			; GCN-NEXT: v_writelane_b32 v31, s18, 62
	; GCN-NEXT: v_writelane_b32 v31, s19, 63			; GCN-NEXT: v_writelane_b32 v31, s19, 63
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ; def s[2:3]			; GCN-NEXT: ; def s[2:3]
	; GCN-NEXT: ;;#ASMEND			; GCN-NEXT: ;;#ASMEND
	; GCN-NEXT: v_writelane_b32 v0, s2, 0			; GCN-NEXT: v_mov_b32_e32 v0, s2
	; GCN-NEXT: v_writelane_b32 v0, s3, 1
	; GCN-NEXT: s_mov_b64 s[2:3], exec
	; GCN-NEXT: s_mov_b64 exec, 3
	; GCN-NEXT: buffer_store_dword v0, off, s[52:55], 0 offset:4 ; 4-byte Folded Spill			; GCN-NEXT: buffer_store_dword v0, off, s[52:55], 0 offset:4 ; 4-byte Folded Spill
	; GCN-NEXT: s_mov_b64 exec, s[2:3]			; GCN-NEXT: v_mov_b32_e32 v0, s3
				; GCN-NEXT: buffer_store_dword v0, off, s[52:55], 0 offset:8 ; 4-byte Folded Spill
	; GCN-NEXT: s_mov_b32 s1, 0			; GCN-NEXT: s_mov_b32 s1, 0
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: s_cmp_lg_u32 s0, s1			; GCN-NEXT: s_cmp_lg_u32 s0, s1
	; GCN-NEXT: s_cbranch_scc1 BB2_2			; GCN-NEXT: s_cbranch_scc1 BB2_2
	; GCN-NEXT: ; %bb.1: ; %bb0			; GCN-NEXT: ; %bb.1: ; %bb0
	; GCN-NEXT: v_readlane_b32 s36, v31, 32			; GCN-NEXT: v_readlane_b32 s36, v31, 32
	; GCN-NEXT: v_readlane_b32 s37, v31, 33			; GCN-NEXT: v_readlane_b32 s37, v31, 33
	; GCN-NEXT: v_readlane_b32 s38, v31, 34			; GCN-NEXT: v_readlane_b32 s38, v31, 34
	; GCN-NEXT: v_readlane_b32 s12, v31, 56			; GCN-NEXT: v_readlane_b32 s12, v31, 56
	; GCN-NEXT: v_readlane_b32 s13, v31, 57			; GCN-NEXT: v_readlane_b32 s13, v31, 57
	; GCN-NEXT: v_readlane_b32 s14, v31, 58			; GCN-NEXT: v_readlane_b32 s14, v31, 58
	; GCN-NEXT: v_readlane_b32 s15, v31, 59			; GCN-NEXT: v_readlane_b32 s15, v31, 59
	; GCN-NEXT: v_readlane_b32 s16, v31, 60			; GCN-NEXT: v_readlane_b32 s16, v31, 60
	; GCN-NEXT: v_readlane_b32 s17, v31, 61			; GCN-NEXT: v_readlane_b32 s17, v31, 61
	; GCN-NEXT: v_readlane_b32 s18, v31, 62			; GCN-NEXT: v_readlane_b32 s18, v31, 62
	; GCN-NEXT: v_readlane_b32 s19, v31, 63			; GCN-NEXT: v_readlane_b32 s19, v31, 63
	; GCN-NEXT: s_mov_b64 s[0:1], exec
	; GCN-NEXT: s_mov_b64 exec, 3
	; GCN-NEXT: buffer_load_dword v0, off, s[52:55], 0 offset:4 ; 4-byte Folded Reload			; GCN-NEXT: buffer_load_dword v0, off, s[52:55], 0 offset:4 ; 4-byte Folded Reload
	; GCN-NEXT: s_mov_b64 exec, s[0:1]
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_readlane_b32 s0, v0, 0			; GCN-NEXT: v_readfirstlane_b32 s0, v0
	; GCN-NEXT: v_readlane_b32 s1, v0, 1			; GCN-NEXT: buffer_load_dword v0, off, s[52:55], 0 offset:8 ; 4-byte Folded Reload
				; GCN-NEXT: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_readfirstlane_b32 s1, v0
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ; use s[36:51]			; GCN-NEXT: ; use s[36:51]
	; GCN-NEXT: ;;#ASMEND			; GCN-NEXT: ;;#ASMEND
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ; use s[4:19]			; GCN-NEXT: ; use s[4:19]
	; GCN-NEXT: ;;#ASMEND			; GCN-NEXT: ;;#ASMEND
	; GCN-NEXT: ;;#ASMSTART			; GCN-NEXT: ;;#ASMSTART
	; GCN-NEXT: ; use s[0:1]			; GCN-NEXT: ; use s[0:1]
	Show All 32 Lines

llvm/test/CodeGen/AMDGPU/sgpr-spill.mir

This file was deleted.

	# RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=CHECK,GCN64,MUBUF %s
	# RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32,-wavefrontsize64 -verify-machineinstrs -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=CHECK,GCN32,MUBUF %s
	# RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs -amdgpu-enable-flat-scratch -run-pass=prologepilog %s -o - | FileCheck -check-prefixes=CHECK,GCN64,FLATSCR %s


	# CHECK-LABEL: name: check_spill

	# FLATSCR: $sgpr33 = S_MOV_B32 0
	# FLATSCR: $flat_scr_lo = S_ADD_U32 $sgpr0, $sgpr11, implicit-def $scc
	# FLATSCR: $flat_scr_hi = S_ADDC_U32 $sgpr1, 0, implicit-def $scc, implicit $scc

	# S32 with kill
	# CHECK: V_WRITELANE
	# CHECK: $sgpr12 = S_MOV_B32 $exec_lo
	# CHECK: $exec_lo = S_MOV_B32 1
	# MUBUF: BUFFER_STORE_DWORD_OFFSET killed $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 4
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR killed $vgpr{{[0-9]+}}, $sgpr33, 4
	# CHECK: $exec_lo = S_MOV_B32 killed $sgpr12

	# S32 without kill
	# CHECK: V_WRITELANE
	# CHECK: $sgpr12 = S_MOV_B32 $exec_lo
	# CHECK: $exec_lo = S_MOV_B32 1
	# MUBUF: BUFFER_STORE_DWORD_OFFSET $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 4
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR $vgpr{{[0-9]+}}, $sgpr33, 4
	# CHECK: $sgpr12 = V_READLANE

	# S64 with kill
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 3
	# GCN64: $exec = S_MOV_B64 3
	# MUBUF: BUFFER_STORE_DWORD_OFFSET killed $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 8
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR killed $vgpr{{[0-9]+}}, $sgpr33, 8
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S64 without kill
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 3
	# GCN64: $exec = S_MOV_B64 3
	# MUBUF: BUFFER_STORE_DWORD_OFFSET $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 8
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR $vgpr{{[0-9]+}}, $sgpr33, 8
	# GCN32: $exec_lo = S_MOV_B32 $sgpr12
	# GCN64: $exec = S_MOV_B64 $sgpr12_sgpr13
	# GCN64: $sgpr13 = V_READLANE
	# CHECK: $sgpr12 = V_READLANE

	# S96
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 7
	# GCN64: $exec = S_MOV_B64 7
	# MUBUF: BUFFER_STORE_DWORD_OFFSET killed $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 16
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR killed $vgpr{{[0-9]+}}, $sgpr33, 16
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S128
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 15
	# GCN64: $exec = S_MOV_B64 15
	# MUBUF: BUFFER_STORE_DWORD_OFFSET killed $vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 28
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR killed $vgpr{{[0-9]+}}, $sgpr33, 28
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S160
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 31
	# GCN64: $exec = S_MOV_B64 31
	# MUBUF: BUFFER_STORE_DWORD_OFFSET {{(killed )?}}$vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 44
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR {{(killed )?}}$vgpr{{[0-9]+}}, $sgpr33, 44
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S256
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 255
	# GCN64: $exec = S_MOV_B64 255
	# MUBUF: BUFFER_STORE_DWORD_OFFSET {{(killed )?}}$vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 64
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR {{(killed )?}}$vgpr{{[0-9]+}}, $sgpr33, 64
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S512
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 65535
	# GCN64: $exec = S_MOV_B64 65535
	# MUBUF: BUFFER_STORE_DWORD_OFFSET {{(killed )?}}$vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 96
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR {{(killed )?}}$vgpr{{[0-9]+}}, $sgpr33, 96
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13

	# S1024
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# CHECK: V_WRITELANE
	# GCN32: $sgpr64 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr64_sgpr65 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 4294967295
	# GCN64: $exec = S_MOV_B64 4294967295
	# MUBUF: BUFFER_STORE_DWORD_OFFSET {{(killed )?}}$vgpr{{[0-9]+}}, ${{(sgpr[0-9_]+)*}}, $sgpr33, 160
	# FLATSCR: SCRATCH_STORE_DWORD_SADDR {{(killed )?}}$vgpr{{[0-9]+}}, $sgpr33, 160
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr64
	# GCN64: $exec = S_MOV_B64 killed $sgpr64_sgpr65

	--- |

	define amdgpu_kernel void @check_spill() #0 {
	ret void
	}

	define amdgpu_kernel void @check_reload() #0 {
	ret void
	}

	attributes #0 = { "frame-pointer"="all" }
	...
	---
	name: check_spill
	tracksRegLiveness: true
	liveins:
	- { reg: '$sgpr4_sgpr5' }
	- { reg: '$sgpr6_sgpr7' }
	- { reg: '$sgpr8' }
	frameInfo:
	maxAlignment: 4
	stack:
	- { id: 0, type: spill-slot, size: 4, alignment: 4 }
	- { id: 1, type: spill-slot, size: 8, alignment: 4 }
	- { id: 2, type: spill-slot, size: 12, alignment: 4 }
	- { id: 3, type: spill-slot, size: 16, alignment: 4 }
	- { id: 4, type: spill-slot, size: 20, alignment: 4 }
	- { id: 5, type: spill-slot, size: 32, alignment: 4 }
	- { id: 6, type: spill-slot, size: 64, alignment: 4 }
	- { id: 7, type: spill-slot, size: 128, alignment: 4 }
	machineFunctionInfo:
	explicitKernArgSize: 660
	maxKernArgAlign: 4
	isEntryFunction: true
	waveLimiter: true
	scratchRSrcReg: '$sgpr96_sgpr97_sgpr98_sgpr99'
	stackPtrOffsetReg: '$sgpr32'
	frameOffsetReg: '$sgpr33'
	argumentInfo:
	flatScratchInit: { reg: '$sgpr0_sgpr1' }
	dispatchPtr: { reg: '$sgpr2_sgpr3' }
	privateSegmentBuffer: { reg: '$sgpr4_sgpr5_sgpr6_sgpr7' }
	kernargSegmentPtr: { reg: '$sgpr8_sgpr9' }
	workGroupIDX: { reg: '$sgpr10' }
	privateSegmentWaveByteOffset: { reg: '$sgpr11' }
	body: |
	bb.0:
	liveins: $sgpr8, $sgpr4_sgpr5, $sgpr6_sgpr7

	renamable $sgpr12 = IMPLICIT_DEF
	SI_SPILL_S32_SAVE killed $sgpr12, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12 = IMPLICIT_DEF
	SI_SPILL_S32_SAVE $sgpr12, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13 = IMPLICIT_DEF
	SI_SPILL_S64_SAVE killed $sgpr12_sgpr13, %stack.1, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13 = IMPLICIT_DEF
	SI_SPILL_S64_SAVE $sgpr12_sgpr13, %stack.1, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14 = IMPLICIT_DEF
	SI_SPILL_S96_SAVE killed $sgpr12_sgpr13_sgpr14, %stack.2, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15 = IMPLICIT_DEF
	SI_SPILL_S128_SAVE killed $sgpr12_sgpr13_sgpr14_sgpr15, %stack.3, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16 = IMPLICIT_DEF
	SI_SPILL_S160_SAVE killed $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16, %stack.4, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19 = IMPLICIT_DEF
	SI_SPILL_S256_SAVE killed $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19, %stack.5, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19_sgpr20_sgpr21_sgpr22_sgpr23_sgpr24_sgpr25_sgpr26_sgpr27 = IMPLICIT_DEF
	SI_SPILL_S512_SAVE killed $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19_sgpr20_sgpr21_sgpr22_sgpr23_sgpr24_sgpr25_sgpr26_sgpr27, %stack.6, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr64_sgpr65_sgpr66_sgpr67_sgpr68_sgpr69_sgpr70_sgpr71_sgpr72_sgpr73_sgpr74_sgpr75_sgpr76_sgpr77_sgpr78_sgpr79_sgpr80_sgpr81_sgpr82_sgpr83_sgpr84_sgpr85_sgpr86_sgpr87_sgpr88_sgpr89_sgpr90_sgpr91_sgpr92_sgpr93_sgpr94_sgpr95 = IMPLICIT_DEF
	SI_SPILL_S1024_SAVE killed $sgpr64_sgpr65_sgpr66_sgpr67_sgpr68_sgpr69_sgpr70_sgpr71_sgpr72_sgpr73_sgpr74_sgpr75_sgpr76_sgpr77_sgpr78_sgpr79_sgpr80_sgpr81_sgpr82_sgpr83_sgpr84_sgpr85_sgpr86_sgpr87_sgpr88_sgpr89_sgpr90_sgpr91_sgpr92_sgpr93_sgpr94_sgpr95, %stack.7, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32


	# CHECK-LABEL: name: check_reload

	# FLATSCR: $sgpr33 = S_MOV_B32 0
	# FLATSCR: $flat_scr_lo = S_ADD_U32 $sgpr0, $sgpr11, implicit-def $scc
	# FLATSCR: $flat_scr_hi = S_ADDC_U32 $sgpr1, 0, implicit-def $scc, implicit $scc

	# S32
	# CHECK: $sgpr12 = S_MOV_B32 $exec_lo
	# CHECK: $exec_lo = S_MOV_B32 1
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 4
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 4
	# CHECK: $exec_lo = S_MOV_B32 killed $sgpr12
	# CHECK: $sgpr12 = V_READLANE

	# S64
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 3
	# GCN64: $exec = S_MOV_B64 3
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 8
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 8
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE

	# S96
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 7
	# GCN64: $exec = S_MOV_B64 7
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 16
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 16
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE
	# CHECK: $sgpr14 = V_READLANE

	# S128
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 15
	# GCN64: $exec = S_MOV_B64 15
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 28
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 28
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE
	# CHECK: $sgpr14 = V_READLANE
	# CHECK: $sgpr15 = V_READLANE

	# S160
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 31
	# GCN64: $exec = S_MOV_B64 31
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 44
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 44
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE
	# CHECK: $sgpr14 = V_READLANE
	# CHECK: $sgpr15 = V_READLANE
	# CHECK: $sgpr16 = V_READLANE

	# S256
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 255
	# GCN64: $exec = S_MOV_B64 255
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 64
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 64
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE
	# CHECK: $sgpr14 = V_READLANE
	# CHECK: $sgpr15 = V_READLANE
	# CHECK: $sgpr16 = V_READLANE
	# CHECK: $sgpr17 = V_READLANE
	# CHECK: $sgpr18 = V_READLANE
	# CHECK: $sgpr19 = V_READLANE

	# S512
	# GCN32: $sgpr12 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr12_sgpr13 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 65535
	# GCN64: $exec = S_MOV_B64 65535
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 96
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 96
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr12
	# GCN64: $exec = S_MOV_B64 killed $sgpr12_sgpr13
	# CHECK: $sgpr12 = V_READLANE
	# CHECK: $sgpr13 = V_READLANE
	# CHECK: $sgpr14 = V_READLANE
	# CHECK: $sgpr15 = V_READLANE
	# CHECK: $sgpr16 = V_READLANE
	# CHECK: $sgpr17 = V_READLANE
	# CHECK: $sgpr18 = V_READLANE
	# CHECK: $sgpr19 = V_READLANE
	# CHECK: $sgpr20 = V_READLANE
	# CHECK: $sgpr21 = V_READLANE
	# CHECK: $sgpr22 = V_READLANE
	# CHECK: $sgpr23 = V_READLANE
	# CHECK: $sgpr24 = V_READLANE
	# CHECK: $sgpr25 = V_READLANE
	# CHECK: $sgpr26 = V_READLANE
	# CHECK: $sgpr27 = V_READLANE

	# S1024
	# GCN32: $sgpr64 = S_MOV_B32 $exec_lo
	# GCN64: $sgpr64_sgpr65 = S_MOV_B64 $exec
	# GCN32: $exec_lo = S_MOV_B32 4294967295
	# GCN64: $exec = S_MOV_B64 4294967295
	# MUBUF: BUFFER_LOAD_DWORD_OFFSET ${{(sgpr[0-9_]+)*}}, $sgpr33, 160
	# FLATSCR: SCRATCH_LOAD_DWORD_SADDR $sgpr33, 160
	# GCN32: $exec_lo = S_MOV_B32 killed $sgpr64
	# GCN64: $exec = S_MOV_B64 killed $sgpr64_sgpr65
	# CHECK: $sgpr64 = V_READLANE
	# CHECK: $sgpr65 = V_READLANE
	# CHECK: $sgpr66 = V_READLANE
	# CHECK: $sgpr67 = V_READLANE
	# CHECK: $sgpr68 = V_READLANE
	# CHECK: $sgpr69 = V_READLANE
	# CHECK: $sgpr70 = V_READLANE
	# CHECK: $sgpr71 = V_READLANE
	# CHECK: $sgpr72 = V_READLANE
	# CHECK: $sgpr73 = V_READLANE
	# CHECK: $sgpr74 = V_READLANE
	# CHECK: $sgpr75 = V_READLANE
	# CHECK: $sgpr76 = V_READLANE
	# CHECK: $sgpr77 = V_READLANE
	# CHECK: $sgpr78 = V_READLANE
	# CHECK: $sgpr79 = V_READLANE
	# CHECK: $sgpr80 = V_READLANE
	# CHECK: $sgpr81 = V_READLANE
	# CHECK: $sgpr82 = V_READLANE
	# CHECK: $sgpr83 = V_READLANE
	# CHECK: $sgpr84 = V_READLANE
	# CHECK: $sgpr85 = V_READLANE
	# CHECK: $sgpr86 = V_READLANE
	# CHECK: $sgpr87 = V_READLANE
	# CHECK: $sgpr88 = V_READLANE
	# CHECK: $sgpr89 = V_READLANE
	# CHECK: $sgpr90 = V_READLANE
	# CHECK: $sgpr91 = V_READLANE
	# CHECK: $sgpr92 = V_READLANE
	# CHECK: $sgpr93 = V_READLANE
	# CHECK: $sgpr94 = V_READLANE
	# CHECK: $sgpr95 = V_READLANE

	---
	name: check_reload
	tracksRegLiveness: true
	liveins:
	- { reg: '$sgpr4_sgpr5' }
	- { reg: '$sgpr6_sgpr7' }
	- { reg: '$sgpr8' }
	frameInfo:
	maxAlignment: 4
	stack:
	- { id: 0, type: spill-slot, size: 4, alignment: 4 }
	- { id: 1, type: spill-slot, size: 8, alignment: 4 }
	- { id: 2, type: spill-slot, size: 12, alignment: 4 }
	- { id: 3, type: spill-slot, size: 16, alignment: 4 }
	- { id: 4, type: spill-slot, size: 20, alignment: 4 }
	- { id: 5, type: spill-slot, size: 32, alignment: 4 }
	- { id: 6, type: spill-slot, size: 64, alignment: 4 }
	- { id: 7, type: spill-slot, size: 128, alignment: 4 }
	machineFunctionInfo:
	explicitKernArgSize: 660
	maxKernArgAlign: 4
	isEntryFunction: true
	waveLimiter: true
	scratchRSrcReg: '$sgpr96_sgpr97_sgpr98_sgpr99'
	stackPtrOffsetReg: '$sgpr32'
	frameOffsetReg: '$sgpr33'
	argumentInfo:
	flatScratchInit: { reg: '$sgpr0_sgpr1' }
	dispatchPtr: { reg: '$sgpr2_sgpr3' }
	privateSegmentBuffer: { reg: '$sgpr4_sgpr5_sgpr6_sgpr7' }
	kernargSegmentPtr: { reg: '$sgpr8_sgpr9' }
	workGroupIDX: { reg: '$sgpr10' }
	privateSegmentWaveByteOffset: { reg: '$sgpr11' }
	body: |
	bb.0:
	liveins: $sgpr8, $sgpr4_sgpr5, $sgpr6_sgpr7

	renamable $sgpr12 = SI_SPILL_S32_RESTORE %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13 = SI_SPILL_S64_RESTORE %stack.1, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14 = SI_SPILL_S96_RESTORE %stack.2, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15 = SI_SPILL_S128_RESTORE %stack.3, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16 = SI_SPILL_S160_RESTORE %stack.4, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19 = SI_SPILL_S256_RESTORE %stack.5, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr12_sgpr13_sgpr14_sgpr15_sgpr16_sgpr17_sgpr18_sgpr19_sgpr20_sgpr21_sgpr22_sgpr23_sgpr24_sgpr25_sgpr26_sgpr27 = SI_SPILL_S512_RESTORE %stack.6, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

	renamable $sgpr64_sgpr65_sgpr66_sgpr67_sgpr68_sgpr69_sgpr70_sgpr71_sgpr72_sgpr73_sgpr74_sgpr75_sgpr76_sgpr77_sgpr78_sgpr79_sgpr80_sgpr81_sgpr82_sgpr83_sgpr84_sgpr85_sgpr86_sgpr87_sgpr88_sgpr89_sgpr90_sgpr91_sgpr92_sgpr93_sgpr94_sgpr95 = SI_SPILL_S1024_RESTORE %stack.7, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32
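
The exec-mask immediates and scratch offsets checked in the deleted test above follow a simple pattern, which may help when reading the CHECK lines. The sketch below is only an illustration in Python: the helper names `lane_mask` and `slot_offsets` are hypothetical, the slot sizes are copied from the `stack:` entries, and the starting offset of 4 is read off the CHECK lines rather than derived from the frame layout.

```python
# Slot sizes in bytes, copied from the `stack:` entries of the deleted test.
SLOT_SIZES = [4, 8, 12, 16, 20, 32, 64, 128]


def lane_mask(size_bytes):
    """Exec mask with one bit set per 32-bit dword of the spilled SGPR tuple."""
    return (1 << (size_bytes // 4)) - 1


def slot_offsets(sizes, base=4):
    """Running byte offset of each slot.

    base=4 is an assumption taken from the first CHECK line
    (`$sgpr33, 4`), not something computed from the frame layout.
    """
    offsets = []
    offset = base
    for size in sizes:
        offsets.append(offset)
        offset += size
    return offsets


masks = [lane_mask(s) for s in SLOT_SIZES]
offsets = slot_offsets(SLOT_SIZES)

for size, mask, off in zip(SLOT_SIZES, masks, offsets):
    print(f"S{size * 8}: exec mask {mask}, offset {off}")
```

Under these assumptions, S128 spills four dwords through lanes 0 to 3, hence the mask 15, and its slot begins after the 4 + 8 + 12 bytes of the preceding slots plus the initial 4.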

llvm/test/CodeGen/AMDGPU/si-spill-sgpr-stack.ll

	; RUN: llc -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=ALL -check-prefix=SGPR %s			; RUN: llc -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=ALL -check-prefix=SGPR %s

	; Make sure this doesn't crash.			; Make sure this doesn't crash.
	; ALL-LABEL: {{^}}test:			; ALL-LABEL: {{^}}test:
	; ALL: s_mov_b32 s[[LO:[0-9]+]], SCRATCH_RSRC_DWORD0			; ALL: s_mov_b32 s[[LO:[0-9]+]], SCRATCH_RSRC_DWORD0
	; ALL: s_mov_b32 s[[HI:[0-9]+]], 0xe80000			; ALL: s_mov_b32 s[[HI:[0-9]+]], 0xe80000

	; Make sure we are handling hazards correctly.			; Make sure we are handling hazards correctly.
	; SGPR: buffer_load_dword [[VHI:v[0-9]+]], off, s[{{[0-9]+:[0-9]+}}], 0 offset:4			; SGPR: buffer_load_dword [[VHI:v[0-9]+]], off, s[{{[0-9]+:[0-9]+}}], 0 offset:16
	; SGPR-NEXT: s_mov_b64 exec, s[0:1]
	; SGPR-NEXT: s_waitcnt vmcnt(0)			; SGPR-NEXT: s_waitcnt vmcnt(0)
	; SGPR-NEXT: v_readlane_b32 s{{[0-9]+}}, [[VHI]], 0			; SGPR-NEXT: v_readfirstlane_b32 s[[HI:[0-9]+]], [[VHI]]
	; SGPR-NEXT: v_readlane_b32 s{{[0-9]+}}, [[VHI]], 1
	; SGPR-NEXT: v_readlane_b32 s{{[0-9]+}}, [[VHI]], 2
	; SGPR-NEXT: v_readlane_b32 s[[HI:[0-9]+]], [[VHI]], 3
	; SGPR-NEXT: s_nop 4			; SGPR-NEXT: s_nop 4
	; SGPR-NEXT: buffer_store_dword v0, off, s[0:[[HI]]{{\]}}, 0			; SGPR-NEXT: buffer_store_dword v0, off, s[0:[[HI]]{{\]}}, 0

	; ALL: s_endpgm			; ALL: s_endpgm
	define amdgpu_kernel void @test(i32 addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @test(i32 addrspace(1)* %out, i32 %in) {
	call void asm sideeffect "", "~{s[0:7]}" ()			call void asm sideeffect "", "~{s[0:7]}" ()
	call void asm sideeffect "", "~{s[8:15]}" ()			call void asm sideeffect "", "~{s[8:15]}" ()
	call void asm sideeffect "", "~{s[16:23]}" ()			call void asm sideeffect "", "~{s[16:23]}" ()

llvm/test/CodeGen/AMDGPU/spill-m0.ll

	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVGPR -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVGPR -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVGPR -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=1 -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVGPR -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVMEM -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVMEM -check-prefix=GCN %s
	; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVMEM -check-prefix=GCN %s			; RUN: llc -O0 -amdgpu-spill-sgpr-to-vgpr=0 -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=TOVMEM -check-prefix=GCN %s

	; XXX - Why does it like to use vcc?			; XXX - Why does it like to use vcc?

	; GCN-LABEL: {{^}}spill_m0:			; GCN-LABEL: {{^}}spill_m0:

	; GCN: #ASMSTART			; GCN: #ASMSTART
	; GCN-NEXT: s_mov_b32 m0, 0			; GCN-NEXT: s_mov_b32 m0, 0
	; GCN-NEXT: #ASMEND			; GCN-NEXT: #ASMEND
	; GCN-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0			; GCN-DAG: s_mov_b32 [[M0_COPY:s[0-9]+]], m0

	; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]], [[M0_LANE:[0-9]+]]			; TOVGPR: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]], [[M0_LANE:[0-9]+]]

	; TOVMEM: v_writelane_b32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]], 0			; TOVMEM: v_mov_b32_e32 [[SPILL_VREG:v[0-9]+]], [[M0_COPY]]
	; TOVMEM: s_mov_b32 [[COPY_EXEC_LO:s[0-9]+]], exec_lo
	; TOVMEM: s_mov_b32 exec_lo, 1
	; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:4 ; 4-byte Folded Spill			; TOVMEM: buffer_store_dword [[SPILL_VREG]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:4 ; 4-byte Folded Spill
	; TOVMEM: s_mov_b32 exec_lo, [[COPY_EXEC_LO]]

	; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]			; GCN: s_cbranch_scc1 [[ENDIF:BB[0-9]+_[0-9]+]]

	; GCN: [[ENDIF]]:			; GCN: [[ENDIF]]:
	; TOVGPR: v_readlane_b32 [[M0_RESTORE:s[0-9]+]], [[SPILL_VREG]], [[M0_LANE]]			; TOVGPR: v_readlane_b32 [[M0_RESTORE:s[0-9]+]], [[SPILL_VREG]], [[M0_LANE]]
	; TOVGPR: s_mov_b32 m0, [[M0_RESTORE]]			; TOVGPR: s_mov_b32 m0, [[M0_RESTORE]]

	; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:4 ; 4-byte Folded Reload			; TOVMEM: buffer_load_dword [[RELOAD_VREG:v[0-9]+]], off, s{{\[[0-9]+:[0-9]+\]}}, 0 offset:4 ; 4-byte Folded Reload
	; TOVMEM: s_waitcnt vmcnt(0)			; TOVMEM: s_waitcnt vmcnt(0)
	; TOVMEM: v_readlane_b32 [[M0_RESTORE:s[0-9]+]], [[RELOAD_VREG]], 0			; TOVMEM: v_readfirstlane_b32 [[M0_RESTORE:s[0-9]+]], [[RELOAD_VREG]]
	; TOVMEM: s_mov_b32 m0, [[M0_RESTORE]]			; TOVMEM: s_mov_b32 m0, [[M0_RESTORE]]

	; GCN: s_add_i32 s{{[0-9]+}}, m0, 1			; GCN: s_add_i32 s{{[0-9]+}}, m0, 1
	define amdgpu_kernel void @spill_m0(i32 %cond, i32 addrspace(1)* %out) #0 {			define amdgpu_kernel void @spill_m0(i32 %cond, i32 addrspace(1)* %out) #0 {
	entry:			entry:
	%m0 = call i32 asm sideeffect "s_mov_b32 m0, 0", "={m0}"() #0			%m0 = call i32 asm sideeffect "s_mov_b32 m0, 0", "={m0}"() #0
	%cmp0 = icmp eq i32 %cond, 0			%cmp0 = icmp eq i32 %cond, 0
	br i1 %cmp0, label %if, label %endif			br i1 %cmp0, label %if, label %endif
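
The two TOVMEM columns above differ in how the SGPR travels through the temporary VGPR: the left side pins the value to lane 0 with `v_writelane_b32`/`v_readlane_b32` and forces `exec_lo` to 1 around the memory access, while the right side broadcasts with `v_mov_b32` and reads back with `v_readfirstlane_b32` under the program's own exec mask. The following Python model is only a rough sketch of those lane semantics, assuming a 64-lane wave and per-lane scratch memory; the function names mirror the instruction mnemonics and are not an LLVM API, and the exec masks and values are made up for illustration.

```python
WAVE = 64  # assumed wave size for this model


def active(exec_mask, lane):
    return (exec_mask >> lane) & 1 == 1


def v_mov_b32(vgpr, value, exec_mask):
    # Writes the scalar value to every active lane.
    return [value if active(exec_mask, l) else vgpr[l] for l in range(WAVE)]


def v_writelane_b32(vgpr, value, lane):
    # Writes one lane regardless of exec.
    out = list(vgpr)
    out[lane] = value
    return out


def v_readlane_b32(vgpr, lane):
    # Reads one lane regardless of exec.
    return vgpr[lane]


def v_readfirstlane_b32(vgpr, exec_mask):
    # Reads the lowest active lane, or lane 0 if exec is zero.
    lane = (exec_mask & -exec_mask).bit_length() - 1 if exec_mask else 0
    return vgpr[lane]


def buffer_store(mem, vgpr, exec_mask):
    # Per-lane scratch: only active lanes write their slot.
    return [vgpr[l] if active(exec_mask, l) else mem[l] for l in range(WAVE)]


def buffer_load(vgpr, mem, exec_mask):
    # Only active lanes receive data; inactive lanes keep their old contents.
    return [mem[l] if active(exec_mask, l) else vgpr[l] for l in range(WAVE)]


value = 0x1234
spill_exec = 0b1100   # lanes 2 and 3 active at the spill point
reload_exec = 0b1000  # a subset of spill_exec at the reload point

# Right-hand column: broadcast, then readfirstlane under the program's exec.
tmp = v_mov_b32([0] * WAVE, value, spill_exec)
slot = buffer_store([0] * WAVE, tmp, spill_exec)
tmp = buffer_load([0] * WAVE, slot, reload_exec)
new_result = v_readfirstlane_b32(tmp, reload_exec)

# Left-hand column: lane 0 with exec forced to 1 around the memory access.
tmp = v_writelane_b32([0] * WAVE, value, 0)
slot = buffer_store([0] * WAVE, tmp, 1)
live = [111] * WAVE  # a VGPR that still holds unrelated per-lane data
loaded = buffer_load(live, slot, 1)
old_result = v_readlane_b32(loaded, 0)
lane0_overwritten = loaded[0] != live[0]
```

In this model both sequences recover the value, but the forced-exec sequence also overwrites lane 0 of the temporary VGPR even though lane 0 is inactive in the program's exec mask, which is the kind of write the summary is concerned about.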

llvm/test/CodeGen/AMDGPU/spill-special-sgpr.mir

bb.0:
; GFX9: $sgpr33 = S_MOV_B32 0		; GFX9: $sgpr33 = S_MOV_B32 0
; GFX9: $sgpr12 = S_MOV_B32 &SCRATCH_RSRC_DWORD0, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr12 = S_MOV_B32 &SCRATCH_RSRC_DWORD0, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $sgpr13 = S_MOV_B32 &SCRATCH_RSRC_DWORD1, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr13 = S_MOV_B32 &SCRATCH_RSRC_DWORD1, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $sgpr14 = S_MOV_B32 4294967295, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr14 = S_MOV_B32 4294967295, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $sgpr15 = S_MOV_B32 14680064, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr15 = S_MOV_B32 14680064, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $sgpr12 = S_ADD_U32 $sgpr12, $sgpr9, implicit-def $scc, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr12 = S_ADD_U32 $sgpr12, $sgpr9, implicit-def $scc, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $sgpr13 = S_ADDC_U32 $sgpr13, 0, implicit-def $scc, implicit $scc, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15		; GFX9: $sgpr13 = S_ADDC_U32 $sgpr13, 0, implicit-def $scc, implicit $scc, implicit-def $sgpr12_sgpr13_sgpr14_sgpr15
; GFX9: $vcc = IMPLICIT_DEF		; GFX9: $vcc = IMPLICIT_DEF
; GFX9: $vgpr0 = V_WRITELANE_B32 $vcc_lo, 0, undef $vgpr0, implicit $vcc		; GFX9: $vgpr0 = V_MOV_B32_e32 $vcc_lo, implicit $exec, implicit $vcc
; GFX9: $vgpr0 = V_WRITELANE_B32 $vcc_hi, 1, $vgpr0, implicit $vcc		; GFX9: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX9: $vcc = S_MOV_B64 $exec		; GFX9: $vgpr0 = V_MOV_B32_e32 $vcc_hi, implicit $exec, implicit $vcc
; GFX9: $exec = S_MOV_B64 3		; GFX9: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0 + 4, addrspace 5)
; GFX9: BUFFER_STORE_DWORD_OFFSET $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX9: $exec = S_MOV_B64 $vcc
; GFX9: $vcc_hi = V_READLANE_B32 $vgpr0, 1
; GFX9: $vcc_lo = V_READLANE_B32 killed $vgpr0, 0
; GFX9: $vcc = IMPLICIT_DEF		; GFX9: $vcc = IMPLICIT_DEF
; GFX9: $vgpr0 = V_WRITELANE_B32 $vcc_lo, 0, undef $vgpr0, implicit $vcc		; GFX9: $vgpr0 = V_MOV_B32_e32 $vcc_lo, implicit $exec, implicit $vcc
; GFX9: $vgpr0 = V_WRITELANE_B32 $vcc_hi, 1, $vgpr0, implicit killed $vcc
; GFX9: $vcc = S_MOV_B64 $exec
; GFX9: $exec = S_MOV_B64 3
; GFX9: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)		; GFX9: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX9: $exec = S_MOV_B64 killed $vcc		; GFX9: $vgpr0 = V_MOV_B32_e32 $vcc_hi, implicit $exec, implicit killed $vcc
; GFX9: $vcc = S_MOV_B64 $exec		; GFX9: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0 + 4, addrspace 5)
; GFX9: $exec = S_MOV_B64 3
; GFX9: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0, addrspace 5)		; GFX9: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0, addrspace 5)
; GFX9: $exec = S_MOV_B64 killed $vcc		; GFX9: $vcc_lo = V_READFIRSTLANE_B32 killed $vgpr0, implicit $exec, implicit-def $vcc
; GFX9: $vcc_lo = V_READLANE_B32 $vgpr0, 0, implicit-def $vcc		; GFX9: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr12_sgpr13_sgpr14_sgpr15, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0 + 4, addrspace 5)
; GFX9: $vcc_hi = V_READLANE_B32 killed $vgpr0, 1		; GFX9: $vcc_hi = V_READFIRSTLANE_B32 killed $vgpr0, implicit $exec, implicit-def $vcc
; GFX10-LABEL: name: check_vcc		; GFX10-LABEL: name: check_vcc
; GFX10: liveins: $sgpr8, $sgpr4_sgpr5, $sgpr6_sgpr7, $sgpr9		; GFX10: liveins: $sgpr8, $sgpr4_sgpr5, $sgpr6_sgpr7, $sgpr9
; GFX10: $sgpr33 = S_MOV_B32 0		; GFX10: $sgpr33 = S_MOV_B32 0
; GFX10: $sgpr96 = S_MOV_B32 &SCRATCH_RSRC_DWORD0, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr96 = S_MOV_B32 &SCRATCH_RSRC_DWORD0, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $sgpr97 = S_MOV_B32 &SCRATCH_RSRC_DWORD1, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr97 = S_MOV_B32 &SCRATCH_RSRC_DWORD1, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $sgpr98 = S_MOV_B32 4294967295, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr98 = S_MOV_B32 4294967295, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $sgpr99 = S_MOV_B32 836853760, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr99 = S_MOV_B32 836853760, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $sgpr96 = S_ADD_U32 $sgpr96, $sgpr9, implicit-def $scc, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr96 = S_ADD_U32 $sgpr96, $sgpr9, implicit-def $scc, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $sgpr97 = S_ADDC_U32 $sgpr97, 0, implicit-def $scc, implicit $scc, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99		; GFX10: $sgpr97 = S_ADDC_U32 $sgpr97, 0, implicit-def $scc, implicit $scc, implicit-def $sgpr96_sgpr97_sgpr98_sgpr99
; GFX10: $vcc = IMPLICIT_DEF		; GFX10: $vcc = IMPLICIT_DEF
; GFX10: $vgpr0 = V_WRITELANE_B32 $vcc_lo, 0, undef $vgpr0, implicit $vcc		; GFX10: $vgpr0 = V_MOV_B32_e32 $vcc_lo, implicit $exec, implicit $vcc
; GFX10: $vgpr0 = V_WRITELANE_B32 $vcc_hi, 1, $vgpr0, implicit $vcc		; GFX10: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX10: $vcc = S_MOV_B64 $exec		; GFX10: $vgpr0 = V_MOV_B32_e32 $vcc_hi, implicit $exec, implicit $vcc
; GFX10: $exec = S_MOV_B64 3		; GFX10: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0 + 4, addrspace 5)
; GFX10: BUFFER_STORE_DWORD_OFFSET $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX10: $exec = S_MOV_B64 $vcc
; GFX10: $vcc_hi = V_READLANE_B32 $vgpr0, 1
; GFX10: $vcc_lo = V_READLANE_B32 killed $vgpr0, 0
; GFX10: $vcc = IMPLICIT_DEF		; GFX10: $vcc = IMPLICIT_DEF
; GFX10: $vgpr0 = V_WRITELANE_B32 $vcc_lo, 0, undef $vgpr0, implicit $vcc		; GFX10: $vgpr0 = V_MOV_B32_e32 $vcc_lo, implicit $exec, implicit $vcc
; GFX10: $vgpr0 = V_WRITELANE_B32 $vcc_hi, 1, $vgpr0, implicit killed $vcc
; GFX10: $vcc = S_MOV_B64 $exec
; GFX10: $exec = S_MOV_B64 3
; GFX10: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)		; GFX10: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0, addrspace 5)
; GFX10: $exec = S_MOV_B64 killed $vcc		; GFX10: $vgpr0 = V_MOV_B32_e32 $vcc_hi, implicit $exec, implicit killed $vcc
; GFX10: $vcc = S_MOV_B64 $exec		; GFX10: BUFFER_STORE_DWORD_OFFSET killed $vgpr0, $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into %stack.0 + 4, addrspace 5)
; GFX10: $exec = S_MOV_B64 3
; GFX10: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0, addrspace 5)		; GFX10: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 4, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0, addrspace 5)
; GFX10: $exec = S_MOV_B64 killed $vcc		; GFX10: $vcc_lo = V_READFIRSTLANE_B32 killed $vgpr0, implicit $exec, implicit-def $vcc
; GFX10: $vcc_lo = V_READLANE_B32 $vgpr0, 0, implicit-def $vcc		; GFX10: $vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr96_sgpr97_sgpr98_sgpr99, $sgpr33, 8, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0 + 4, addrspace 5)
; GFX10: $vcc_hi = V_READLANE_B32 killed $vgpr0, 1		; GFX10: $vcc_hi = V_READFIRSTLANE_B32 killed $vgpr0, implicit $exec, implicit-def $vcc
$vcc = IMPLICIT_DEF		$vcc = IMPLICIT_DEF
SI_SPILL_S64_SAVE $vcc, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32		SI_SPILL_S64_SAVE $vcc, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

$vcc = IMPLICIT_DEF		$vcc = IMPLICIT_DEF
SI_SPILL_S64_SAVE killed $vcc, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32		SI_SPILL_S64_SAVE killed $vcc, %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32

$vcc = SI_SPILL_S64_RESTORE %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32		$vcc = SI_SPILL_S64_RESTORE %stack.0, implicit $exec, implicit $sgpr96_sgpr97_sgpr98_sgpr99, implicit $sgpr32
...		...