This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Revert agnostic SGPR spill.
AbandonedPublic

Authored by hliao on Feb 18 2021, 11:04 AM.

Details

Summary
  • With that explicit exec mask manipulation, we may clobber global VGPRs during SGPR spilling or reloading. For instance, the following pseudo code illustrates such clobbering:
v[62:63] = def(...);
spill(v[62:63], stack.0);
for (loop.cond) {
  ... = use(v[62:63]);
  if (branch.cond) {
    ... reuse of v[62:63];
    ... SGPR reload through v62 or v63;
    v[62:63] = reload(stack.0);
    // At this point, v[62:63] is clobbered if branch.cond doesn't
    // cover lanes 0 and 1.
  }
}
  • Regarding the concerns in the original patch, we should not worry about the different exec masks between SGPR spills and reloads. As the IR is de-SSA-ed from the original SSA form, we guarantee that a def always dominates all its uses, and thus the point where a value is spilled always dominates the point where that value is reloaded. The exec mask at the reload point is therefore guaranteed to be a subset of the exec mask at the spill point. As long as the SGPR is broadcast to a VGPR at the spill point and v_readfirstlane is used to load the SGPR back from the VGPR at the reload point, the original SGPR value is always reloaded regardless of that exec mask difference.
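The spill/reload mechanism argued for above can be sketched as follows (illustrative pseudo-assembly; the register numbers and lane masks are placeholders, not the exact code the compiler emits):

```
; spill point, exec = 0b1111: broadcast s0 into every active lane of v40
v_mov_b32 v40, s0
; ... divergent control flow; exec shrinks to a subset, e.g. 0b0011 ...
; reload point: read the first active lane back into the SGPR
v_readfirstlane_b32 s0, v40
; Because the reload exec mask is a subset of the spill exec mask,
; the first active lane here was also active at the spill, so the
; lane of v40 being read holds the broadcast value of s0.
```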

Diff Detail

Event Timeline

hliao created this revision.Feb 18 2021, 11:04 AM
hliao requested review of this revision.Feb 18 2021, 11:04 AM
Herald added a project: Restricted Project.Feb 18 2021, 11:04 AM
hliao edited the summary of this revision. (Show Details)Feb 18 2021, 11:11 AM
hliao added a comment.Feb 18 2021, 3:59 PM

Conflicts with D96336/D96517

I just had a brief look at those two changes. They seem to always manipulate the exec mask explicitly and should have the same issue of clobbering global VGPRs. We should not change the current exec mask if we need a temp VGPR during SGPR spilling; it's quite risky that a global VGPR gets clobbered.

hliao added a comment.EditedFeb 18 2021, 4:28 PM

Conflicts with D96336/D96517

I just had a detailed look at those two. Shall we optimize the case where only 1 or 2 SGPRs need spilling/reloading and a scavenged register is available? For that case, we could just use the original implementation (based on broadcast and v_readfirstlane). In terms of loads/stores, the original is comparable to the proposed one, but we have much less code and also remove the dependency of the restore on the store of the whole temp VGPR. HPC workloads mostly spill 1 or 2 SGPRs.

foad added a comment.Feb 19 2021, 12:17 AM

HPC workloads mostly spill 1 or 2 SGPRs.

Can you explain this a bit more? Spilling SGPRs to memory is supposed to be a very rare case. Why is it common for HPC workloads? Is there a better way to fix it?

Is it at all possible to encapsulate your failure case in a lit test?

Is the problem not with the exec manipulation, but really that the register scavenger chooses an inappropriate register w.r.t. the wave-level CFG?

hliao added a comment.Feb 19 2021, 8:12 AM

Is it at all possible to encapsulate your failure case in a lit test?

Is the problem not with the exec manipulation, but really that the register scavenger chooses an inappropriate register w.r.t. the wave-level CFG?

The register scavenger could be improved to pick a better candidate without live values in the inactive lanes. But we should handle the cases where such a candidate is not available at all. The issue is exactly the explicit exec mask manipulation, which doesn't honor the current exec mask and tries to access the inactive lanes. Without inactive lanes being protected, we will always have this issue.

hliao added a comment.Feb 19 2021, 8:13 AM

HPC workloads mostly spill 1 or 2 SGPRs.

Can you explain this a bit more? Spilling SGPRs to memory is supposed to be a very rare case. Why is it common for HPC workloads? Is there a better way to fix it?

Pointers are used extensively in most HPC workloads, and double is also frequently used. There are lots of uniform 64-bit values, either pointers or doubles.

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.
exec=0 is a corner case, but I don’t think we can build on that.

Other than that, it surely is more efficient than the (worst case) 4 memory operations introduced by saving all lanes of the used VGPR :)
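Concretely, the exec = 0 failure mode described above would look like this (illustrative sketch; register numbers are placeholders):

```
; exec = 0: no lanes are active
v_mov_b32 v40, s0            ; no-op: no lane of v40 is written
; ...
v_readfirstlane_b32 s0, v40  ; v_readfirstlane ignores exec and reads
                             ; lane 0, which holds an unknown value
```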

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.
exec=0 is a corner case, but I don’t think we can build on that.

Other than that, it surely is more efficient than the (worst case) 4 memory operations introduced by saving all lanes of the used VGPR :)

That's the part I don't understand. Why is that code path still executed when the exec mask is 0? For regular code generated by the compiler, an exec mask of 0 always results in branching away from that code path; there's no chance to execute it at all. Could you elaborate on how a code path could be executed with exec mask 0?

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

That's the part I don't understand. Why is that code path still executed when the exec mask is 0? For regular code generated by the compiler, an exec mask of 0 always results in branching away from that code path; there's no chance to execute it at all. Could you elaborate on how a code path could be executed with exec mask 0?

SIRemoveShortExecBranches.cpp is one source of executing instructions when exec == 0. I am not sure if there are any others.

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I realized SIRemoveShortExecBranches.cpp has done correctness checks. So removing SIInsertSkips should work if we make sure there will always be a branching instruction for each control flow change.

hliao added a comment.Feb 22 2021, 7:56 AM

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

I realized SIRemoveShortExecBranches.cpp has done correctness checks. So removing SIInsertSkips should work if we make sure there will always be a branching instruction for each control flow change.

When the exec mask goes to zero, an SGPR spill on that code path has no effect at all. As our IR is de-SSA-ed from SSA form, a def always dominates all its uses, so an SGPR spill always dominates its reloads. If that SGPR spill executes with exec mask 0, all its reloads execute with exec mask 0 as well. It's simply OK to ignore that spill since, semantically, there is no change in the program state.

hliao added a comment.Feb 22 2021, 9:04 AM

I think this approach fails when exec is zero.
The v_mov for the save will be a no-op, the v_readfirstlane for the restore will read lane 0, which contains some unknown value.

For exec == 0 when reloading, I think the basic block that contains v_readfirstlane will be jumped over, see SIInsertSkips.cpp and hasUnwantedEffectsWhenEXECEmpty()

I'm trying to eliminate SIInsertSkips. Initially, all branches that go over exec changes should insert the skip jump. We then should eliminate them in cases where they aren't needed and the blocks are short.

For that elimination, we could choose to skip it if the body of the branch has unwanted effects.

hliao updated this revision to Diff 325557.Feb 22 2021, 1:43 PM

Mark SI_SPILL_<n>_RESTORE having unwanted effect so that they would be executed under exec mask 0.

Mark SI_SPILL_<n>_RESTORE having unwanted effect so that they would be executed under exec mask 0.

I assume you mean "[...] so they would not be executed under exec mask 0"?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3316 ↗(On Diff #325557)

This seems reasonable.
It avoids the case where an EXEC=0 region attempts to restore an SGPR for immediate use in the same region.
I guess we are assuring ourselves that semantically an EXEC=0 region never restores an SGPR expecting it to be live beyond that region?
(And that EXEC=0 regions will never save SGPRs expecting them to have meaningful effects elsewhere.)

ruiling added inline comments.Feb 22 2021, 5:18 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3302–3315 ↗(On Diff #325557)

I don't think we need this; these instructions have already been lowered to v_readfirstlane by the time we try to optimize away the skip-jump.

critson added inline comments.Feb 22 2021, 6:58 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3302–3315 ↗(On Diff #325557)

True, if these have already been lowered, then this code will have no additional effect.

hliao added inline comments.Feb 23 2021, 9:33 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3316 ↗(On Diff #325557)

This seems reasonable.
It avoids the case where an EXEC=0 region attempts to restore an SGPR for immediate use in the same region.
I guess we are assuring ourselves that semantically an EXEC=0 region never restores an SGPR expecting it to be live beyond that region?
(And that EXEC=0 regions will never save SGPRs expecting them to have meaningful effects elsewhere.)

So, technically, it means we won't have the EXEC = 0 case at all, right?

hliao updated this revision to Diff 326077.Feb 24 2021, 6:59 AM

Remove that unnecessary change and add the rationale for why this is safe with respect to the original concerns.

hliao added a comment.Feb 24 2021, 7:03 AM

In addition, we already heavily use v_readfirstlane in our codegen, due to patterns that benefit from using vector instructions when no corresponding scalar instructions are available. I believe it's quite safe, as it's guaranteed that v_readfirstlane won't be executed when the exec mask goes to 0.

Fine with me. It would be nice if someone knowledgeable of LLVM can say whether relying on exec != 0 in MIR works or not.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1233–1234

Shouldn’t this still define SuperReg on the first v_readfirstlane, like before?

critson accepted this revision.Feb 24 2021, 5:04 PM

LGTM - but please address the SuperReg definition issue raised by Sebastian.

I will test this for graphics and we can address any issues arising in a follow up patch.
(I do not expect any.)

This revision is now accepted and ready to land.Feb 24 2021, 5:04 PM

Fine with me. It would be nice if someone knowledgeable of LLVM can say whether relying on exec != 0 in MIR works or not.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

Unfortunately I realized and have verified at least one problem with this and WQM.
With WQM the assumption that the EXEC mask for restore of the SGPR is a subset of the spill EXEC mask is not true.
Specifically an SGPR can be saved before entering WQM, then restored in WQM (so the readfirstlane will return junk).
This could potentially be addressed by ensuring the exec mask mode matches between spills and restores in the WQM pass.
(I can add this to WQM pass if needed, but will think a bit more on it first.)
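For reference, the WQM hazard described above can be sketched as (illustrative; the register number and lane masks are example placeholders):

```
; exact mode, exec = 0xF0: the spill broadcasts s0 only to lanes 4-7
v_mov_b32 v40, s0
; enter whole-quad mode: exec grows to a superset, e.g. 0xFF
s_wqm_b64 exec, exec
; ...
v_readfirstlane_b32 s0, v40  ; the first active lane is now lane 0,
                             ; which was inactive at the spill and was
                             ; never written; this returns junk
```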

Note: this issue is not necessarily an immediate problem for graphics, because we try to avoid all spills (particularly SGPR spills to memory).
I am explicitly testing with SGPR-to-VGPR spills disabled and lowered SGPR counts to provoke the problem.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

Unfortunately I realized and have verified at least one problem with this and WQM.
With WQM the assumption that the EXEC mask for restore of the SGPR is a subset of the spill EXEC mask is not true.
Specifically an SGPR can be saved before entering WQM, then restored in WQM (so the readfirstlane will return junk).

Yes, that sounds like a serious problem. But the issue may not be specific to this scenario. We sometimes have to broadcast a uniform value into a VGPR because we only have the V_xxx instruction instead of an S_xxx instruction, and later we may use v_readfirstlane to read the value back into an SGPR for later scalar operations. If one operation happens before WQM while the other happens in WQM, things may also go wrong. I am not quite sure whether this happens in a real-world case, but it sounds possible to me.

This could potentially be addressed by ensuring the exec mask mode matches between spills and restores in the WQM pass.
(I can add this to WQM pass if needed, but will think a bit more on it first.)

Note: this issue is not necessarily an immediate problem for graphics, because we try to avoid all spills (particularly SGPR spills to memory).

I want to say that the issue this patch tries to solve is a blocking issue for us. That's why @sebastian-ne is also actively working to solve it.

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

Could you explain a little bit more? Do you mean a transformation that may merge/duplicate blocks so that the predecessors change? What kind of verification do you think is needed to catch the problem you have in mind?

(I get the argument that IR does not run code if exec = 0, however, MIR models the hardware rather than a high-level language, and exec = 0 is perfectly fine there and even required in some cases, like for the last null export inserted in SIInsertSkips.)

I'm not convinced this always works. It's possible some transformation ends up violating this. We need to track both sets of predecessors and add some verification for this

What do you mean by "violating this"? Do you mean some transformation may fail to keep a jump on EXEC = 0 for each divergent branch?

Yes. The MIR doesn't track divergent predecessors and we don't have any verification for this

I think the EXEC = 0 problem can only happen when there is a conditional branch, which can reduce the set of active lanes. An unconditional branch will never be a point that introduces an EXEC = 0 issue. I think MIR transformations should ensure the branch is always there for conditional branching; transformations are only allowed to delete unconditional branches while keeping the semantics of the program unchanged. I don't see which step could possibly go wrong.

I also want to say that the possible EXEC = 0 problem (if any) is not specific to this spill solution. If there is any such situation, we definitely should fix it elsewhere. A normal program that includes scalar instructions with side effects under exec = 0 would also have this problem.

As far as I can see, the current state of this is

  1. Carl found a bug when WQM is entered between save and restore
  2. Matt and Stas have concerns about the correctness and want more verification

David claimed that we can handle the first issue (WQM) downstream (although I don’t know how he intends to do this, just hoping this never happens in practice?).
The second issue (more verification) seems to be stuck over the last weeks.

As a faster solution, can we submit D96336 to fix the correctness issue until the remaining issues here are solved?
I can revert D96336 once this patch is ready to go in.

I think I found a way exec can become zero before a restore (disclaimer: I'm not sure whether that is valid code in any of the graphics APIs or whether something similar can happen in compute. It should be valid LLVM IR, however).
Imagine the following pseudocode:

function main() {
  // exec = 0xff
  if (<divergent condition>) {
    // exec = 0xf0
    foo();
  }
  // continue doing things
}

function foo() {
  // exec = 0xf0
  <spill s0>

  // Kills all currently active lanes
  // However, more lanes are active outside the call, so we can’t s_endpgm
  llvm.amdgcn.kill(false);

  // exec = 0x00
  // We still need to restore s0 (if it is a callee-save register)
  <restore s0>
}

I think I found a way exec can become zero before a restore (disclaimer: I'm not sure whether that is valid code in any of the graphics APIs or whether something similar can happen in compute. It should be valid LLVM IR, however).
... code example omitted...

I agree this could happen if we use function calls in pixel shaders (PS); however, I don't think we have any plans to do this yet?
(Currently the only expected place for a kill is in PS, although we do support it elsewhere.)

I think I found a way exec can become zero before a restore (disclaimer: I'm not sure whether that is valid code in any of the graphics APIs or whether something similar can happen in compute. It should be valid LLVM IR, however).
Imagine the following pseudocode:

function main() {
  // exec = 0xff
  if (<divergent condition>) {
    // exec = 0xf0
    foo();
  }
  // continue doing things
}

function foo() {
  // exec = 0xf0
  <spill s0>

  // Kills all currently active lanes
  // However, more lanes are active outside the call, so we can’t s_endpgm
  llvm.amdgcn.kill(false);

  // exec = 0x00
  // We still need to restore s0 (if it is a callee-save register)
  <restore s0>
}

We don't use kills on the compute side, and I personally don't believe it can be legal under any high-level language model to do so. There are things like throw and even abort, but those lead to an immediate control transfer and are backed by the HW.

hliao abandoned this revision.May 25 2021, 11:36 AM
llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll