This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Add a pass to avoid jumping into blocks with a 0 exec mask.
Abandoned · Public

Authored by hliao on Mar 29 2021, 7:38 AM.

Details

Summary
  • For blocks where the exec mask is restored from a reloaded mask, a zero exec mask results in undefined behavior because the SGPR reload uses v_readfirstlane. Avoid such cases by transforming s_cbranch_execz and s_cbranch_execnz into equivalent branches that do not evaluate the exec mask too eagerly.

Diff Detail

Event Timeline

hliao created this revision.Mar 29 2021, 7:38 AM
hliao requested review of this revision.Mar 29 2021, 7:38 AM
Herald added a project: Restricted Project.Mar 29 2021, 7:38 AM
foad added inline comments.Mar 29 2021, 7:46 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
19

I don't know what you mean by "relax" here.

This comment doesn't explain why you'd want to do this transformation.

hliao added a comment.EditedMar 29 2021, 7:52 AM

This is the companion fix for D96980, which explains why the exec-mask-agnostic SGPR spill/reload was proposed: SGPR spill/reload may be executed when the exec mask goes to zero.
We do try to skip executing code when the exec mask goes to zero by branching on EXECZ (to the target block) or EXECNZ (to the fall-through block), but we may still run instructions with a zero exec mask. That's usually not an issue, as the exec mask is restored immediately at the top of the target block, so the code following that restoration won't be executed with a 0 mask. However, if the restoration needs to reload a spilled exec mask, the SGPR reload itself runs with a 0 mask, and v_readfirstlane has undefined behavior when the exec mask is zero.
This patch tries to mitigate that case by not evaluating (and thus possibly clearing) the exec mask so early when the branch target restores the mask from an SGPR reload. Instead of checking EXECZ or EXECNZ, the exec mask evaluation is duplicated with a temporary SGPR as the destination (without updating the exec mask directly); checking SCC0 on that result is equivalent to checking EXECZ. The exec mask itself is only updated when the result is known to be non-zero. For instance,

    $exec = MASK_EVALUATION()
    s_cbranch_execz TARGET
    ...
TARGET:
    $mask = SGPR_RELOAD(SLOT)
    $exec = OR $exec, $exec, $mask

is translated into

    $tmp = MASK_EVALUATION()
    s_cbranch_scc0 TARGET
    $exec = MASK_EVALUATION()
    ...
TARGET:
    $mask = SGPR_RELOAD(SLOT)
    $exec = OR $exec, $exec, $mask

Note that this transformation is only applied when the mask restoration needs an SGPR reload. Since the mask restoration happens at a CFG merge point, the predecessor block's exec mask is always a subset of the mask to be restored, so it is safe to keep using that exec mask until the new one is evaluated.
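
For concreteness, here is a rough wave64 sketch of the before/after shape using real instructions (the choice of s_and_saveexec_b64 as the mask evaluation, vcc as the condition, and the registers s[0:1]/s[2:3] are made up for illustration):

    ; current lowering: exec is narrowed first, so TARGET may be entered with exec == 0
    s_and_saveexec_b64 s[0:1], vcc        ; save old exec in s[0:1], exec = exec & vcc
    s_cbranch_execz    TARGET

    ; proposed lowering: probe the would-be mask into a scratch pair first
    s_and_b64          s[2:3], exec, vcc  ; same evaluation, but only SCC and the temp are written
    s_cbranch_scc0     TARGET             ; SCC == 0 here is exactly the old EXECZ condition; on this
                                          ; path exec keeps its old (wider) value, which is a subset
                                          ; of the mask restored at TARGET, as noted above
    s_and_saveexec_b64 s[0:1], vcc        ; exec is only narrowed once it is known to be non-zero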

hliao added inline comments.Mar 29 2021, 7:54 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
19

I'll just add a comment on the motivation of this pass here and put the details in the source code.

foad added inline comments.Mar 29 2021, 9:26 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

What happens if there are no free sgprs so the scavenger has to spill something? That sounds like yet another case that won't work correctly when exec is zero.

foad added a comment.Mar 29 2021, 9:37 AM

In general I am uncomfortable about generating code that does not work (i.e. expanding spills the way we do when exec might be 0) and then running yet another pass for correctness to fix it up later. Is there a way this can be made correct by default, and if necessary run an extra pass that optimizes it for efficiency?

Anyway I am not an expert in this area. I am happy to be overruled by people who know more about it.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1231

I don't think you can put anything that inserts extra instructions after BranchRelaxation. Putting it after the hazard recognizer might be risky too. Am I right in thinking that you just want to run this after spills are lowered to real code (prologue-epilogue insertion?)?

hliao added inline comments.Mar 29 2021, 9:56 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

Yeah, that's true. If we had a null-like register pair, we could also avoid searching for that temporary SGPR. We have a null register definition, but we need an SGPR pair for the wave64 case. As we only duplicate the evaluation to update SCC, a null register pair would be sufficient.

rampitec added inline comments.Mar 29 2021, 9:58 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

NULL works for a 64-bit pair too, although it is not available on every target.

hliao added inline comments.Mar 29 2021, 9:59 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1231

Branch relaxation may convert an EXECZ branch into an EXECNZ branch or vice versa. Running this pass after branch relaxation was simply a choice made so that it runs after the compiler has made its final decision on which branch to use. I am open to other places to run this pass to relax the branches.

hliao added inline comments.Mar 29 2021, 10:03 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

Could you elaborate more? SGPR_NULL is currently defined as a 32-bit SGPR at offset 125. For a 64-bit SGPR pair, we need a pair at an *even* offset according to the ISA document.

rampitec added inline comments.Mar 29 2021, 10:23 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

It is not a real register, it is just a way to encode 0. It is even free in terms of the constant bus usage. It can be used as 64 bit too:

llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 's_mov_b64 s[0:1], null'
        .text
        s_mov_b64 s[0:1], null                  ; encoding: [0x7d,0x04,0x80,0xbe]

Not sure if you would need to fix something in the verifier.

But again, this is not a universal solution, it is gfx10 only.

hliao added inline comments.Mar 29 2021, 10:30 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

It is not a real register, it is just a way to encode 0. It is even free in terms of the constant bus usage. It can be used as 64 bit too:

llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 's_mov_b64 s[0:1], null'
        .text
        s_mov_b64 s[0:1], null                  ; encoding: [0x7d,0x04,0x80,0xbe]

Not sure if you would need to fix something in the verifier.

But again, this is not a universal solution, it is gfx10 only.

That explains why I cannot find that usage in the Vega ISA document. From the scalar operand encoding map, it seems we have several reserved slots, such as 209-234.

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

What's the bug you refer to? I am not aware of any issue in CFG lowering, except that the direct uses of EXECZ/EXECNZ concern me. If the target or fall-through block of those EXECZ/EXECNZ branches restores the mask immediately, that's fine. But we run into cases where the mask to be restored needs to be reloaded first.

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

What's the bug you refer to? I am not aware of any issue in CFG lowering, except that the direct uses of EXECZ/EXECNZ concern me. If the target or fall-through block of those EXECZ/EXECNZ branches restores the mask immediately, that's fine. But we run into cases where the mask to be restored needs to be reloaded first.

Instead of having a fixup patch to avoid cases where this happens, we should have the infrastructure to stop this from happening in the first place

Instead of having a fixup patch to avoid cases where this happens, we should have the infrastructure to stop this from happening in the first place

This is my position as well.

llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
201

Presumably this needs to depend on IsWave32.

221

This also needs to depend on IsWave32.

For blocks where the exec mask is restored from a reloaded mask, a zero exec mask results in undefined behavior because the SGPR reload uses v_readfirstlane

Can we mark the SGPRs holding the masks as unspillable during LowerControlFlow to fix your problem? After we split SGPR/VGPR allocation, this problem would disappear.

hliao added a comment.Mar 30 2021, 6:11 AM

It seems to me that we may need to revise CFG lowering so that it avoids updating EXEC directly, deciding based on whether the mask to be restored needs reloading or not. Here's the brief idea:

  • Instead of lowering the CFG early, before RA, lower it after RA. As a byproduct, this also removes the need for the "terminator" versions of the exec mask manipulation instructions.
  • When the CFG is being lowered, update EXEC eagerly if the merge point doesn't need to reload the mask; otherwise, translate the branch as we currently do.

Any suggestions and comments?

It seems to me that we may need to revise CFG lowering so that it avoids updating EXEC directly, deciding based on whether the mask to be restored needs reloading or not. Here's the brief idea:

  • Instead of lowering the CFG early, before RA, lower it after RA. As a byproduct, this also removes the need for the "terminator" versions of the exec mask manipulation instructions.
  • When the CFG is being lowered, update EXEC eagerly if the merge point doesn't need to reload the mask; otherwise, translate the branch as we currently do.

Any suggestions and comments?

I think this requires a lot more thought. I think we need deeper IR changes. I believe MachineBasicBlock needs to start tracking both uniform and divergent predecessors/successors, but I'm not sure what this ends up looking like. I think the terminators should still reflect the hardware instructions, and the exec issues would be tracked through the divergent pred/succ.

nhaehnle requested changes to this revision.Apr 1 2021, 3:12 PM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.
  2. Load directly from memory using an SMEM load instruction.

Both types of reload should work just fine with exec=0.

Keeping a lane mask in a VGPR is fundamentally a nonsensical thing to do because it clashes with the whole theory of how different types of data (uniform vs. divergent) are represented in AMDGPU's implementation of SIMT. So I'd really rather we fix that instead of adding yet another hack onto the existing pile of hacks. At the very least, we need to understand this better.

This revision now requires changes to proceed.Apr 1 2021, 3:12 PM
hliao added a comment.Apr 3 2021, 8:16 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly. Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR. As a result, we need V_READFIRSTLANE when the SGPR is reloaded, and an exec mask of 0 makes that V_READFIRSTLANE undefined, so we need to ensure a proper exec mask is in place.
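
To make the difference concrete, here is a rough sketch of the two spill styles (v40, s4, and lane 3 are made-up names for illustration; the actual spill lowering may differ in detail):

    ; (1) fixed-lane spill/reload: ignores exec, but clobbers lane 3 of v40, which may hold a
    ;     live value of an inactive lane unless v40 itself is saved/restored
    v_writelane_b32     v40, s4, 3      ; spill s4 into lane 3 of v40
    v_readlane_b32      s4, v40, 3      ; reload from lane 3; works even with exec == 0

    ; (2) active-lane spill/reload (the scheme discussed here): only the active lanes are written
    v_mov_b32           v40, s4         ; spill: broadcast s4 into the active lanes of v40
    v_readfirstlane_b32 s4, v40         ; reload: read the lowest active lane; undefined if exec == 0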

  2. Load directly from memory using an SMEM load instruction.

Both types of reload should work just fine with exec=0.

Keeping a lane mask in a VGPR is fundamentally a nonsensical thing to do because it clashes with the whole theory of how different types of data (uniform vs. divergent) are represented in AMDGPU's implementation of SIMT. So I'd really rather we fix that instead of adding yet another hack onto the existing pile of hacks. At the very least, we need to understand this better.

t-tye added a reviewer: mjbedy.Apr 3 2021, 9:29 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly.

Let me rephrase to make sure I understood you correctly. You're saying that spilling an SGPR to a fixed lane of a VGPR may cause data of an inactive lane to be overwritten. This is a problem if the spill/reload happens in a called function, because VGPR save/reload doesn't save those inactive lanes. (HPC is irrelevant here.)

My understanding is that @sebastian-ne is working on a fix for this, see D96336.

Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR.

Wrong, it's way more overhead. Instead of copying the 32-bit value into a single lane of a VGPR, you may be copying it into up to 64 lanes. That's very wasteful. We should fix the root cause instead, which means properly saving/restoring the VGPRs that are used for SGPR-to-VGPR spilling.

hliao added a comment.Apr 5 2021, 7:06 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly.

Let me rephrase to make sure I understood you correctly. You're saying that spilling an SGPR to a fixed lane of a VGPR may cause data of an inactive lane to be overwritten. This is a problem if the spill/reload happens in a called function, because VGPR save/reload doesn't save those inactive lanes. (HPC is irrelevant here.)

We found the inactive-lane overwriting issue in an HPC workload where all functions are inlined, but the CFG is quite complicated and SGPR register pressure is very high due to pointer usage. The overwriting is observed in the pseudocode illustrated in https://reviews.llvm.org/D96980. That HPC workload runs correctly after reverting the exec-mask-agnostic SGPR spill.

My understanding is that @sebastian-ne is working on a fix for this, see D96336.

Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR.

Wrong, it's way more overhead. Instead of copying the 32-bit value into a single lane of a VGPR, you may be copying it into up to 64 lanes. That's very wasteful. We should fix the root cause instead, which means properly saving/restoring the VGPRs that are used for SGPR-to-VGPR spilling.

The alternative spill approach adds the overhead of saving/restoring a VGPR on both the SGPR spill and the reload. Compared to the wasted VGPR lanes, that newly added memory overhead is significant, especially for the SGPR spill: those spills were previously store-only but now incur an extra memory load. Considering that not all spills are reloaded later at run time, that extra load overhead on every spill cannot justify the VGPR space saved. But I do agree that spilling SGPRs to memory needs a more efficient mechanism.

hliao abandoned this revision.Apr 5 2021, 8:06 AM