This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Add a pass to avoid jumping into blocks with a 0 exec mask.
Abandoned · Public

Authored by hliao on Mar 29 2021, 7:38 AM.

Details

Summary
  • For blocks where the exec mask is restored from a reloaded mask, a zero exec mask results in undefined behavior because the SGPR reload uses v_readfirstlane. Avoid such cases by transforming s_cbranch_execz and s_cbranch_execnz into equivalent branches that do not evaluate the exec mask too eagerly.

Diff Detail

Event Timeline

hliao created this revision.Mar 29 2021, 7:38 AM
hliao requested review of this revision.Mar 29 2021, 7:38 AM
Herald added a project: Restricted Project.Mar 29 2021, 7:38 AM
foad added inline comments.Mar 29 2021, 7:46 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
19

I don't know what you mean by "relax" here.

This comment doesn't explain why you'd want to do this transformation.

hliao added a comment.EditedMar 29 2021, 7:52 AM

This is the companion fix for D96980, which explains why the exec-mask-agnostic SGPR spill/reload was proposed: SGPR spill/reload may be executed when the exec mask goes to zero.
We do try to skip executing code when the exec mask goes to zero by branching on EXECZ (to the target block) or EXECNZ (to the fall-through block), but we may still run instructions with a zero exec mask. That's usually not an issue, as the exec mask is restored immediately at the top of the target block, so the code following that restoration won't be executed with a 0 mask. However, if the restoration needs to reload a spilled exec mask, the SGPR reload itself runs with a 0 mask, and v_readfirstlane has undefined behavior when the exec mask is zero.
This patch tries to mitigate that case by not evaluating (and thus possibly clearing) the exec mask so early when the branch target restores the mask from an SGPR reload. Instead of checking EXECZ or EXECNZ, the exec mask evaluation is duplicated with a temporary SGPR as the destination (without updating the exec mask directly); checking SCC0 on that result is equivalent to checking EXECZ. The exec mask itself is only updated when the result is known to be non-zero. For instance,

    $exec = MASK_EVALUATION()
    s_cbranch_execz TARGET
    ...
TARGET:
    $mask = SGPR_RELOAD(SLOT)
    $exec = OR $exec, $exec, $mask

is translated into

    $tmp = MASK_EVALUATION()
    s_cbranch_scc0 TARGET
    $exec = MASK_EVALUATION()
    ...
TARGET:
    $mask = SGPR_RELOAD(SLOT)
    $exec = OR $exec, $exec, $mask

Note that this transformation is only applied when the mask restoration needs an SGPR reload. Since the mask restoration happens at a CFG merge point, the predecessor block's exec mask is always a subset of the mask to be restored, so it is safe to keep using that exec mask until the new one is evaluated.
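
For concreteness, here is a rough wave64 sketch of the before/after shape using real instructions (the choice of s_and_saveexec_b64 as the mask evaluation, vcc as the condition, and the registers s[0:1]/s[2:3] are made up for illustration):

    ; current lowering: exec is narrowed first, so TARGET may be entered with exec == 0
    s_and_saveexec_b64 s[0:1], vcc        ; save old exec in s[0:1], exec = exec & vcc
    s_cbranch_execz    TARGET

    ; proposed lowering: probe the would-be mask into a scratch pair first
    s_and_b64          s[2:3], exec, vcc  ; same evaluation, but only SCC and the temp are written
    s_cbranch_scc0     TARGET             ; SCC == 0 here is exactly the old EXECZ condition; on this
                                          ; path exec keeps its old (wider) value, which is a subset
                                          ; of the mask restored at TARGET, as noted above
    s_and_saveexec_b64 s[0:1], vcc        ; exec is only narrowed once it is known to be non-zero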

hliao added inline comments.Mar 29 2021, 7:54 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
19

I'll just add a comment on the motivation of this pass here and put the details in the source code.

foad added inline comments.Mar 29 2021, 9:26 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

What happens if there are no free sgprs so the scavenger has to spill something? That sounds like yet another case that won't work correctly when exec is zero.

foad added a comment.Mar 29 2021, 9:37 AM

In general I am uncomfortable about generating code that does not work (i.e. expanding spills the way we do when exec might be 0) and then running yet another pass for correctness to fix it up later. Is there a way this can be made correct by default, and if necessary run an extra pass that optimizes it for efficiency?

Anyway I am not an expert in this area. I am happy to be overruled by people who know more about it.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1231

I don't think you can put anything that inserts extra instructions after BranchRelaxation. Putting it after the hazard recognizer might be risky too. Am I right in thinking that you just want to run this after spills are lowered to real code (prologue-epilogue insertion?)?

hliao added inline comments.Mar 29 2021, 9:56 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

Yeah, that's true. If we had a null-like register pair, we could also avoid searching for that temporary SGPR. We have a null register definition, but we need an SGPR pair for the wave64 case. As we only duplicate the evaluation to update SCC, a null register pair would be sufficient.

rampitec added inline comments.Mar 29 2021, 9:58 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

NULL works for a 64-bit pair too, although it is not available on every target.

hliao added inline comments.Mar 29 2021, 9:59 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
1231

Branch relaxation may convert an EXECZ branch into an EXECNZ branch or vice versa. Running this pass after branch relaxation was simply a choice made so that it runs after the compiler has made its final decision on which branch to use. I am open to other places to run this pass to relax the branches.

hliao added inline comments.Mar 29 2021, 10:03 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

Could you elaborate more? SGPR_NULL is currently defined as a 32-bit SGPR at offset 125. For a 64-bit SGPR pair, we need a pair at an *even* offset according to the ISA document.

rampitec added inline comments.Mar 29 2021, 10:23 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

It is not a real register, it is just a way to encode 0. It is even free in terms of the constant bus usage. It can be used as 64 bit too:

llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 's_mov_b64 s[0:1], null'
        .text
        s_mov_b64 s[0:1], null                  ; encoding: [0x7d,0x04,0x80,0xbe]

Not sure if you would need to fix something in the verifier.

But again, this is not a universal solution, it is gfx10 only.

hliao added inline comments.Mar 29 2021, 10:30 AM
llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
220

It is not a real register, it is just a way to encode 0. It is even free in terms of the constant bus usage. It can be used as 64 bit too:

llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 's_mov_b64 s[0:1], null'
        .text
        s_mov_b64 s[0:1], null                  ; encoding: [0x7d,0x04,0x80,0xbe]

Not sure if you would need to fix something in the verifier.

But again, this is not a universal solution, it is gfx10 only.

That explains why I cannot find that usage in the Vega ISA document. From the scalar operand encoding map, it seems we have several reserved slots, such as 209-234.

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

What's the bug you refer to? I am not aware of any issue in CFG lowering, except that the direct uses of EXECZ/EXECNZ concern me. If the target or fall-through block of those EXECZ/EXECNZ branches restores the mask immediately, that's fine. But we run into cases where the mask to be restored needs to be reloaded first.

I'm not comfortable adding a pass to fixup a bug in control flow lowering. I think we just need to actually try to model divergent predecessors/successors explicitly

What's the bug you refer to? I am not aware of any issue in CFG lowering, except that the direct uses of EXECZ/EXECNZ concern me. If the target or fall-through block of those EXECZ/EXECNZ branches restores the mask immediately, that's fine. But we run into cases where the mask to be restored needs to be reloaded first.

Instead of having a fixup patch to avoid cases where this happens, we should have the infrastructure to stop this from happening in the first place

Instead of having a fixup patch to avoid cases where this happens, we should have the infrastructure to stop this from happening in the first place

This is my position as well.

llvm/lib/Target/AMDGPU/SIAvoidZeroExecMask.cpp
201

Presumably this needs to depend on IsWave32.

221

This also needs to depend on IsWave32.

For blocks where the exec mask is restored from a reloaded mask, a zero exec mask results in undefined behavior because the SGPR reload uses v_readfirstlane

Can we mark the SGPRs holding the masks as unspillable during LowerControlFlow to fix your problem? After we split SGPR/VGPR allocation, this problem would disappear.

hliao added a comment.Mar 30 2021, 6:11 AM

It seems to me that we may need to revise CFG lowering so that it avoids updating EXEC directly, deciding based on whether the mask to be restored needs reloading or not. Here's the brief idea:

  • Instead of lowering the CFG early, before RA, lower it after RA. As a byproduct, this also removes the need for the "terminator" versions of the exec mask manipulation instructions.
  • When the CFG is being lowered, update EXEC eagerly if the merge point doesn't need to reload the mask; otherwise, translate the branch as we currently do.

Any suggestions and comments?

It seems to me that we may need to revise CFG lowering so that it avoids updating EXEC directly, deciding based on whether the mask to be restored needs reloading or not. Here's the brief idea:

  • Instead of lowering the CFG early, before RA, lower it after RA. As a byproduct, this also removes the need for the "terminator" versions of the exec mask manipulation instructions.
  • When the CFG is being lowered, update EXEC eagerly if the merge point doesn't need to reload the mask; otherwise, translate the branch as we currently do.

Any suggestions and comments?

I think this requires a lot more thought. I think we need deeper IR changes. I believe MachineBasicBlock needs to start tracking both uniform and divergent predecessors/successors, but I'm not sure what this ends up looking like. I think the terminators should still reflect the hardware instructions, and the exec issues would be tracked through the divergent pred/succ.

nhaehnle requested changes to this revision.Apr 1 2021, 3:12 PM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.
  2. Load directly from memory using an SMEM load instruction.

Both types of reload should work just fine with exec=0.

Keeping a lane mask in a VGPR is fundamentally a nonsensical thing to do because it clashes with the whole theory of how different types of data (uniform vs. divergent) are represented in AMDGPU's implementation of SIMT. So I'd really rather we fix that instead of adding yet another hack onto the existing pile of hacks. At the very least, we need to understand this better.

This revision now requires changes to proceed.Apr 1 2021, 3:12 PM
hliao added a comment.Apr 3 2021, 8:16 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly. Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR. As a result, we need V_READFIRSTLANE when the SGPR is reloaded, and an exec mask of 0 makes that V_READFIRSTLANE undefined, so we need to ensure a proper exec mask is in place.
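
To make the difference concrete, here is a rough sketch of the two spill styles (v40, s4, and lane 3 are made-up names for illustration; the actual spill lowering may differ in detail):

    ; (1) fixed-lane spill/reload: ignores exec, but clobbers lane 3 of v40, which may hold a
    ;     live value of an inactive lane unless v40 itself is saved/restored
    v_writelane_b32     v40, s4, 3      ; spill s4 into lane 3 of v40
    v_readlane_b32      s4, v40, 3      ; reload from lane 3; works even with exec == 0

    ; (2) active-lane spill/reload (the scheme discussed here): only the active lanes are written
    v_mov_b32           v40, s4         ; spill: broadcast s4 into the active lanes of v40
    v_readfirstlane_b32 s4, v40         ; reload: read the lowest active lane; undefined if exec == 0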

  2. Load directly from memory using an SMEM load instruction.

Both types of reload should work just fine with exec=0.

Keeping a lane mask in a VGPR is fundamentally a nonsensical thing to do because it clashes with the whole theory of how different types of data (uniform vs. divergent) are represented in AMDGPU's implementation of SIMT. So I'd really rather we fix that instead of adding yet another hack onto the existing pile of hacks. At the very least, we need to understand this better.

t-tye added a reviewer: mjbedy.Apr 3 2021, 9:29 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly.

Let me rephrase to make sure I understood you correctly. You're saying that spilling an SGPR to a fixed lane of a VGPR may cause data of an inactive lane to be overwritten. This is a problem if the spill/reload happens in a called function, because VGPR save/reload doesn't save those inactive lanes. (HPC is irrelevant here.)

My understanding is that @sebastian-ne is working on a fix for this, see D96336.

Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR.

Wrong, it's way more overhead. Instead of copying the 32-bit value into a single lane of a VGPR, you may be copying it into up to 64 lanes. That's very wasteful. We should fix the root cause instead, which means properly saving/restoring the VGPRs that are used for SGPR-to-VGPR spilling.

hliao added a comment.Apr 5 2021, 7:06 AM

I think this requires a lot more thought.

+1

What I'd like to know: why are we reloading a lane mask via V_READFIRSTLANE in the first place? I would expect one of two types of reload:

  1. Load from a fixed lane of a VGPR using V_READLANE.

That depends on whether we spill an SGPR by writing to a fixed lane or to the active lanes. With the first approach, without saving/restoring the VGPR, we overwrite live values in the inactive lanes; HPC workloads are hit by that issue and do not run correctly.

Let me rephrase to make sure I understood you correctly. You're saying that spilling an SGPR to a fixed lane of a VGPR may cause data of an inactive lane to be overwritten. This is a problem if the spill/reload happens in a called function, because VGPR save/reload doesn't save those inactive lanes. (HPC is irrelevant here.)

We found the inactive-lane overwriting issue in an HPC workload where all functions are inlined, but the CFG is quite complicated and SGPR register pressure is very high due to pointer usage. The overwriting is observed in the pseudocode illustrated in https://reviews.llvm.org/D96980. That HPC workload runs correctly after reverting the exec-mask-agnostic SGPR spill.

My understanding is that @sebastian-ne is working on a fix for this, see D96336.

Writing into the active lanes, by contrast, doesn't require saving/restoring those lanes, as they are actively tracked by RA. That minimizes the overhead when you have to spill an SGPR.

Wrong, it's way more overhead. Instead of copying the 32-bit value into a single lane of a VGPR, you may be copying it into up to 64 lanes. That's very wasteful. We should fix the root cause instead, which means properly saving/restoring the VGPRs that are used for SGPR-to-VGPR spilling.

The alternative spill approach adds the overhead of saving/restoring a VGPR on both the SGPR spill and the reload. Compared to the wasted VGPR lanes, that newly added memory overhead is significant, especially for the SGPR spill: those spills were previously store-only but now incur an extra memory load. Considering that not all spills are reloaded later at run time, that extra load overhead on every spill cannot justify the VGPR space saved. But I do agree that spilling SGPRs to memory needs a more efficient mechanism.

hliao abandoned this revision.Apr 5 2021, 8:06 AM