This is an archive of the discontinued LLVM Phabricator instance.

Differential D124671

[AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions.
ClosedPublic

Authored by kosarev on Apr 29 2022, 6:14 AM.

Details

Reviewers

foad
dstuttard
tpr
nhaehnle
tsymalla

Commits

rG57c943d5811d: [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU…

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kosarev created this revision. Apr 29 2022, 6:14 AM

Herald added a project: Restricted Project. Apr 29 2022, 6:14 AM

Herald added subscribers: hsmhsm, foad, kerbowa and 9 others.

kosarev requested review of this revision. Apr 29 2022, 6:14 AM

Herald added a project: Restricted Project. Apr 29 2022, 6:14 AM

Herald added subscribers: llvm-commits, wdng.

kosarev added reviewers: foad, dstuttard, tpr, nhaehnle, tsymalla. Apr 29 2022, 6:18 AM

tsymalla added inline comments. Apr 29 2022, 7:30 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
127	You could early opt-out when `VALUInstsThreshold == 0` (at the beginning), is that correct?

Harbormaster completed remote builds in B161971: Diff 426032. Apr 29 2022, 8:48 AM

arsenm added inline comments. Apr 29 2022, 10:58 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31	Should we be counting cycles instead of instructions?
132	You're counting VALU instructions here and above?
205–206	Why not construct directly at the insert point?

Updated as suggested.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31	We don't know yet, I'm afraid. For the couple of real use cases we have for the issue this pass tries to address, counting instructions looks sufficient. Given that proper cycle counting might not be trivial, we risk over-engineering here.
127	We could, but I'm not sure I know how that might be useful in practice.
132	Yes, that's not good. Combined the two new loops into one and simplified related code. Thanks.
205–206	Well, the instruction we create here is not just an auxiliary value that we coincidentally happen to use in both cases, if that's what you mean. That instruction must be inserted in all cases, and we do want it to be the same instruction.

Harbormaster completed remote builds in B162420: Diff 426646. May 3 2022, 5:49 AM

Ping.

Herald added a subscriber: jsilvanus. May 10 2022, 6:01 AM

arsenm added inline comments. May 17 2022, 2:46 PM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
205–206	But you can construct it in the right place rather than constructing the instruction and then inserting after. You have the insert point you want, you can just construct it there? It's unusual to need insert/insertAfter

kosarev added inline comments. May 18 2022, 1:01 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
205–206	I'm not sure I see how this answers the point. Granted, constructing the instruction at the points of insertion is possible, but as I said, in this case we have reasons to create it in a single place.

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
126	If you're ignoring loops, can you describe succinctly what you are counting? Is it something like the minimum number of VALU instructions along any path from the start of the function to the VMEM load in question?

foad added inline comments. May 18 2022, 2:50 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
205–206	I agree with Matt that it would be more normal to construct-and-insert at the same time, though I don't feel very strongly about it. I think I suggested before that you could pass the insertion point into BuildSetprioMI. In this case the insertion point would be something like `MBBInfos[MBB].LastVMEMLoad ? std::next(MBBInfos[MBB].LastVMEMLoad) : MBB->begin()`.
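The construct-at-the-insertion-point idea suggested above can be sketched outside LLVM roughly as follows. This is only a toy model: `Block`, `buildSetprio` and `lowerPriority` are hypothetical names, a block is modelled as a `std::list` of instruction strings rather than a `MachineBasicBlock`, and the priority value 0 is taken from the `s_setprio 0` checks in the test file.

```cpp
#include <iterator>
#include <list>
#include <optional>
#include <string>

// Toy stand-in for a machine basic block: a list of instruction mnemonics.
using Block = std::list<std::string>;

// Hypothetical analogue of BuildSetprioMI(MBB, I, priority): the instruction
// is created directly at the insertion point instead of being built first
// and inserted later.
Block::iterator buildSetprio(Block &B, Block::iterator I, unsigned Priority) {
  return B.insert(I, "s_setprio " + std::to_string(Priority));
}

// Picks the insertion point: right after the last VMEM load if the block
// has one, otherwise the beginning of the block.
Block::iterator lowerPriority(Block &B,
                              std::optional<Block::iterator> LastVMEMLoad) {
  Block::iterator InsertPt =
      LastVMEMLoad ? std::next(*LastVMEMLoad) : B.begin();
  return buildSetprio(B, InsertPt, 0);
}
```

With a block `{"v_add", "buffer_load", "v_mul"}` and an iterator to the load, the new instruction lands between the load and `v_mul`; with no load it lands at the front of the block.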

Reworked as suggested.

In D124671#3521606, @foad wrote:

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

Right, that's the hardest part of it, isn't it? Because we haven't been given many use cases, the idea was to start by replicating more or less the same counting logic we have in the other compiler, and then adjust things using feedback from people who can give it some proper testing.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
126	Good point. Amended the comment.
205–206	OK, changed.

Harbormaster completed remote builds in B165129: Diff 430382. May 18 2022, 8:40 AM

I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".

Instead of MBBInfo::LastVMEMLoad I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean to indicate whether a VMEM load can have been issued at the end of each basic block.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31	Could this also be configured via a function attribute? Those are less problematic than command-line options in a driver context.
134–135	I believe NumVALUInsts should be reset to 0 here. It should probably also be reset at a number of other events, in particular DS instructions. The reasoning is that we want to lower priority just before running a long dense block of VALU, so that other waves have a better chance of running address calculation VALU.

kosarev added inline comments. Aug 8 2022, 4:44 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
134–135	So would something like this work for the purpose? if (SIInstrInfo::isVALU(MI)) ++NumVALUInsts; else NumVALUInsts = 0;

Updated to count VALU instructions that follow VMEM loads and to add
support for threshold function attributes.
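A minimal sketch of the threshold override, assuming a plain optional string in place of LLVM's `Attribute` class. `resolveThreshold` and `DefaultThreshold` are hypothetical names; the default of 100 mirrors `cl::init(100)` in the diff, and the string stands for the value of the `amdgpu-wave-priority-threshold` function attribute. Unlike the `getAsInteger(0, ...)` call in the diff, this sketch only parses decimal values.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical stand-in for the command-line default (cl::init(100)).
constexpr unsigned DefaultThreshold = 100;

// Returns the per-function threshold if the attribute value is present and
// parses completely as a decimal unsigned integer; otherwise keeps the
// default.
unsigned resolveThreshold(const std::optional<std::string> &AttrValue) {
  unsigned Threshold = DefaultThreshold;
  if (!AttrValue)
    return Threshold;

  unsigned Parsed = 0;
  const char *Begin = AttrValue->data();
  const char *End = Begin + AttrValue->size();
  auto [Ptr, Ec] = std::from_chars(Begin, End, Parsed);
  if (Ec == std::errc() && Ptr == End)
    Threshold = Parsed;
  return Threshold;
}
```

A function without the attribute keeps the command-line default, while the tests in this revision set values such as "1", "2" and "3" per function.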

In D124671#3698865, @nhaehnle wrote:

I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".

Instead of MBBInfo::LastVMEMLoad I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean to indicate whether a VMEM load can have been issued at the end of each basic block.

I think knowing whether we are at a point past a VMEM load wouldn't help, as there may be another VMEM load further down the control flow. There is a similar problem with the last sequence break: by the point where the sequence finally becomes 'long enough', there may be several points at which it could have begun.

If I'm not wrong that there is no harm in lowering the priority immediately after VMEM loads, it seems it's still easier implementation-wise to track them and not other things. The updated version just makes sure we only consider VMEM loads that are followed by VALU sequences of the required length. Then the rest of the logic remains the same.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31	Done.
134–135	Updated to break the sequence on DS instructions.
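The per-block counting discussed in this thread can be sketched without LLVM as below. It follows the loop shown in the diff, but on a toy instruction stream: `Kind`, `BlockScan` and `scanBlock` are hypothetical names, and the real pass classifies instructions with `isVMEMLoad`, `SIInstrInfo::isDS` and `SIInstrInfo::isVALU` instead of an enum.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Toy instruction kinds standing in for the real instruction queries.
enum class Kind { VALU, VMEMLoad, DS, Other };

struct BlockScan {
  // Leading VALU count; stops growing at the first DS instruction and is
  // zeroed by any VMEM load.
  unsigned NumVALUAtStart = 0;
  // Longest VALU run closed by a DS instruction since the last VMEM load.
  unsigned MaxNumVALUInMiddle = 0;
  // VALU run still open when the block ends.
  unsigned NumVALUAtEnd = 0;
  // Index of the last VMEM load in the block, if any.
  std::optional<std::size_t> LastVMEMLoad;
};

BlockScan scanBlock(const std::vector<Kind> &Insts) {
  BlockScan S;
  bool AtStart = true;
  for (std::size_t I = 0; I < Insts.size(); ++I) {
    switch (Insts[I]) {
    case Kind::VMEMLoad:
      // Only VALU sequences after the last VMEM load matter, so a load
      // resets every counter.
      AtStart = false;
      S.NumVALUAtStart = 0;
      S.MaxNumVALUInMiddle = 0;
      S.NumVALUAtEnd = 0;
      S.LastVMEMLoad = I;
      break;
    case Kind::DS:
      // A DS instruction breaks the current dense VALU sequence.
      AtStart = false;
      S.MaxNumVALUInMiddle = std::max(S.MaxNumVALUInMiddle, S.NumVALUAtEnd);
      S.NumVALUAtEnd = 0;
      break;
    case Kind::VALU:
      if (AtStart)
        ++S.NumVALUAtStart;
      ++S.NumVALUAtEnd;
      break;
    case Kind::Other:
      break;
    }
  }
  return S;
}
```

Note that in this sketch, as in the diff, instructions that are neither VALU, DS nor VMEM loads do not break a sequence.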

Update commit description.

kosarev edited the summary of this revision. Aug 12 2022, 4:27 AM

Harbormaster completed remote builds in B180885: Diff 452140. Aug 12 2022, 5:05 AM

Update commit title as well.

kosarev retitled this revision from [AMDGPU] Do not raise wave priority beyond a specific number of VALU instructions. to [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions. Aug 12 2022, 5:59 AM

Harbormaster completed remote builds in B180901: Diff 452164. Aug 12 2022, 7:06 AM

Ping.

nhaehnle added inline comments. Aug 31 2022, 3:51 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
142–143	Why are you only counting if AtStart? The idea was: find places where a VMEM load is followed by a long sequence of dense VALU. Lower priority between that VMEM load and the dense VALU.

nhaehnle added inline comments. Sep 1 2022, 2:35 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
142–143	Thank you for the offline clarification, I understand this better now.
159–163	If the winner is MaxNumVALUInstsInMiddle, I believe the relevant VMEMLoad should be the VMEMLoad that appeared just before the corresponding sequence of VALU instead of the last one. I think that instead of Info.LastVMEMLoad, we'd want to have Info.LastVMEMLoadBeforeLongVALU (feel free to think of a better name), which is only set here at the end if the threshold is exceeded. The loop above would keep track of the most recently seen VMEMLoad as well as the one corresponding to the longest VALU sequence in the middle so far.

kosarev added inline comments. Sep 1 2022, 4:53 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
159–163	The ticket says the priority is supposed to be lowered after the last VMEM load, and this updated implementation has been prepared on the assumption that counting VALUs doesn't affect that principle. So if this point stands, then I understand we should not be interested in any VALU sequences preceding VMEM loads, even if they are long enough. In terms of the code, as we reset `MaxNumVALUInstsInMiddle` every time we run into a VMEM load, I'm not sure I see how an in-the-middle sequence followed by a VMEM load can possibly be the winner. `Info.LastVMEMLoad` storing the actual last VMEM load in the block should not be a problem because it is the `Info.MayReachVMEMLoad` flag that the following analysis takes into account. And speaking of naming, I'm not perfectly happy with the name of that flag, as it actually means 'may reach any of the last VMEM loads that precede a long-enough sequence of VALU instructions'. I would appreciate it if anyone can suggest a better alternative of a practical length.

nhaehnle added inline comments. Sep 1 2022, 4:59 AM

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
159–163	I did think the policy was supposed to be "lower priority before the longest sequence of VALU instructions, if the sequence length crosses a certain threshold". But we can start with "lower after last VMEM load, if subsequent VALU sequence is long enough". So then, you're right and the code is okay as-is.
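The policy settled on here ("lower after last VMEM load, if subsequent VALU sequence is long enough") corresponds to the block-level step at the end of the post-order loop in the diff. A rough sketch of that step, assuming the per-block counters have already been computed: `LocalScan`, `BlockSummary` and `summarize` are hypothetical names, and `HasVMEMLoad` stands in for the `Info.LastVMEMLoad` null check.

```cpp
#include <algorithm>
#include <vector>

// What a block exposes to its predecessors.
struct BlockSummary {
  unsigned NumVALUAtStart = 0;
  bool MayReachVMEMLoad = false;
};

// Per-block counters gathered by scanning the block's own instructions.
struct LocalScan {
  bool AtStart;                // no VMEM load or DS seen in the block
  unsigned NumVALUAtStart;     // leading VALU count
  unsigned MaxNumVALUInMiddle; // longest closed VALU run after the last load
  unsigned NumVALUAtEnd;       // VALU run still open at the block end
  bool HasVMEMLoad;            // block contains at least one VMEM load
};

// Combines a block's own counters with its successors' summaries. Blocks
// are expected to be visited in post order, so successor summaries exist
// already whenever no backedge is involved.
BlockSummary summarize(LocalScan Local, const std::vector<BlockSummary> &Succs,
                       unsigned Threshold) {
  bool SuccsMayReachVMEMLoad = false;
  unsigned NumFollowingVALU = 0;
  for (const BlockSummary &S : Succs) {
    SuccsMayReachVMEMLoad |= S.MayReachVMEMLoad;
    NumFollowingVALU = std::max(NumFollowingVALU, S.NumVALUAtStart);
  }

  // The trailing VALU run continues into the successors.
  if (Local.AtStart)
    Local.NumVALUAtStart += NumFollowingVALU;
  Local.NumVALUAtEnd += NumFollowingVALU;

  unsigned MaxNumVALU =
      std::max(Local.MaxNumVALUInMiddle, Local.NumVALUAtEnd);

  BlockSummary Out;
  Out.NumVALUAtStart = Local.NumVALUAtStart;
  Out.MayReachVMEMLoad = SuccsMayReachVMEMLoad ||
                         (Local.HasVMEMLoad && MaxNumVALU >= Threshold);
  return Out;
}
```

For example, a block ending in a VMEM load followed by one VALU instruction does not qualify on its own with a threshold of 2, but it does once a successor contributes one more leading VALU instruction.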

@nhaehnle Are we fine with accepting and submitting this? Or do we want to give this some testing beforehand? Also, I'm not quite sure whether the pass is supposed to be enabled by default.

Yes, I think it's fine.

This revision is now accepted and ready to land. Sep 7 2022, 1:03 PM

This revision was landed with ongoing or failed builds. Sep 8 2022, 7:22 AM

Closed by commit rG57c943d5811d: [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU… (authored by kosarev).

This revision was automatically updated to reflect the committed changes.

kosarev added a commit: rG57c943d5811d: [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU….

Revision Contents

Path

Size

llvm/

lib/

Target/

AMDGPU/

AMDGPUSetWavePriority.cpp

99 lines

test/

CodeGen/

AMDGPU/

set-wave-priority.ll

79 lines

Diff 458732

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp

Show All 20 Lines
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Support/Allocator.h"		#include "llvm/Support/Allocator.h"

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "amdgpu-set-wave-priority"		#define DEBUG_TYPE "amdgpu-set-wave-priority"

		static cl::opt<unsigned> DefaultVALUInstsThreshold(
		"amdgpu-set-wave-priority-valu-insts-threshold",
		cl::desc("VALU instruction count threshold for adjusting wave priority"),
		cl::init(100), cl::Hidden);

namespace {		namespace {

struct MBBInfo {		struct MBBInfo {
MBBInfo() = default;		MBBInfo() = default;
		unsigned NumVALUInstsAtStart = 0;
bool MayReachVMEMLoad = false;		bool MayReachVMEMLoad = false;
		MachineInstr *LastVMEMLoad = nullptr;
};		};

using MBBInfoSet = DenseMap<const MachineBasicBlock *, MBBInfo>;		using MBBInfoSet = DenseMap<const MachineBasicBlock *, MBBInfo>;

class AMDGPUSetWavePriority : public MachineFunctionPass {		class AMDGPUSetWavePriority : public MachineFunctionPass {
public:		public:
static char ID;		static char ID;

AMDGPUSetWavePriority() : MachineFunctionPass(ID) {}		AMDGPUSetWavePriority() : MachineFunctionPass(ID) {}

StringRef getPassName() const override { return "Set wave priority"; }		StringRef getPassName() const override { return "Set wave priority"; }

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;

private:		private:
MachineInstr *BuildSetprioMI(MachineFunction &MF, unsigned priority) const;		MachineInstr *BuildSetprioMI(MachineBasicBlock &MBB,
		MachineBasicBlock::iterator I,
		unsigned priority) const;

const SIInstrInfo *TII;		const SIInstrInfo *TII;
};		};

} // End anonymous namespace.		} // End anonymous namespace.

INITIALIZE_PASS(AMDGPUSetWavePriority, DEBUG_TYPE, "Set wave priority", false,		INITIALIZE_PASS(AMDGPUSetWavePriority, DEBUG_TYPE, "Set wave priority", false,
false)		false)

char AMDGPUSetWavePriority::ID = 0;		char AMDGPUSetWavePriority::ID = 0;

FunctionPass *llvm::createAMDGPUSetWavePriorityPass() {		FunctionPass *llvm::createAMDGPUSetWavePriorityPass() {
return new AMDGPUSetWavePriority();		return new AMDGPUSetWavePriority();
}		}

MachineInstr *AMDGPUSetWavePriority::BuildSetprioMI(MachineFunction &MF,		MachineInstr *
		AMDGPUSetWavePriority::BuildSetprioMI(MachineBasicBlock &MBB,
		MachineBasicBlock::iterator I,
unsigned priority) const {		unsigned priority) const {
return BuildMI(MF, DebugLoc(), TII->get(AMDGPU::S_SETPRIO)).addImm(priority);		return BuildMI(MBB, I, DebugLoc(), TII->get(AMDGPU::S_SETPRIO))
		.addImm(priority);
}		}

// Checks that for every predecessor Pred that can reach a VMEM load,		// Checks that for every predecessor Pred that can reach a VMEM load,
// none of Pred's successors can reach a VMEM load.		// none of Pred's successors can reach a VMEM load.
static bool CanLowerPriorityDirectlyInPredecessors(const MachineBasicBlock &MBB,		static bool CanLowerPriorityDirectlyInPredecessors(const MachineBasicBlock &MBB,
MBBInfoSet &MBBInfos) {		MBBInfoSet &MBBInfos) {
for (const MachineBasicBlock *Pred : MBB.predecessors()) {		for (const MachineBasicBlock *Pred : MBB.predecessors()) {
if (!MBBInfos[Pred].MayReachVMEMLoad)		if (!MBBInfos[Pred].MayReachVMEMLoad)
Show All 16 Lines	bool AMDGPUSetWavePriority::runOnMachineFunction(MachineFunction &MF) {

Function &F = MF.getFunction();		Function &F = MF.getFunction();
if (skipFunction(F) || !AMDGPU::isEntryFunctionCC(F.getCallingConv()))		if (skipFunction(F) || !AMDGPU::isEntryFunctionCC(F.getCallingConv()))
return false;		return false;

const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();		const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();

		unsigned VALUInstsThreshold = DefaultVALUInstsThreshold;
		Attribute A = F.getFnAttribute("amdgpu-wave-priority-threshold");
		if (A.isValid())
		A.getValueAsString().getAsInteger(0, VALUInstsThreshold);

		// Find VMEM loads that may be executed before long-enough sequences of
		// VALU instructions. We currently assume that backedges/loops, branch
		// probabilities and other details can be ignored, so we essentially
		// determine the largest number of VALU instructions along every
		// possible path from the start of the function that may potentially be
		// executed provided no backedge is ever taken.
MBBInfoSet MBBInfos;		MBBInfoSet MBBInfos;
SmallVector<const MachineBasicBlock *, 16> Worklist;		for (MachineBasicBlock *MBB : post_order(&MF)) {
for (MachineBasicBlock &MBB : MF) {
if (any_of(MBB, isVMEMLoad))
Worklist.push_back(&MBB);
}

// Mark blocks from which control may reach VMEM loads.
while (!Worklist.empty()) {
const MachineBasicBlock *MBB = Worklist.pop_back_val();
MBBInfo &Info = MBBInfos[MBB];		MBBInfo &Info = MBBInfos[MBB];
if (!Info.MayReachVMEMLoad) {		bool AtStart = true;
Info.MayReachVMEMLoad = true;		unsigned MaxNumVALUInstsInMiddle = 0;
Worklist.append(MBB->pred_begin(), MBB->pred_end());		unsigned NumVALUInstsAtEnd = 0;
		for (MachineInstr &MI : *MBB) {
		if (isVMEMLoad(MI)) {
		AtStart = false;
		Info.NumVALUInstsAtStart = 0;
		MaxNumVALUInstsInMiddle = 0;
		NumVALUInstsAtEnd = 0;
		Info.LastVMEMLoad = &MI;
		} else if (SIInstrInfo::isDS(MI)) {
		AtStart = false;
		MaxNumVALUInstsInMiddle =
		std::max(MaxNumVALUInstsInMiddle, NumVALUInstsAtEnd);
		NumVALUInstsAtEnd = 0;
		} else if (SIInstrInfo::isVALU(MI)) {
		if (AtStart)
		++Info.NumVALUInstsAtStart;
		++NumVALUInstsAtEnd;
}		}
}		}

		bool SuccsMayReachVMEMLoad = false;
		unsigned NumFollowingVALUInsts = 0;
		for (const MachineBasicBlock *Succ : MBB->successors()) {
		SuccsMayReachVMEMLoad |= MBBInfos[Succ].MayReachVMEMLoad;
		NumFollowingVALUInsts =
		std::max(NumFollowingVALUInsts, MBBInfos[Succ].NumVALUInstsAtStart);
		}
		if (AtStart)
		Info.NumVALUInstsAtStart += NumFollowingVALUInsts;
		NumVALUInstsAtEnd += NumFollowingVALUInsts;

		unsigned MaxNumVALUInsts =
		std::max(MaxNumVALUInstsInMiddle, NumVALUInstsAtEnd);
		Info.MayReachVMEMLoad =
		SuccsMayReachVMEMLoad ||
		(Info.LastVMEMLoad && MaxNumVALUInsts >= VALUInstsThreshold);
		}

MachineBasicBlock &Entry = MF.front();		MachineBasicBlock &Entry = MF.front();
if (!MBBInfos[&Entry].MayReachVMEMLoad)		if (!MBBInfos[&Entry].MayReachVMEMLoad)
return false;		return false;

// Raise the priority at the beginning of the shader.		// Raise the priority at the beginning of the shader.
MachineBasicBlock::iterator I = Entry.begin(), E = Entry.end();		MachineBasicBlock::iterator I = Entry.begin(), E = Entry.end();
while (I != E && !SIInstrInfo::isVALU(*I) && !I->isTerminator())		while (I != E && !SIInstrInfo::isVALU(*I) && !I->isTerminator())
++I;		++I;
Entry.insert(I, BuildSetprioMI(MF, HighPriority));		BuildSetprioMI(Entry, I, HighPriority);

// Lower the priority on edges where control leaves blocks from which		// Lower the priority on edges where control leaves blocks from which
// VMEM loads are reachable.		// the VMEM loads are reachable.
SmallSet<MachineBasicBlock *, 16> PriorityLoweringBlocks;		SmallSet<MachineBasicBlock *, 16> PriorityLoweringBlocks;
for (MachineBasicBlock &MBB : MF) {		for (MachineBasicBlock &MBB : MF) {
if (MBBInfos[&MBB].MayReachVMEMLoad) {		if (MBBInfos[&MBB].MayReachVMEMLoad) {
if (MBB.succ_empty())		if (MBB.succ_empty())
PriorityLoweringBlocks.insert(&MBB);		PriorityLoweringBlocks.insert(&MBB);
continue;		continue;
}		}

Show All 10 Lines	for (MachineBasicBlock &MBB : MF) {
// place or the loop simplification/canonicalization pass should have		// place or the loop simplification/canonicalization pass should have
// already tried to split the edge and insert a preheader, and if for		// already tried to split the edge and insert a preheader, and if for
// whatever reason it failed to do so, then this leaves us with the		// whatever reason it failed to do so, then this leaves us with the
// only option of lowering the priority within the loop.		// only option of lowering the priority within the loop.
PriorityLoweringBlocks.insert(&MBB);		PriorityLoweringBlocks.insert(&MBB);
}		}

for (MachineBasicBlock *MBB : PriorityLoweringBlocks) {		for (MachineBasicBlock *MBB : PriorityLoweringBlocks) {
MachineBasicBlock::iterator I = MBB->end(), B = MBB->begin();		BuildSetprioMI(
while (I != B) {		*MBB,
if (isVMEMLoad(*--I)) {		MBBInfos[MBB].LastVMEMLoad
++I;		? std::next(MachineBasicBlock::iterator(MBBInfos[MBB].LastVMEMLoad))
break;		: MBB->begin(),
}		LowPriority);
}
MBB->insert(I, BuildSetprioMI(MF, LowPriority));
}		}

return true;		return true;
}		}

llvm/test/CodeGen/AMDGPU/set-wave-priority.ll

	; RUN: llc -mtriple=amdgcn -amdgpu-set-wave-priority=true -o - %s | \			; RUN: llc -mtriple=amdgcn -amdgpu-set-wave-priority=true -o - %s | \
	; RUN: FileCheck %s			; RUN: FileCheck %s

	; CHECK-LABEL: no_setprio:			; CHECK-LABEL: no_setprio:
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: ; return to shader part epilog			; CHECK: ; return to shader part epilog
	define amdgpu_ps <2 x float> @no_setprio() {			define amdgpu_ps <2 x float> @no_setprio(<2 x float> %a, <2 x float> %b) "amdgpu-wave-priority-threshold"="1" {
	ret <2 x float> <float 0.0, float 0.0>			%s = fadd <2 x float> %a, %b
				ret <2 x float> %s
	}			}

	; CHECK-LABEL: vmem_in_exit_block:			; CHECK-LABEL: vmem_in_exit_block:
	; CHECK: s_setprio 3			; CHECK: s_setprio 3
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: ; return to shader part epilog			; CHECK: ; return to shader part epilog
	define amdgpu_ps <2 x float> @vmem_in_exit_block(<4 x i32> inreg %p) {			define amdgpu_ps <2 x float> @vmem_in_exit_block(<4 x i32> inreg %p, <2 x float> %x) "amdgpu-wave-priority-threshold"="2" {
	%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)			%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
	ret <2 x float> %v			%s = fadd <2 x float> %v, %x
				ret <2 x float> %s
	}			}

	; CHECK-LABEL: branch:			; CHECK-LABEL: branch:
	; CHECK: s_setprio 3			; CHECK: s_setprio 3
	; CHECK: s_cbranch_scc0 [[A:.*]]			; CHECK: s_cbranch_scc0 [[A:.*]]
	; CHECK: {{.*}}: ; %b			; CHECK: {{.*}}: ; %b
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: s_branch [[EXIT:.*]]			; CHECK: s_branch [[EXIT:.*]]
	; CHECK: [[A]]: ; %a			; CHECK: [[A]]: ; %a
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: s_branch [[EXIT]]			; CHECK: s_branch [[EXIT]]
	; CHECK-NEXT: [[EXIT]]:			; CHECK-NEXT: [[EXIT]]:
	define amdgpu_ps <2 x float> @branch(<4 x i32> inreg %p, i32 inreg %i) {			define amdgpu_ps <2 x float> @branch(<4 x i32> inreg %p, i32 inreg %i, <2 x float> %x) "amdgpu-wave-priority-threshold"="2" {
	%cond = icmp eq i32 %i, 0			%cond = icmp eq i32 %i, 0
	br i1 %cond, label %a, label %b			br i1 %cond, label %a, label %b

	a:			a:
	ret <2 x float> <float 0.0, float 0.0>			ret <2 x float> <float 0.0, float 0.0>

	b:			b:
	%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)			%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
	ret <2 x float> %v			%s = fadd <2 x float> %v, %x
				ret <2 x float> %s
	}			}

	; CHECK-LABEL: setprio_follows_setprio:			; CHECK-LABEL: setprio_follows_setprio:
	; CHECK: s_setprio 3			; CHECK: s_setprio 3
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK: s_cbranch_scc1 [[C:.*]]			; CHECK: s_cbranch_scc1 [[C:.*]]
	; CHECK: {{.*}}: ; %a			; CHECK: {{.*}}: ; %a
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: s_cbranch_scc1 [[C]]			; CHECK: s_cbranch_vccnz [[C]]
	; CHECK: {{.*}}: ; %b			; CHECK: {{.*}}: ; %b
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: s_branch [[EXIT:.*]]			; CHECK: s_branch [[EXIT:.*]]
	; CHECK: [[C]]: ; %c			; CHECK: [[C]]: ; %c
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: s_branch [[EXIT]]			; CHECK: s_branch [[EXIT]]
	; CHECK: [[EXIT]]:			; CHECK: [[EXIT]]:
	define amdgpu_ps <2 x float> @setprio_follows_setprio(<4 x i32> inreg %p, i32 inreg %i) {			define amdgpu_ps <2 x float> @setprio_follows_setprio(<4 x i32> inreg %p, i32 inreg %i) "amdgpu-wave-priority-threshold"="3" {
	entry:			entry:
	%v1 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)			%v1 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
	%cond1 = icmp ne i32 %i, 0			%cond1 = icmp ne i32 %i, 0
	br i1 %cond1, label %a, label %c			br i1 %cond1, label %a, label %c

	a:			a:
	%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 1, i32 0)			%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 1, i32 0)
	%cond2 = icmp ne i32 %i, 1			%v20 = extractelement <2 x float> %v2, i32 0
				%v21 = extractelement <2 x float> %v2, i32 1
				%cond2 = fcmp ult float %v20, %v21
	br i1 %cond2, label %b, label %c			br i1 %cond2, label %b, label %c

	b:			b:
	ret <2 x float> %v2			ret <2 x float> %v2

	c:			c:
	%v3 = phi <2 x float> [%v1, %entry], [%v2, %a]			%v4 = fadd <2 x float> %v1, %v1
	%v4 = fadd <2 x float> %v1, %v3
	ret <2 x float> %v4			ret <2 x float> %v4
	}			}

	; CHECK-LABEL: loop:			; CHECK-LABEL: loop:
	; CHECK: {{.*}}: ; %entry			; CHECK: {{.*}}: ; %entry
	; CHECK: s_setprio 3			; CHECK: s_setprio 3
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: [[LOOP:.*]]: ; %loop			; CHECK: [[LOOP:.*]]: ; %loop
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: s_cbranch_scc1 [[LOOP]]			; CHECK: s_cbranch_scc1 [[LOOP]]
	; CHECK-NEXT: {{.*}}: ; %exit			; CHECK-NEXT: {{.*}}: ; %exit
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	define amdgpu_ps <2 x float> @loop(<4 x i32> inreg %p) {			define amdgpu_ps <2 x float> @loop(<4 x i32> inreg %p) "amdgpu-wave-priority-threshold"="2" {
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%i = phi i32 [0, %entry], [%i2, %loop]			%i = phi i32 [0, %entry], [%i2, %loop]
	%sum = phi <2 x float> [<float 0.0, float 0.0>, %entry], [%sum2, %loop]			%sum = phi <2 x float> [<float 0.0, float 0.0>, %entry], [%sum2, %loop]

	%i2 = add i32 %i, 1			%i2 = add i32 %i, 1
	[21 lines not shown]
	; CHECK {{.*}}: ; %exit			; CHECK {{.*}}: ; %exit
	; CHECK-NOT: s_setprio			; CHECK-NOT: s_setprio
	; CHECK: s_branch [[RET:.*]]			; CHECK: s_branch [[RET:.*]]
	; CHECK: [[ANOTHER_LOAD]]: ; %another_load			; CHECK: [[ANOTHER_LOAD]]: ; %another_load
	; CHECK: buffer_load_dwordx2			; CHECK: buffer_load_dwordx2
	; CHECK-NEXT: s_setprio 0			; CHECK-NEXT: s_setprio 0
	; CHECK: s_branch [[RET]]			; CHECK: s_branch [[RET]]
	; CHECK: [[RET]]:			; CHECK: [[RET]]:
	define amdgpu_ps <2 x float> @edge_split(<4 x i32> inreg %p, i32 inreg %x) {			define amdgpu_ps <2 x float> @edge_split(<4 x i32> inreg %p, i32 inreg %x) "amdgpu-wave-priority-threshold"="2" {
	entry:			entry:
	%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)			%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
	%cond = icmp ne i32 %x, 0			%cond = icmp ne i32 %x, 0
	br i1 %cond, label %loop, label %another_load			br i1 %cond, label %loop, label %another_load

	loop:			loop:
	%i = phi i32 [0, %entry], [%i2, %loop]			%i = phi i32 [0, %entry], [%i2, %loop]
	%mul = phi <2 x float> [%v, %entry], [%mul2, %loop]			%mul = phi <2 x float> [%v, %entry], [%mul2, %loop]

	%i2 = add i32 %i, 1			%i2 = add i32 %i, 1
	%mul2 = fmul <2 x float> %mul, %v			%mul2 = fmul <2 x float> %mul, %v

	%cond2 = icmp ult i32 %i2, 5			%cond2 = icmp ult i32 %i2, 5
	br i1 %cond2, label %loop, label %exit			br i1 %cond2, label %loop, label %exit

	exit:			exit:
	ret <2 x float> %mul2			ret <2 x float> %mul2

	another_load:			another_load:
	%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 1, i32 0)			%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 1, i32 0)
	%sum = fadd <2 x float> %v, %v2			%sum = fadd <2 x float> %v, %v2
	ret <2 x float> %sum			ret <2 x float> %sum
	}			}

				; CHECK-LABEL: valu_insts_threshold:
				; CHECK: s_setprio 3
				; CHECK: buffer_load_dwordx2
				; CHECK-NEXT: s_setprio 0
				; CHECK-COUNT-4: v_add_f32_e32
				; CHECK: s_cbranch_scc0 [[A:.*]]
				; CHECK: {{.*}}: ; %b
				; CHECK-NEXT: buffer_load_dwordx2
				; CHECK: s_branch [[END:.*]]
				; CHECK: [[A]]: ; %a
				; CHECK: s_branch [[END]]
				; CHECK: [[END]]:
				define amdgpu_ps <2 x float> @valu_insts_threshold(<4 x i32> inreg %p, i32 inreg %i) "amdgpu-wave-priority-threshold"="4" {
				%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
				%add = fadd <2 x float> %v, %v
				%add2 = fadd <2 x float> %add, %add

				%cond = icmp eq i32 %i, 0
				br i1 %cond, label %a, label %b

				a:
				ret <2 x float> %add2

				b:
				%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 1, i32 0, i32 0)
				%sub = fsub <2 x float> %add2, %v2
				ret <2 x float> %sub
				}

				; CHECK-LABEL: valu_insts_threshold2:
				; CHECK-NOT: s_setprio
				; CHECK: ; -- End function
				define amdgpu_ps <2 x float> @valu_insts_threshold2(<4 x i32> inreg %p, i32 inreg %i) "amdgpu-wave-priority-threshold"="5" {
				%v = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 0, i32 0, i32 0)
				%add = fadd <2 x float> %v, %v
				%add2 = fadd <2 x float> %add, %add

				%cond = icmp eq i32 %i, 0
				br i1 %cond, label %a, label %b

				a:
				ret <2 x float> %add2

				b:
				%v2 = call <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32> %p, i32 0, i32 1, i32 0, i32 0)
				%sub = fsub <2 x float> %add2, %v2
				ret <2 x float> %sub
				}

	declare <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32>, i32, i32, i32, i32) nounwind			declare <2 x float> @llvm.amdgcn.struct.buffer.load.v2f32(<4 x i32>, i32, i32, i32, i32) nounwind