Diff Detail
- Repository: rG LLVM Github Monorepo

Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Line 135: You could early opt-out when VALUInstsThreshold == 0 (at the beginning), is that correct?
Updated as suggested.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 30–31: We don't know it yet, I'm afraid. For the couple of real use cases we have for the issue this pass tries to address, counting instructions looks sufficient. Given that proper cycle counting might not be trivial, we may be at risk of over-engineering here.
- Line 135: We could, but I'm not sure I know how that might be useful in practice.
- Line 140: Yes, that's not good. Combined the two new loops into one and simplified the related code. Thanks.
- Lines 198–199: Well, the instruction we create here is not just an auxiliary value that we coincidentally happen to use in both cases, if that's what you mean. That instruction must be spent in all cases, and we do want it to be the same instruction.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 198–199: But you can construct it in the right place rather than constructing the instruction and then inserting it after. You have the insert point you want, so you can just construct it there? It's unusual to need insert/insertAfter.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 198–199: I'm not sure I see how this answers the point. Granted, constructing the instruction at the point of insertion is possible, but as I said, in this case we have reasons to create it in a single place.
I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Line 115: If you're ignoring loops, can you describe succinctly what you *are* counting? Is it something like the minimum number of VALU instructions along any path from the start of the function to the VMEM load in question?
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 198–199: I agree with Matt that it would be more normal to construct-and-insert at the same time, though I don't feel very strongly about it. I think I suggested before that you could pass the insertion point into BuildSetprioMI. In this case the insertion point would be something like `MBBInfos[MBB].LastVMEMLoad ? std::next(MBBInfos[MBB].LastVMEMLoad) : MBB->begin()`.
Right, that's the hardest part of it, isn't it. Because we weren't given many use cases, the idea was to start by replicating more or less the same counting logic we have in the other compiler, and then adjust things using feedback from people who can give it some proper testing.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Line 115: Good point. Amended the comment.
- Lines 198–199: OK, changed.
I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".
Instead of MBBInfo::LastVMEMLoad, I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean indicating whether a VMEM load may have been issued by the end of each basic block.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 30–31: Could this also be configured via a function attribute? Those are less problematic than command-line options in a driver context.
- Lines 155–156: I believe NumVALUInsts should be reset to 0 here. It should probably also be reset on a number of other events, in particular DS instructions. The reasoning is that we want to lower priority just before running a long dense block of VALU, so that other waves have a better chance of running address-calculation VALU.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 155–156: So would something like this work for the purpose? `if (SIInstrInfo::isVALU(MI)) ++NumVALUInsts; else NumVALUInsts = 0;`
Updated to count VALU instructions that follow VMEM loads and to add
support for threshold function attributes.
I think knowing whether we are at a point past a VMEM load wouldn't help, as there may be another VMEM load further down the control flow. There is a similar problem with the last sequence break: at the point where the sequence finally becomes 'long enough', there can be several points at which it began.
If I'm right that there is no harm in lowering the priority immediately after VMEM loads, it still seems easier implementation-wise to track them rather than other things. The updated version just makes sure we only consider VMEM loads that are followed by VALU sequences of the required length. The rest of the logic remains the same.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 30–31: Done.
- Lines 155–156: Updated to break the sequence on DS instructions.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 170–171: Why are you only counting if AtStart? The idea was: find places where a VMEM load is followed by a long sequence of dense VALU, and lower priority between that VMEM load and the dense VALU.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 168–172: If the winner is MaxNumVALUInstsInMiddle, I believe the relevant VMEM load should be the one that appeared just before the corresponding sequence of VALU, not the last one. Instead of Info.LastVMEMLoad, we'd want Info.LastVMEMLoadBeforeLongVALU (feel free to think of a better name), which is only set here at the end if the threshold is exceeded. The loop above would keep track of the most recently seen VMEM load as well as the one corresponding to the longest in-the-middle VALU sequence so far.
- Lines 170–171: Thank you for the offline clarification, I understand this better now.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 168–172: The ticket says the priority is supposed to be lowered after the last VMEM load, and this updated implementation was prepared on the assumption that counting VALUs doesn't affect that principle. If that point stands, then I understand we should not be interested in any VALU sequences preceding VMEM loads, even if they are long enough. In terms of the code, since we reset MaxNumVALUInstsInMiddle every time we run into a VMEM load, I'm not sure I see how an in-the-middle sequence followed by a VMEM load could possibly be the winner. Info.LastVMEMLoad storing the actual last VMEM load in the block should not be a problem, because it is the Info.MayReachVMEMLoad flag that the following analysis takes into account. Speaking of naming, I'm not perfectly happy with the name of that flag, as it actually means 'may reach any of the last VMEM loads that precede a long-enough sequence of VALU instructions'. I would appreciate it if anyone could suggest a better alternative of a practical length.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
- Lines 168–172: I did think the policy was supposed to be "lower priority before the longest sequence of VALU instructions, if the sequence length crosses a certain threshold". But we can start with "lower after the last VMEM load, if the subsequent VALU sequence is long enough". So then you're right, and the code is okay as-is.
@nhaehnle Are we fine with accepting and submitting this? Or do we want to give this some testing beforehand? I'm also not quite sure whether the pass is supposed to be enabled by default.
Should we be counting cycles instead of instructions?