This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Adjust wave priority based on VMEM instructions to avoid duty-cycling.
ClosedPublic

Authored by kosarev on Apr 22 2022, 4:22 AM.

Details

Summary

As older waves execute long sequences of VALU instructions, this may
prevent younger waves from calculating addresses and then issuing their
VMEM loads, which in turn can leave the VALU unit idle. This patch tries
to prevent this by temporarily raising the wave's priority.
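
For illustration, the intended shape of the transformation might be sketched like this (hypothetical assembly; priority values and exact placement are illustrative only, not taken from the patch):

```
s_setprio 3    ; raise wave priority while this wave still has
               ; VMEM loads to issue
...            ; address calculation and VMEM loads
s_setprio 0    ; drop back to normal priority once no further VMEM
               ; loads are reachable, so a long VALU-only tail does
               ; not starve younger waves
```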

A few notes and questions:

  • This intentionally avoids introducing any counting of VALU instructions, as at this moment it's not entirely clear how we would want to do that in the presence of branches and loops. However, the patch aims to make adding support for such counting as simple as possible by separating the identification of the VMEM loads we want to take into account from the rest of the logic.
  • The implementation assumes that one s_setprio 0 followed by another one is acceptable when edge splitting would be the only way to avoid this.
  • I understand we want this to be enabled by default, but I've left the pass disabled for now until it's in a production-ready state -- for the sake of not touching tests that should not normally be affected by the pass.
  • If the s_setprio instruction is not universally available, I guess we may want to make sure the selected target does actually support it?
  • What do we do if s_setprio instructions are already present in the input code? Would doing nothing be acceptable and reasonable in such a case?

Diff Detail

Event Timeline

kosarev created this revision. Apr 22 2022, 4:22 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 22 2022, 4:22 AM
kosarev requested review of this revision. Apr 22 2022, 4:22 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 22 2022, 4:22 AM
kosarev updated this revision to Diff 424435. Apr 22 2022, 4:27 AM

Fix Lint complaints.

kosarev edited the summary of this revision. (Show Details) Apr 22 2022, 4:56 AM
kosarev added reviewers: foad, dstuttard, tpr.
foad added inline comments. Apr 22 2022, 4:57 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

It's a bit more normal to put "Enable" options in AMDGPUTargetMachine, where you can use isPassEnabled.

123

Should bail out early if skipFunction(MF.getFunction()), to help with bisecting optimization problems.

foad added inline comments. Apr 22 2022, 5:12 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

This seems like overkill. Can you just use a DenseMap<const MachineBasicBlock *, bool> directly? Or maybe just a DenseSet<const MachineBasicBlock *>?

98

I think it would be neater to have this function take a MachineBasicBlock::iterator instead of MF, and have it automatically insert the created instruction (by calling one of the BuildMI overloads that does the insertion).

foad added inline comments. Apr 22 2022, 5:41 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
103–104

I think this deserves a comment along the lines of: Check that for every predecessor Pred that can reach a VMEM load, none of Pred's successors can reach a VMEM load.

108–109

I don't understand what this is for. If Pred can reach a VMEM load and it is in a loop, then it must surely have at least one successor that can also reach a VMEM load (namely a successor that is on a path around the loop that leads back to Pred), so we will return false anyway on line 111 below.

137

Could use:

if (any_of(MBB, [&](const MachineInstr &MI){ return isVMEMLoad(MI); }))
  Worklist.push_back(&MBB);

instead of an explicit loop, here and elsewhere - although it is a matter of taste.

tsymalla added inline comments. Apr 22 2022, 5:58 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

Probably invert that option so that setting it to true enables the pass

131

Can be moved to the beginning of the function.

160

Could you comment that section please?

164

Typo: priority

192

Could be using the reverse iterator

jpages added a subscriber: jpages. Apr 22 2022, 6:54 AM
arsenm added inline comments. Apr 22 2022, 7:10 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

Is it really worthwhile to do this for all blocks? Should there be some kind of memory instruction count threshold?

192

Probably want llvm::make_early_inc_range(llvm::reverse(*MBB))

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
386 (On Diff #424435)

Does it really need to run right here? Can you move it earlier to share the loop analysis?

kosarev updated this revision to Diff 424500. Apr 22 2022, 9:12 AM

Updated.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

Changed to use isPassEnabled(), thanks.

43

On the other hand, if we know we will need more fields there to count VALUs, then it's probably not worth rewriting it back and forth? (I don't mind either way.)

98

The inserting BuildMI() overloads seem to take both the MF/MBB and the iterator. Also, the aim was for this function to create the instruction and leave the client to choose how it wants to insert it, so we don't need to rewrite the function if we need to insertAfter() or something (which was already the case with some of the early versions of the patch).

103–104

Done.

108–109

Indeed. Removed. Thanks.

123

Done.

131

Moved out to AMDGPUTargetMachine.cpp.

137

Sure, done.

160

Done.

164

Fixed, thanks.

192

Yeah, I tried that, but MBB->rend()->getIterator() seems unable to return a valid iterator and crashes. I'm also not quite sure that would simplify the logic.

foad added inline comments. Apr 22 2022, 9:21 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

if we know we will need more fields there to count VALUs

I don't know that :)

In any case, why do you use a map of pointers to MBBInfo, instead of just a map of MBBInfo? That seems wasteful.

kosarev added inline comments. Apr 22 2022, 9:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

I understand there is going to be some counting of VALU instructions, but I'm not aware of any plans to have thresholds for VMEMs. I guess others may know better.

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
386 (On Diff #424435)

Jay's finding eliminates the need for the loop analysis altogether, but generally speaking, because this pass doesn't cause any major changes, we likely want it as late as possible.

kosarev added inline comments. Apr 22 2022, 10:11 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

I think the main desire was to have references to MBBInfo remain valid on insertions, but also to avoid the risk of DenseMap dealing with what may potentially become values of larger size. Will give it another thought on Monday. : )

kosarev added inline comments. Apr 25 2022, 3:43 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

So if I'm not wrong that DenseMap<> may reallocate values on insertion, that means some rather innocent-looking things like MBBInfos[A].NumSth + MBBInfos[B].NumSth are outlawed, and they are probably not the easiest things to catch when preparing changes or during review.

If we consider these few extra lines that prevent the risk a complication beyond immediate needs, then I don't mind removing them, but do we really think it's worth it?

foad added inline comments. Apr 25 2022, 4:04 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

I would prefer not to have the extra complication. Other reviewers might disagree.

foad added inline comments. Apr 25 2022, 4:09 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

Actually I don't understand your argument. If MBBInfos[A].NumSth + MBBInfos[B].NumSth is not safe, then how does adding a level of indirection and writing MBBInfos[A]->NumSth + MBBInfos[B]->NumSth make it any safer?

kosarev added inline comments. Apr 25 2022, 4:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

Not sure I understand how levels of indirection are relevant, but if we are still comparing using the MBBInfoSet::operator[] above with using DenseMap::operator[] directly, then I think what makes the first safer is that it doesn't return references to reallocatable storage. Meaning in MBBInfos[A].NumSth + MBBInfos[B].NumSth we don't depend on the order of evaluating the MBBInfoSet::operator[] calls and the structure dereferences.

foad added inline comments. Apr 25 2022, 4:54 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
41

If you keep this code, variable names should use CamelCase, like Info.

43

OK, I understand now. Sorry I had missed (or forgotten) that you had written your own operator[].

Personally I still prefer not to have the extra complication.

kosarev updated this revision to Diff 425153. Apr 26 2022, 2:00 AM

Updated to map blocks directly to info structures.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

OK, removed.

foad added inline comments. Apr 26 2022, 2:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

I think it only makes sense to run this pass on kernels (i.e. the entry point of a wave), not on subfunctions, so you should bail out if !AMDGPU::isKernel(MF.getFunction().getCallingConv()) or similar.

132

I could be wrong but I don't think kernels have a normal "return" at the end. Maybe change this to MBB.succ_empty() to make it a bit more generic?

foad added a comment. Apr 26 2022, 2:38 AM

The implementation assumes that one s_setprio 0 followed by another one is acceptable when edge splitting would be the only way to avoid this.

Sounds good to me.

If the s_setprio instruction is not universally available, I guess we may want to make sure the selected target does actually support it?

It is universally available.

foad added inline comments. Apr 26 2022, 2:44 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

A very minor point: you only lower priority in any predecessor if you can do it in all predecessors. Would it make sense to lower priority in some predecessors even if you can't do it in all of them?

kosarev updated this revision to Diff 425177. Apr 26 2022, 4:23 AM

Updated as suggested.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

Would AMDGPU::isEntryFunctionCC() be the right function for that?

132

Changed. For amdgpu_ps shaders in the test I do see return MIs, though they seem to be emitted just as ; return to shader part epilog comments.

137

Yeah, good question. My analysis here is that while we know it would cost us extra time executing the s_setprio 0 in MBB when control comes from one of those predecessors, I presently do not have any evidence at hand that handling it would help cases considered critical or at least important. The thinking therefore was that strategically we might prefer to let the problem, if there is any, reveal itself and thereby get a concrete reproducer, rather than speculate early and never know for sure whether it's actually necessary.

I should probably clarify that I see this patch as an attempt to lay reliable groundwork for developing our understanding of the nature of the original issue (which I guess may take some time of testing and collecting relevant use cases), and not as an implementation that tries to do the best thing for all theoretically possible cases right away.

foad accepted this revision. Apr 26 2022, 5:14 AM

LGTM, thanks!

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

Oh yes, isEntry* looks more appropriate than isKernel*.

137

OK!

This revision is now accepted and ready to land. Apr 26 2022, 5:14 AM