This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add perf hints to functions
ClosedPublic

Authored by rampitec on May 16 2018, 6:12 PM.

Details

Reviewers

yaxunl
vpykhtin
arsenm

Commits

rG1c538423dc24: [AMDGPU] Add perf hints to functions
rL333289: [AMDGPU] Add perf hints to functions

Summary

This is an adoption of the HSAIL perfhint pass. Two types of hints are produced:

Function is memory bound.
Kernel can use wave limiter.

Currently these hints are used in the scheduler. If a function is suspected
to be memory bound we allow occupancy to decrease to 4 waves in the course
of scheduling.
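
Illustration (not part of the patch): a minimal sketch of the scheduler policy described in the summary, assuming the hint simply lowers the occupancy floor to 4 waves. `PerfHints`, `minAcceptableOccupancy` and `keepSchedule` are hypothetical names and are not taken from GCNSchedStrategy.cpp.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the per-function hint produced by the analysis.
struct PerfHints {
  bool MemoryBound;
};

// Taken from the summary: occupancy may decrease to 4 waves when memory bound.
constexpr unsigned MemBoundMinOccupancy = 4;

// Lowest occupancy the scheduler accepts for a new schedule. Without the hint
// the scheduler insists on keeping the occupancy it already had.
unsigned minAcceptableOccupancy(const PerfHints &Hints, unsigned OldOccupancy) {
  if (Hints.MemoryBound)
    return std::min(OldOccupancy, MemBoundMinOccupancy);
  return OldOccupancy;
}

// A new schedule is kept only if it does not fall below the accepted floor;
// otherwise the scheduler would revert it.
bool keepSchedule(const PerfHints &Hints, unsigned OldOccupancy,
                  unsigned NewOccupancy) {
  return NewOccupancy >= minAcceptableOccupancy(Hints, OldOccupancy);
}
```

Under these assumptions, a schedule that drops occupancy from 8 to 5 is kept for a memory-bound function and reverted otherwise.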

Diff Detail

Event Timeline

rampitec created this revision. May 16 2018, 6:12 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. May 16 2018, 6:12 PM

t-tye added inline comments. May 16 2018, 9:31 PM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Should this be done at the beginning of the visit to ensure that it terminates for mutually recursive functions?
388 ↗	(On Diff #147223)	indentation
392 ↗	(On Diff #147223)	Does having the std::move() prevent named return value optimization (which can happen for the return above)? Returning a prvalue (eg by directly returning a constructor) would get guaranteed copy elision in C++17.
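
Illustration (not part of the patch): a small sketch of the copy-elision point raised above. `Tracked` and the three factory functions are hypothetical; the move counter only makes the difference observable.

```cpp
#include <cassert>
#include <utility>

// Hypothetical type that counts how often it is copied or moved.
struct Tracked {
  static int Moves;
  static int Copies;
  Tracked() = default;
  Tracked(const Tracked &) { ++Copies; }
  Tracked(Tracked &&) noexcept { ++Moves; }
};
int Tracked::Moves = 0;
int Tracked::Copies = 0;

// Returning a prvalue: C++17 guarantees that no copy or move happens.
Tracked makePrvalue() { return Tracked(); }

// Returning a named local: NRVO is permitted here, but not guaranteed.
Tracked makeNamed() {
  Tracked T;
  return T;
}

// std::move turns the operand into an xvalue, so elision is not allowed and
// the move constructor has to run.
Tracked makeMoved() {
  Tracked T;
  return std::move(T);
}

// Resets the counter, calls the factory once, and reports the moves it cost.
int movesFor(Tracked (*Make)()) {
  Tracked::Moves = 0;
  Tracked Result = Make();
  (void)Result;
  return Tracked::Moves;
}
```

The `std::move` variant always pays for one move, while the plain named return leaves the compiler free to elide it.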

How can UMDs disable this optimization?

Are there cases where this decreases performance?

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
13 ↗	(On Diff #147223)	Did you mean "cache thrashing"?

In D46992#1102499, @mareko wrote:

How can UMDs disable this optimization?

Are there cases where this decreases performance?

This is an analysis. The optimization itself must be done in the runtime. The OpenCL RT used to control it with an environment variable; the graphics RT never did. At any rate, if you know your ideal occupancy, it is better to set the amdgpu-waves-per-eu attribute.

The only optimization implemented here based on the analysis is in the scheduler. In practice there is no way for a memory-intensive program to benefit from an occupancy higher than 4; usually it is lower. However, the impact of the optimization is to let the scheduler work where previously it just reverted the schedule if occupancy had decreased. Therefore the natural way to return to the old behavior after this change is to disable the scheduler (-enable-misched=0), which will result in the same code as before if this condition is triggered.

arsenm added inline comments. May 17 2018, 3:17 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
336–337 ↗	(On Diff #147223)	You're not really supposed to use attributes to pass information through to others, although we do this in a few places to hack around isel limitations. Can you make this an analysis pass instead which returns yes / no at the point you actually need this?
382–383 ↗	(On Diff #147223)	Should handle other memory operations too, like atomics and intrinsics. There is already a wrapper somewhere which should find the pointer operand for all of these operations

Addressed review comments.
Pass is converted to analysis.

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Thank you!
392 ↗	(On Diff #147223)	After the port it is trivially copyable, so move is not required any more.

arsenm added inline comments. May 18 2018, 1:25 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
1 ↗	(On Diff #147223)	Update comment
11–13 ↗	(On Diff #147223)	Comment needs update. Maybe add a todo that this should be a machine analysis?
166 ↗	(On Diff #147223)	What does this mean exactly by indirect access? This seems to me like it's reimplementing something like GetUnderlyingObject
259 ↗	(On Diff #147223)	Probably should check for CallSite to cover the possible future case of InvokeInsts
260–261 ↗	(On Diff #147223)	!Callee
271 ↗	(On Diff #147223)	Extra space
277–278 ↗	(On Diff #147223)	isLegalAddressingMode (although at this point this should probably be a machine pass, but I understand that's more work to rewrite)
289 ↗	(On Diff #147223)	Can you add a test for this case
397 ↗	(On Diff #147223)	There's also CONSTANT_ADDRESS_32BIT
lib/Target/AMDGPU/SIDefines.h
538–539 ↗	(On Diff #147223)	Should be able to also remove this
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184	It seems like there's no reason to actually put this code in SIMachineFunctionInfo. Can you just do this directly in the AsmPrinter where you emit this?

yaxunl added inline comments. May 18 2018, 4:08 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
166 ↗	(On Diff #147223)	Indirect access here means something like a[b[i]], i.e., the index of the array is loaded from memory. It usually results in random access instead of streaming access. It could probably have a better name.
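
Illustration (not part of the patch): the two access patterns contrasted above, written as plain host-side C++. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Streaming access: the address of each load depends only on the loop index.
int sumStreaming(const std::vector<int> &A) {
  int Sum = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum += A[I];
  return Sum;
}

// Indirect access: the index into A is itself loaded from memory (A[B[I]]),
// so consecutive iterations may touch unrelated addresses.
int sumIndirect(const std::vector<int> &A, const std::vector<std::size_t> &B) {
  int Sum = 0;
  for (std::size_t I = 0; I < B.size(); ++I)
    Sum += A[B[I]];
  return Sum;
}
```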

rampitec added inline comments. May 18 2018, 9:39 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
11–13 ↗	(On Diff #147223)	I do not think it has to be machine IR pass. It will be really difficult to perform this analysis on machine IR.
289 ↗	(On Diff #147223)	Only as opt test. BE will fail on recursion.
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184	By the time it is needed function's IR is already destroyed. Note, it is not only needed from printer, it is also checked in the scheduler.

rampitec added inline comments. May 18 2018, 9:40 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
260–261 ↗	(On Diff #147223)	This file does not exist in the patch. You seem to comment on the old version somehow.

rampitec updated this revision to Diff 147546. May 18 2018, 10:31 AM

rampitec marked 5 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Actually since it is now an on-demand analysis I cannot do it even with opt. We will need to generally fix recursion handling in the BE, it is not a problem specific to this patch.
lib/Target/AMDGPU/SIDefines.h
538–539 ↗	(On Diff #147223)	It was removed when pass was converted to analysis. Please check the current patch.
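
Illustration (not part of the patch): a sketch of the termination concern discussed for line 289. A function is recorded before its callees are visited, so mutual recursion cannot loop forever. The string-keyed `CallGraph` is a hypothetical stand-in for the real IR traversal.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: function name -> names of the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

void visit(const CallGraph &CG, const std::string &Name,
           std::set<std::string> &Visited) {
  // Mark the function at the beginning of the visit. A second visit, for
  // example through a recursive cycle, returns immediately.
  if (!Visited.insert(Name).second)
    return;
  auto It = CG.find(Name);
  if (It == CG.end())
    return;
  for (const std::string &Callee : It->second)
    visit(CG, Callee, Visited);
}

// Number of distinct functions reachable from Root, including Root itself.
std::size_t countReachable(const CallGraph &CG, const std::string &Root) {
  std::set<std::string> Visited;
  visit(CG, Root, Visited);
  return Visited.size();
}
```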

arsenm added inline comments. May 18 2018, 3:13 PM

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
258–262	isLegalAddressingMode. I'm not sure I understand why this pass is doing most of what it's doing. Why does the addressing mode match matter for determining if the function is probably memory bound? With a machine pass you would have a much more exact idea of the number of memory operations really being executed

rampitec added inline comments. May 18 2018, 3:35 PM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
277–278 ↗	(On Diff #147223)	There is no TLI or subtarget here yet.
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
258–262	It matters because we are trying to estimate the ratio of memory to ALU instructions. A foldable GEP does not result in an instruction. In fact this is a rough estimate; a completely correct answer is not needed. On machine IR, in turn, it would be very difficult to track pointers.
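
Illustration (not part of the patch): a sketch of how such a ratio could be turned into the two hints. The counters and the default values (thresholds of 50, weights of 1000) mirror the options declared in the diff; the way they are combined here, as a percentage of all instructions computed in integer arithmetic, is an assumption, since the corresponding functions are not shown in this excerpt.

```cpp
#include <cassert>

// Mirrors the counters of FuncInfo in the patch.
struct Counts {
  unsigned MemInstCount;
  unsigned InstCount;
  unsigned IAMInstCount; // indirect access memory instructions
  unsigned LSMInstCount; // large stride memory instructions
};

// Defaults as declared by the cl::opt options in the diff.
constexpr unsigned long long MemBoundThreshold = 50;
constexpr unsigned long long LimitWaveThreshold = 50;
constexpr unsigned long long IAWeight = 1000;
constexpr unsigned long long LSWeight = 1000;

// Assumed formula: memory instructions as a percentage of all instructions.
bool isMemBound(const Counts &C) {
  if (C.InstCount == 0)
    return false;
  return C.MemInstCount * 100ULL / C.InstCount > MemBoundThreshold;
}

// Assumed formula: indirect and large-stride accesses are weighted heavily,
// so even a few of them can push the kernel over the threshold.
bool needsWaveLimiter(const Counts &C) {
  if (C.InstCount == 0)
    return false;
  unsigned long long Weighted = C.MemInstCount + C.IAMInstCount * IAWeight +
                                C.LSMInstCount * LSWeight;
  return Weighted * 100ULL / C.InstCount > LimitWaveThreshold;
}
```

Integer arithmetic is used on purpose here, in line with the later review remark about avoiding FP computations.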

rampitec updated this revision to Diff 147628. May 18 2018, 5:59 PM

rampitec marked 2 inline comments as done.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

In D46992#1105635, @mareko wrote:

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you can see, the pass processes anything that can be cast to a load, store, atomic, or memory intrinsic. Everything else is considered an ordinary instruction. For example, if you have an image in mind, it is conservatively not considered a memory instruction.

In D46992#1105636, @rampitec wrote:

Can you give an example? What is a memory instruction without a pointer? As you can see, the pass processes anything that can be cast to a load, store, atomic, or memory intrinsic. Everything else is considered an ordinary instruction. For example, if you have an image in mind, it is conservatively not considered a memory instruction.

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

In D46992#1105803, @mareko wrote:

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

Ok, that's what I meant: it is considered an ordinary instruction. It is simply not covered by the pass, so there is no impact, primarily because no measurements were done, no workloads were analyzed, and no statistics were collected to attempt optimizations of that kind on the compute side. The gfx side may have data collected, but the pass will do nothing for that kind of load.

Switched to use isLegalAddressingMode in GEP processing.

arsenm added inline comments. May 21 2018, 2:50 PM

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
37	Naming convention for existing flags seem to all use the full word threshold (same for the rest)
160	Seems like a SmallSet?
241–243	Run clang-format
294	Move this up to avoid calling the same find twice? visit can also return the inserted iterator
316	I think there's a policy of generally avoiding FP computations
321	Ditto
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h
2	c++ mode comment
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
182	Missing space

rampitec updated this revision to Diff 147886. May 21 2018, 3:10 PM

rampitec marked 8 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
294	It is not the same, it is one after visit() call. But yes, changing to visit to return the iterator.
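
Illustration (not part of the patch): a sketch of the suggestion for line 294, where `visit` hands back the iterator produced by the insertion so the caller does not need a second lookup. `std::map` keyed by name is a hypothetical stand-in for the `ValueMap` used in the patch.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

struct FuncInfo {
  unsigned InstCount = 0;
};

using FuncInfoMap = std::map<std::string, FuncInfo>;

// Inserts an entry for Name if there is none yet, fills it in on first visit
// only, and returns the iterator so callers can reuse it directly.
FuncInfoMap::iterator visit(FuncInfoMap &FIM, const std::string &Name,
                            unsigned InstCount) {
  std::pair<FuncInfoMap::iterator, bool> Inserted =
      FIM.insert(std::make_pair(Name, FuncInfo()));
  if (Inserted.second)
    Inserted.first->second.InstCount = InstCount;
  return Inserted.first;
}
```

A caller can then write `auto It = visit(FIM, Name, N);` and read `It->second` without calling `find` again.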

Small cleanup after last changes.

rampitec added a reviewer: arsenm. May 23 2018, 1:37 PM

LGTM. Just a hint: whenever you use "auto X = ...", it is worth specifying explicitly whether X is a pointer or a reference. It not only saves you from an accidental temporary object created by copy but also makes the program easier to read.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178	I would use "auto &Resolver" to emphasize Resolver is a reference not a temp object

rampitec marked an inline comment as done. May 25 2018, 9:56 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178	Resolver is a pointer, not an object or reference.
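
Illustration (not part of the patch): a small sketch of the `auto` hint above. `Resolver` here is a hypothetical struct, unrelated to the type used in SIMachineFunctionInfo.cpp.

```cpp
#include <cassert>

// Hypothetical type; only the deduction rules matter here.
struct Resolver {
  int Hits = 0;
};

Resolver &getResolverRef(Resolver &R) { return R; }
Resolver *getResolverPtr(Resolver &R) { return &R; }

int demo() {
  Resolver R;
  auto Copy = getResolverRef(R); // plain auto drops the reference: a copy
  auto &Ref = getResolverRef(R); // auto & keeps it an alias of R
  auto *Ptr = getResolverPtr(R); // auto * documents that this is a pointer

  Copy.Hits += 100; // changes only the temporary copy
  Ref.Hits += 1;    // changes R
  Ptr->Hits += 1;   // changes R
  return R.Hits;
}
```

Only the writes through `Ref` and `Ptr` reach `R`, so `demo()` returns 2.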

Rebase to master.
Moved info from SIMachineFunctionInfo into its parent AMDGPUMachineFunction since SI/R600 is not clearly separated in AMDGPUAsmPrinter anymore.

vpykhtin accepted this revision. May 25 2018, 10:16 AM

This revision is now accepted and ready to land. May 25 2018, 10:16 AM

Closed by commit rL333289: [AMDGPU] Add perf hints to functions (authored by rampitec). May 25 2018, 10:29 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
lib/Target/AMDGPU/AMDGPU.h                                    3 lines
lib/Target/AMDGPU/AMDGPUAsmPrinter.h                          4 lines
lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp                        14 lines
lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp                      3 lines
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h                    55 lines
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp                  404 lines
lib/Target/AMDGPU/CMakeLists.txt                              1 line
lib/Target/AMDGPU/GCNSchedStrategy.cpp                        12 lines
lib/Target/AMDGPU/SIMachineFunctionInfo.h                     14 lines
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp                   12 lines
test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll         10 lines
test/CodeGen/AMDGPU/perfhint.ll                               85 lines

Diff 147628

lib/Target/AMDGPU/AMDGPU.h

[127 lines not shown]
 extern char &SIFixWWMLivenessID;

 void initializeAMDGPUSimplifyLibCallsPass(PassRegistry &);
 extern char &AMDGPUSimplifyLibCallsID;

 void initializeAMDGPUUseNativeCallsPass(PassRegistry &);
 extern char &AMDGPUUseNativeCallsID;

+void initializeAMDGPUPerfHintAnalysisPass(PassRegistry &);
+extern char &AMDGPUPerfHintAnalysisID;
+
 // Passes common to R600 and SI
 FunctionPass *createAMDGPUPromoteAlloca();
 void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
 extern char &AMDGPUPromoteAllocaID;

 Pass *createAMDGPUStructurizeCFGPass();
 FunctionPass *createAMDGPUISelDag(
   TargetMachine *TM = nullptr,
[124 lines not shown]

lib/Target/AMDGPU/AMDGPUAsmPrinter.h

[25 lines not shown]
 #include <memory>
 #include <string>
 #include <vector>

 namespace llvm {

 class AMDGPUTargetStreamer;
 class MCOperand;
+class SIMachineFunctionInfo;
 class SISubtarget;

 class AMDGPUAsmPrinter final : public AsmPrinter {
 private:
   // Track resource usage for callee functions.
   struct SIFunctionResourceInfo {
     // Track the number of explicitly used VGPRs. Special registers reserved at
     // the end are tracked separately.
[98 lines not shown]
   void EmitProgramInfoR600(const MachineFunction &MF);
   void EmitProgramInfoSI(const MachineFunction &MF,
                          const SIProgramInfo &KernelInfo);
   void EmitPALMetadata(const MachineFunction &MF,
                        const SIProgramInfo &KernelInfo);
   void emitCommonFunctionComments(uint32_t NumVGPR,
                                   uint32_t NumSGPR,
                                   uint64_t ScratchSize,
-                                  uint64_t CodeSize);
+                                  uint64_t CodeSize,
+                                  const SIMachineFunctionInfo* MFI);

 public:
   explicit AMDGPUAsmPrinter(TargetMachine &TM,
                             std::unique_ptr<MCStreamer> Streamer);

   StringRef getPassName() const override;

   const MCSubtargetInfo* getSTI() const;
[51 lines not shown]

lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

[275 lines not shown; context: void AMDGPUAsmPrinter::readPALMetadata(Module &M) {]
   }
 }

 // Print comments that apply to both callable functions and entry points.
 void AMDGPUAsmPrinter::emitCommonFunctionComments(
   uint32_t NumVGPR,
   uint32_t NumSGPR,
   uint64_t ScratchSize,
-  uint64_t CodeSize) {
+  uint64_t CodeSize,
+  const SIMachineFunctionInfo *MFI) {
   OutStreamer->emitRawComment(" codeLenInByte = " + Twine(CodeSize), false);
   OutStreamer->emitRawComment(" NumSgprs: " + Twine(NumSGPR), false);
   OutStreamer->emitRawComment(" NumVgprs: " + Twine(NumVGPR), false);
   OutStreamer->emitRawComment(" ScratchSize: " + Twine(ScratchSize), false);
+  OutStreamer->emitRawComment(" MemoryBound: " + Twine(MFI->isMemoryBound()),
+                              false);
 }

 bool AMDGPUAsmPrinter::runOnMachineFunction(MachineFunction &MF) {
   CurrentProgramInfo = SIProgramInfo();

   const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();

   // The starting address of all shader programs must be 256 bytes aligned.
[38 lines not shown; context: bool AMDGPUAsmPrinter::runOnMachineFunction(MachineFunction &MF) {]
   EmitFunctionBody();

   if (isVerbose()) {
     MCSectionELF *CommentSection =
       Context.getELFSection(".AMDGPU.csdata", ELF::SHT_PROGBITS, 0);
     OutStreamer->SwitchSection(CommentSection);

     if (STM.getGeneration() >= AMDGPUSubtarget::SOUTHERN_ISLANDS) {
+      const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
+
       if (!MFI->isEntryFunction()) {
         OutStreamer->emitRawComment(" Function info:", false);
         SIFunctionResourceInfo &Info = CallGraphResourceInfo[&MF.getFunction()];
         emitCommonFunctionComments(
           Info.NumVGPR,
           Info.getTotalNumSGPRs(MF.getSubtarget<SISubtarget>()),
           Info.PrivateSegmentSize,
-          getFunctionCodeSize(MF));
+          getFunctionCodeSize(MF), MFI);
         return false;
       }

       OutStreamer->emitRawComment(" Kernel info:", false);
       emitCommonFunctionComments(CurrentProgramInfo.NumVGPR,
                                  CurrentProgramInfo.NumSGPR,
                                  CurrentProgramInfo.ScratchSize,
-                                 getFunctionCodeSize(MF));
+                                 getFunctionCodeSize(MF), MFI);

       OutStreamer->emitRawComment(
         " FloatMode: " + Twine(CurrentProgramInfo.FloatMode), false);
       OutStreamer->emitRawComment(
         " IeeeMode: " + Twine(CurrentProgramInfo.IEEEMode), false);
       OutStreamer->emitRawComment(
         " LDSByteSize: " + Twine(CurrentProgramInfo.LDSSize) +
         " bytes/workgroup (compile time only)", false);
[12 lines not shown; context: if (STM.getGeneration() >= AMDGPUSubtarget::SOUTHERN_ISLANDS) {]

       OutStreamer->emitRawComment(
         " ReservedVGPRFirst: " + Twine(CurrentProgramInfo.ReservedVGPRFirst),
         false);
       OutStreamer->emitRawComment(
         " ReservedVGPRCount: " + Twine(CurrentProgramInfo.ReservedVGPRCount),
         false);

+      OutStreamer->emitRawComment(
+        " WaveLimiterHint : " + Twine(MFI->needsWaveLimiter()), false);
+
       if (MF.getSubtarget<SISubtarget>().debuggerEmitPrologue()) {
         OutStreamer->emitRawComment(
           " DebuggerWavefrontPrivateSegmentOffsetSGPR: s" +
           Twine(CurrentProgramInfo.DebuggerWavefrontPrivateSegmentOffsetSGPR), false);
         OutStreamer->emitRawComment(
           " DebuggerPrivateSegmentBufferSGPR: s" +
           Twine(CurrentProgramInfo.DebuggerPrivateSegmentBufferSGPR), false);
       }
[898 lines not shown]

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

[10 lines not shown]
 /// Defines an instruction selector for the AMDGPU target.
 //
 //===----------------------------------------------------------------------===//

 #include "AMDGPU.h"
 #include "AMDGPUArgumentUsageInfo.h"
 #include "AMDGPUISelLowering.h" // For AMDGPUISD
 #include "AMDGPUInstrInfo.h"
+#include "AMDGPUPerfHintAnalysis.h"
 #include "AMDGPURegisterInfo.h"
 #include "AMDGPUSubtarget.h"
 #include "AMDGPUTargetMachine.h"
 #include "SIDefines.h"
 #include "SIISelLowering.h"
 #include "SIInstrInfo.h"
 #include "SIMachineFunctionInfo.h"
 #include "SIRegisterInfo.h"
[52 lines not shown; context: explicit AMDGPUDAGToDAGISel(TargetMachine *TM = nullptr,]
       : SelectionDAGISel(*TM, OptLevel) {
     AMDGPUASI = AMDGPU::getAMDGPUAS(*TM);
     EnableLateStructurizeCFG = AMDGPUTargetMachine::EnableLateStructurizeCFG;
   }
   ~AMDGPUDAGToDAGISel() override = default;

   void getAnalysisUsage(AnalysisUsage &AU) const override {
     AU.addRequired<AMDGPUArgumentUsageInfo>();
+    AU.addRequired<AMDGPUPerfHintAnalysis>();
     AU.addRequired<DivergenceAnalysis>();
     SelectionDAGISel::getAnalysisUsage(AU);
   }

   bool runOnMachineFunction(MachineFunction &MF) override;
   void Select(SDNode *N) override;
   StringRef getPassName() const override;
   void PostprocessISelDAG() override;
[141 lines not shown; context: bool SelectADDRVTX_READ(SDValue Addr, SDValue &Base,]
                           SDValue &Offset) override;
 };

 } // end anonymous namespace

 INITIALIZE_PASS_BEGIN(AMDGPUDAGToDAGISel, "isel",
                       "AMDGPU DAG->DAG Pattern Instruction Selection", false, false)
 INITIALIZE_PASS_DEPENDENCY(AMDGPUArgumentUsageInfo)
+INITIALIZE_PASS_DEPENDENCY(AMDGPUPerfHintAnalysis)
 INITIALIZE_PASS_END(AMDGPUDAGToDAGISel, "isel",
                     "AMDGPU DAG->DAG Pattern Instruction Selection", false, false)

 /// This pass converts a legalized DAG into a AMDGPU-specific
 // DAG, ready for instruction scheduling.
 FunctionPass *llvm::createAMDGPUISelDag(TargetMachine *TM,
                                         CodeGenOpt::Level OptLevel) {
   return new AMDGPUDAGToDAGISel(TM, OptLevel);
[1,995 lines not shown]

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h

This file was added.

				//===- AMDGPUPerfHintAnalysis.cpp - analysis of functions memory traffic --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief Analyzes if a function is potentially memory bound and if a
				/// kernel may benefit from limiting number of waves to reduce cache thrashing.
				///
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H
				#define LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H
				#include "llvm/IR/ValueMap.h"
				#include "llvm/Pass.h"

				namespace llvm {

				struct AMDGPUPerfHintAnalysis : public FunctionPass {
				static char ID;

				public:
				AMDGPUPerfHintAnalysis() : FunctionPass(ID) {}

				bool runOnFunction(Function &F) override;

				void getAnalysisUsage(AnalysisUsage &AU) const {
				AU.setPreservesAll();
				}

				bool isMemoryBound(const Function *F) const;

				bool needsWaveLimiter(const Function *F) const;

				struct FuncInfo {
				unsigned MemInstCount;
				unsigned InstCount;
				unsigned IAMInstCount; // Indirect access memory instruction count
				unsigned LSMInstCount; // Large stride memory instruction count
				FuncInfo() : MemInstCount(0), InstCount(0), IAMInstCount(0),
				LSMInstCount(0) {}
				};

				typedef ValueMap<const Function*, FuncInfo> FuncInfoMap;

				private:

				FuncInfoMap FIM;
				};
				} // namespace llvm
				#endif // LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp

This file was added.

				//===- AMDGPUPerfHintAnalysis.cpp - analysis of functions memory traffic --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief Analyzes if a function is potentially memory bound and if a
				/// kernel may benefit from limiting number of waves to reduce cache thrashing.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUPerfHintAnalysis.h"
				#include "llvm/ADT/DenseSet.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/ValueMap.h"
				#include "llvm/Support/CommandLine.h"
				#include "Utils/AMDGPUBaseInfo.h"

				using namespace llvm;

				#define DEBUG_TYPE "amdgpu-perf-hint"

				static cl::opt<float> MemBoundThresh("amdgpu-membound-thresh",
				cl::init(50),
				cl::Hidden,
				cl::value_desc("fp value"),
				cl::desc("Function mem bound threshold"));

				static cl::opt<float> LimitWaveThresh("amdgpu-limit-wave-thresh",
				cl::init(50),
				cl::Hidden,
				cl::value_desc("fp value"),
				cl::desc("Kernel limit wave threshold"));

				static cl::opt<float> IAWeight("amdgpu-indirect-access-weight",
				cl::init(1000),
				cl::Hidden,
				cl::value_desc("fp value"),
				cl::desc("Indirect access memory instruction weight"));

				static cl::opt<float> LSWeight("amdgpu-large-stride-weight",
				cl::init(1000),
				cl::Hidden,
				cl::value_desc("fp value"),
				cl::desc("Large stride memory access weight"));

				static cl::opt<unsigned> LargeStrideThresh("amdgpu-large-stride-thresh",
				cl::init(64),
				cl::Hidden,
				cl::value_desc("int value"),
				cl::desc("Large stride memory access threshold"));

				STATISTIC(NumMemBound, "Number of functions marked as memory bound");
				STATISTIC(NumLimitWave, "Number of functions marked as needing limit wave");

				char llvm::AMDGPUPerfHintAnalysis::ID = 0;
				char &llvm::AMDGPUPerfHintAnalysisID = AMDGPUPerfHintAnalysis::ID;

				INITIALIZE_PASS(AMDGPUPerfHintAnalysis, DEBUG_TYPE,
				"Analysis if a function is memory bound", true, true)

				namespace {

				struct AMDGPUPerfHint {
				friend AMDGPUPerfHintAnalysis;

				public:
				AMDGPUPerfHint(AMDGPUPerfHintAnalysis::FuncInfoMap &FIM_)
				: FIM(FIM_), DL(nullptr) {}

				void runOnFunction(Function &F);

				private:

				struct MemAccessInfo {
				const Value *V;
				const Value *Base;
				int64_t Offset;
				MemAccessInfo() : V(nullptr), Base(nullptr), Offset(0){}
				bool isLargeStride(MemAccessInfo &Reference) const;
				#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
				Printable print() const {
				return Printable([this](raw_ostream &OS) {
				OS << "Value: " << *V << '\n'
				<< "Base: " << *Base << " Offset: " << Offset << '\n';
				});
				}
				#endif
				};

				MemAccessInfo makeMemAccessInfo(Instruction *) const;

				MemAccessInfo LastAccess; // Last memory access info

				AMDGPUPerfHintAnalysis::FuncInfoMap &FIM;

				const DataLayout *DL;
				AMDGPUAS AS;

				void visit(const Function &F);
				static bool isMemBound(const AMDGPUPerfHintAnalysis::FuncInfo &F);
				static bool needLimitWave(const AMDGPUPerfHintAnalysis::FuncInfo &F);

				bool isIndirectAccess(const Instruction *Inst) const;

				/// Check if the instruction is large stride.
				/// The purpose is to identify memory access pattern like:
				/// x = a[i];
				/// y = a[i+1000];
				/// z = a[i+2000];
				/// In the above example, the second and third memory access will be marked
				/// large stride memory access.
				bool isLargeStride(const Instruction *Inst);

				bool isGlobalAddr(const Value *V) const;
				bool isLocalAddr(const Value *V) const;
				bool isConstantAddr(const Value *V) const;
				};

				static const Value *getMemoryInstrPtr(const Instruction *Inst) {
				if (auto LI = dyn_cast<LoadInst>(Inst)) {
				return LI->getPointerOperand();
				}
				if (auto SI = dyn_cast<StoreInst>(Inst)) {
				return SI->getPointerOperand();
				}
				if (auto AI = dyn_cast<AtomicCmpXchgInst>(Inst)) {
				return AI->getPointerOperand();
				}
				if (auto AI = dyn_cast<AtomicRMWInst>(Inst)) {
				return AI->getPointerOperand();
				}
				if (auto MI = dyn_cast<AnyMemIntrinsic>(Inst)) {
				return MI->getRawDest();
				}

				return nullptr;
				}

				bool AMDGPUPerfHint::isIndirectAccess(const Instruction *Inst) const {
				DEBUG(dbgs() << "[isIndirectAccess] " << *Inst << '\n');
				DenseSet<const Value *> WorkSet;
				DenseSet<const Value *> Visited;
				if (const Value *MO = getMemoryInstrPtr(Inst)) {
				if (isGlobalAddr(MO))
				WorkSet.insert(MO);
				}

				while (!WorkSet.empty()) {
				const Value *V = *WorkSet.begin();
				WorkSet.erase(WorkSet.begin());
				if (!Visited.insert(V).second)
				continue;
				DEBUG(dbgs() << " check: " << *V << '\n');

				if (auto LD = dyn_cast<LoadInst>(V)) {
				auto M = LD->getPointerOperand();
				if (isGlobalAddr(M) ||
				isLocalAddr(M) ||
				isConstantAddr(M)) {
				DEBUG(dbgs() << " is IA\n");
				return true;
				}
				continue;
				}

				if (auto GEP = dyn_cast<GetElementPtrInst>(V)) {
				auto P = GEP->getPointerOperand();
				WorkSet.insert(P);
				for (unsigned I = 1, E = GEP->getNumIndices() + 1; I != E; ++I)
				WorkSet.insert(GEP->getOperand(I));
				continue;
				}

				if (auto U = dyn_cast<UnaryInstruction>(V)) {
				WorkSet.insert(U->getOperand(0));
				continue;
				}

				if (auto BO = dyn_cast<BinaryOperator>(V)) {
				WorkSet.insert(BO->getOperand(0));
				WorkSet.insert(BO->getOperand(1));
				continue;
				}

				if (auto S = dyn_cast<SelectInst>(V)) {
				WorkSet.insert(S->getFalseValue());
				WorkSet.insert(S->getTrueValue());
				continue;
				}

				if (auto E = dyn_cast<ExtractElementInst>(V)) {
				WorkSet.insert(E->getVectorOperand());
				continue;
				}

				if (auto Phi = dyn_cast<PHINode>(V)) {
				for (unsigned I = 0, E = Phi->getNumIncomingValues(); I != E; ++I)
				WorkSet.insert(Phi->getIncomingValue(I));
				continue;
				}

				DEBUG(dbgs() << " dropped\n");
				}

				DEBUG(dbgs() << " is not IA\n");
				return false;
				}

				void AMDGPUPerfHint::visit(const Function &F) {
				auto FIP = FIM.insert(std::make_pair(&F, AMDGPUPerfHintAnalysis::FuncInfo()));
				if (!FIP.second)
				return;

				AMDGPUPerfHintAnalysis::FuncInfo &FI = FIP.first->second;

				DEBUG(dbgs() << "[AMDGPUPerfHint] process " << F.getName() << '\n');

				for (auto &B : F) {
				LastAccess = MemAccessInfo();
				for (auto &I : B) {
				if (getMemoryInstrPtr(&I)) {
				if(isIndirectAccess(&I))
				++FI.IAMInstCount;
				if(isLargeStride(&I))
				++FI.LSMInstCount;
				++FI.MemInstCount;
				++FI.InstCount;
				continue;
				}
				CallSite CS(const_cast<Instruction*>(&I));
				if (CS) {
				Function *Callee = CS.getCalledFunction();
				arsenm: Run clang-format
				if (!Callee || Callee->isDeclaration()) {
				++FI.InstCount;
				continue;
				}
				if (&F == Callee) // Handle immediate recursion
				continue;

				visit(*Callee);

				AMDGPUPerfHintAnalysis::FuncInfoMap::iterator Loc = FIM.find(Callee);
				assert(Loc != FIM.end() && "No func info");
				FI.MemInstCount += Loc->second.MemInstCount;
				FI.InstCount += Loc->second.InstCount;
				FI.IAMInstCount += Loc->second.IAMInstCount;
				FI.LSMInstCount += Loc->second.LSMInstCount;
				} else if (const auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
				APInt Off(DL->getIndexSizeInBits(GEP->getPointerAddressSpace()), 0);
				if (GEP->accumulateConstantOffset(*DL, Off)) {
				if (Off.isIntN(12))
				arsenm: isLegalAddressingMode. I'm not sure I understand why this pass is doing most of what it's doing. Why does the addressing mode match matter for determining if the function is probably memory bound? With a machine pass you would have a much more exact idea of the number of memory operations really being executed
				rampitec: It matters because we are trying to estimate memory to ALU instruction ratio. A foldable GEP does not result in an instruction. In fact this is a rough estimation, completely correct answer is not needed. On the machine IR in turn it will be very difficult to track pointers.
				// Offset will likely be folded into load or store
				continue;
				}
				++FI.InstCount;
				} else {
				++FI.InstCount;
				}
				}
				}
				}

				void AMDGPUPerfHint::runOnFunction(Function &F) {
				if (FIM.find(&F) != FIM.end())
				return;

				const Module &M = *F.getParent();
				DL = &M.getDataLayout();
				AS = AMDGPU::getAMDGPUAS(M);

				visit(F);

				AMDGPUPerfHintAnalysis::FuncInfoMap::iterator Loc = FIM.find(&F);
				assert(Loc != FIM.end() && "No func info");
				DEBUG(dbgs() << F.getName() <<
				" MemInst: " << Loc->second.MemInstCount << '\n' <<
				" IAMInst: " << Loc->second.IAMInstCount << '\n' <<
				" LSMInst: " << Loc->second.LSMInstCount << '\n' <<
				" TotalInst: " << Loc->second.InstCount << '\n');

				auto &FI = Loc->second;

				if (isMemBound(FI)) {
				arsenm: Move this up to avoid calling the same find twice? visit can also return the inserted iterator
				rampitec: It is not the same, it is one after visit() call. But yes, changing to visit to return the iterator.
				DEBUG(dbgs() << F.getName() << " is memory bound\n");
				NumMemBound++;
				}

				if (AMDGPU::isEntryFunctionCC(F.getCallingConv()) && needLimitWave(FI)) {
				DEBUG(dbgs() << F.getName() << " needs limit wave\n");
				NumLimitWave++;
				}
				}

				bool AMDGPUPerfHint::isMemBound(const AMDGPUPerfHintAnalysis::FuncInfo &FI) {
				return static_cast<double>(FI.MemInstCount) / FI.InstCount * 100 >
				MemBoundThresh;
				}

				bool AMDGPUPerfHint::needLimitWave(const AMDGPUPerfHintAnalysis::FuncInfo& FI) {
				return static_cast<double>(FI.MemInstCount
				+ FI.IAMInstCount * IAWeight
				+ FI.LSMInstCount * LSWeight)
				/ FI.InstCount * 100 > LimitWaveThresh;
				}

				arsenm: I think there's a policy of generally avoiding FP computations
				bool AMDGPUPerfHint::isGlobalAddr(const Value* V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType())) {
				unsigned As = PT->getAddressSpace();
				// Flat likely points to global too.
				return As == AS.GLOBAL_ADDRESS || As == AS.FLAT_ADDRESS;
				arsenm: Ditto
				}
				return false;
				}

				bool AMDGPUPerfHint::isLocalAddr(const Value* V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType()))
				return PT->getAddressSpace() == AS.LOCAL_ADDRESS;
				return false;
				}

				bool AMDGPUPerfHint::isLargeStride(const Instruction* Inst) {
				DEBUG(dbgs() << "[isLargeStride] " << *Inst << '\n');

				MemAccessInfo MAI = makeMemAccessInfo(const_cast<Instruction*>(Inst));
				bool IsLargeStride = MAI.isLargeStride(LastAccess);
				if (MAI.Base)
				LastAccess = std::move(MAI);

				return IsLargeStride;
				}

				AMDGPUPerfHint::MemAccessInfo
				AMDGPUPerfHint::makeMemAccessInfo(Instruction *Inst) const {
				MemAccessInfo MAI;
				const Value* MO = getMemoryInstrPtr(Inst);

				DEBUG(dbgs() << "[isLargeStride] MO: " << *MO << '\n');
				// Do not treat local-addr memory access as large stride.
				if (isLocalAddr(MO))
				return MAI;

				MAI.V = MO;
				MAI.Base = llvm::GetPointerBaseWithConstantOffset(MO, MAI.Offset, *DL);
				return MAI;
				}

				bool AMDGPUPerfHint::isConstantAddr(const Value* V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType())) {
				unsigned As = PT->getAddressSpace();
				return As == AS.CONSTANT_ADDRESS || As == AS.CONSTANT_ADDRESS_32BIT;
				}
				return false;
				}

				bool AMDGPUPerfHint::MemAccessInfo::isLargeStride(MemAccessInfo& Reference)
				const {

				if (!Base || !Reference.Base || Base != Reference.Base)
				return false;

				uint64_t Diff = Offset > Reference.Offset ? Offset - Reference.Offset
				: Reference.Offset - Offset;
				bool Result = Diff > LargeStrideThresh;
				DEBUG(dbgs() << "[isLargeStride compare]\n"
				<< print()
				<< "<=>\n"
				<< Reference.print()
				<< "Result:" << Result << '\n');
				return Result;
				}
				} // namespace

				bool AMDGPUPerfHintAnalysis::runOnFunction(Function &F) {
				AMDGPUPerfHint Analyzer(FIM);
				Analyzer.runOnFunction(F);
				return false;
				}

				bool AMDGPUPerfHintAnalysis::isMemoryBound(const Function *F) const {
				auto FI = FIM.find(F);
				if (FI == FIM.end())
				return false;

				return AMDGPUPerfHint::isMemBound(FI->second);
				}

				bool AMDGPUPerfHintAnalysis::needsWaveLimiter(const Function *F) const {
				auto FI = FIM.find(F);
				if (FI == FIM.end())
				return false;

				return AMDGPUPerfHint::needLimitWave(FI->second);
				}
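
The inline comment on the threshold functions asks to avoid FP computations. One possible integer-only form, shown here only as a sketch, cross-multiplies instead of dividing: `Mem / Inst * 100 > Thresh` becomes `Mem * 100 > Thresh * Inst`. The `FuncInfo` struct below is a stand-in for `AMDGPUPerfHintAnalysis::FuncInfo`, and the threshold and weight values are placeholders chosen for illustration, not values taken from this diff.

```cpp
#include <cstdint>

// Stand-in for AMDGPUPerfHintAnalysis::FuncInfo (illustration only).
struct FuncInfo {
  uint64_t MemInstCount;
  uint64_t InstCount;
  uint64_t IAMInstCount;
  uint64_t LSMInstCount;
};

// Placeholder thresholds (percent) and weights; assumed values.
constexpr uint64_t MemBoundThresh = 50;
constexpr uint64_t LimitWaveThresh = 50;
constexpr uint64_t IAWeight = 100;
constexpr uint64_t LSWeight = 1000;

// Integer form of: MemInstCount / InstCount * 100 > MemBoundThresh.
inline bool isMemBound(const FuncInfo &FI) {
  return FI.MemInstCount * 100 > MemBoundThresh * FI.InstCount;
}

// Integer form of the weighted ratio used for the wave limiter hint.
inline bool needLimitWave(const FuncInfo &FI) {
  uint64_t Weighted = FI.MemInstCount + FI.IAMInstCount * IAWeight +
                      FI.LSMInstCount * LSWeight;
  return Weighted * 100 > LimitWaveThresh * FI.InstCount;
}
```

With these placeholder values, a function with 8 memory instructions out of 12 is classified as memory bound, while one indirect access among 10 instructions is enough to request the wave limiter because of the large `IAWeight`.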

lib/Target/AMDGPU/CMakeLists.txt

[51 lines not shown]	add_llvm_target(AMDGPUCodeGen
AMDGPURewriteOutArguments.cpp		AMDGPURewriteOutArguments.cpp
AMDGPUSubtarget.cpp		AMDGPUSubtarget.cpp
AMDGPUTargetMachine.cpp		AMDGPUTargetMachine.cpp
AMDGPUTargetObjectFile.cpp		AMDGPUTargetObjectFile.cpp
AMDGPUTargetTransformInfo.cpp		AMDGPUTargetTransformInfo.cpp
AMDGPUUnifyDivergentExitNodes.cpp		AMDGPUUnifyDivergentExitNodes.cpp
AMDGPUUnifyMetadata.cpp		AMDGPUUnifyMetadata.cpp
AMDGPUInline.cpp		AMDGPUInline.cpp
		AMDGPUPerfHintAnalysis.cpp
AMDILCFGStructurizer.cpp		AMDILCFGStructurizer.cpp
GCNHazardRecognizer.cpp		GCNHazardRecognizer.cpp
GCNIterativeScheduler.cpp		GCNIterativeScheduler.cpp
GCNMinRegStrategy.cpp		GCNMinRegStrategy.cpp
GCNRegPressure.cpp		GCNRegPressure.cpp
GCNSchedStrategy.cpp		GCNSchedStrategy.cpp
R600ClauseMergePass.cpp		R600ClauseMergePass.cpp
R600ControlFlowFinalizer.cpp		R600ControlFlowFinalizer.cpp
[43 lines not shown]

lib/Target/AMDGPU/GCNSchedStrategy.cpp

[366 lines not shown]	void GCNScheduleDAGMILive::schedule() {
WavesAfter = std::min(WavesAfter, MFI.getMaxWavesPerEU());		WavesAfter = std::min(WavesAfter, MFI.getMaxWavesPerEU());
WavesBefore = std::min(WavesBefore, MFI.getMaxWavesPerEU());		WavesBefore = std::min(WavesBefore, MFI.getMaxWavesPerEU());
LLVM_DEBUG(dbgs() << "Occupancy before scheduling: " << WavesBefore		LLVM_DEBUG(dbgs() << "Occupancy before scheduling: " << WavesBefore
<< ", after " << WavesAfter << ".\n");		<< ", after " << WavesAfter << ".\n");

// We could not keep current target occupancy because of the just scheduled		// We could not keep current target occupancy because of the just scheduled
// region. Record new occupancy for next scheduling cycle.		// region. Record new occupancy for next scheduling cycle.
unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);		unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
		// Allow memory bound functions to drop to 4 waves if not limited by an
		// attribute.
		unsigned MinMemBoundWaves = std::max(MFI.getMinWavesPerEU(), 4u);
		if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
		WavesAfter >= MinMemBoundWaves &&
		(MFI.isMemoryBound() || MFI.needsWaveLimiter())) {
		LLVM_DEBUG(dbgs() << "Function is memory bound, allow occupancy drop up to "
		<< MinMemBoundWaves << " waves\n");
		NewOccupancy = WavesAfter;
		}
if (NewOccupancy < MinOccupancy) {		if (NewOccupancy < MinOccupancy) {
MinOccupancy = NewOccupancy;		MinOccupancy = NewOccupancy;
LLVM_DEBUG(dbgs() << "Occupancy lowered for the function to "		LLVM_DEBUG(dbgs() << "Occupancy lowered for the function to "
<< MinOccupancy << ".\n");		<< MinOccupancy << ".\n");
}		}

if (WavesAfter >= WavesBefore) {		if (WavesAfter >= MinOccupancy) {
Pressure[RegionIdx] = PressureAfter;		Pressure[RegionIdx] = PressureAfter;
return;		return;
}		}

LLVM_DEBUG(dbgs() << "Attempting to revert scheduling.\n");		LLVM_DEBUG(dbgs() << "Attempting to revert scheduling.\n");
RegionEnd = RegionBegin;		RegionEnd = RegionBegin;
for (MachineInstr *MI : Unsched) {		for (MachineInstr *MI : Unsched) {
if (MI->isDebugInstr())		if (MI->isDebugInstr())
[166 lines not shown]
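
The occupancy logic in the hunk above can be read as a pure decision function. The sketch below restates it outside of LLVM so the conditions are easier to follow; `Decision` and `decideOccupancy` are hypothetical names, and `HasPerfHint` stands for `MFI.isMemoryBound() || MFI.needsWaveLimiter()`.

```cpp
#include <algorithm>

// Hypothetical result type: the updated minimum occupancy and whether the
// new schedule for the region is kept (true) or reverted (false).
struct Decision {
  unsigned MinOccupancy;
  bool KeepSchedule;
};

inline Decision decideOccupancy(unsigned WavesBefore, unsigned WavesAfter,
                                unsigned MinOccupancy, unsigned MinWavesPerEU,
                                bool HasPerfHint) {
  unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
  // Hinted functions may drop to 4 waves unless an attribute demands more.
  unsigned MinMemBoundWaves = std::max(MinWavesPerEU, 4u);
  if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
      WavesAfter >= MinMemBoundWaves && HasPerfHint)
    NewOccupancy = WavesAfter;
  if (NewOccupancy < MinOccupancy)
    MinOccupancy = NewOccupancy;
  // The schedule is kept only if it does not fall below the (possibly
  // lowered) minimum occupancy.
  return {MinOccupancy, WavesAfter >= MinOccupancy};
}
```

For example, a region that goes from 10 to 5 waves is kept and lowers the function's minimum occupancy to 5 when a hint is set. Without a hint, or when the result falls below 4 waves, the schedule is reverted and the minimum stays at 10.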

lib/Target/AMDGPU/SIMachineFunctionInfo.h

[172 lines not shown]	private:
// Compute directly in sgpr[0:1]		// Compute directly in sgpr[0:1]
// Other shaders indirect 64-bits at sgpr[0:1]		// Other shaders indirect 64-bits at sgpr[0:1]
bool ImplicitBufferPtr : 1;		bool ImplicitBufferPtr : 1;

// Pointer to where the ABI inserts special kernel arguments separate from the		// Pointer to where the ABI inserts special kernel arguments separate from the
// user arguments. This is an offset from the KernargSegmentPtr.		// user arguments. This is an offset from the KernargSegmentPtr.
bool ImplicitArgPtr : 1;		bool ImplicitArgPtr : 1;

		// Function may be memory bound.
		bool MemoryBound : 1;

		// Kernel may need limited waves per EU for better performance.
		bool WaveLimiter : 1;

// The hard-wired high half of the address of the global information table		// The hard-wired high half of the address of the global information table
// for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since		// for AMDPAL OS type. 0xffffffff represents no hard-wired high half, since
// current hardware only allows a 16 bit value.		// current hardware only allows a 16 bit value.
unsigned GITPtrHigh;		unsigned GITPtrHigh;

unsigned HighBitsOf32BitAddress;		unsigned HighBitsOf32BitAddress;

MCPhysReg getNextUserSGPR() const {		MCPhysReg getNextUserSGPR() const {
[195 lines not shown]	public:
bool hasImplicitArgPtr() const {		bool hasImplicitArgPtr() const {
return ImplicitArgPtr;		return ImplicitArgPtr;
}		}

bool hasImplicitBufferPtr() const {		bool hasImplicitBufferPtr() const {
return ImplicitBufferPtr;		return ImplicitBufferPtr;
}		}

		bool isMemoryBound() const {
		return MemoryBound;
		}

		bool needsWaveLimiter() const {
		return WaveLimiter;
		}

AMDGPUFunctionArgInfo &getArgInfo() {		AMDGPUFunctionArgInfo &getArgInfo() {
return ArgInfo;		return ArgInfo;
}		}

const AMDGPUFunctionArgInfo &getArgInfo() const {		const AMDGPUFunctionArgInfo &getArgInfo() const {
return ArgInfo;		return ArgInfo;
}		}

[265 lines not shown]

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

//===- SIMachineFunctionInfo.cpp - SI Machine Function Info ---------------===//		//===- SIMachineFunctionInfo.cpp - SI Machine Function Info ---------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
#include "AMDGPUArgumentUsageInfo.h"		#include "AMDGPUArgumentUsageInfo.h"
		#include "AMDGPUPerfHintAnalysis.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
#include "Utils/AMDGPUBaseInfo.h"		#include "Utils/AMDGPUBaseInfo.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFrameInfo.h"		#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
		#include "llvm/CodeGen/MachineModuleInfo.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"		#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/IR/CallingConv.h"		#include "llvm/IR/CallingConv.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include <cassert>		#include <cassert>
#include <vector>		#include <vector>

#define MAX_LANES 64		#define MAX_LANES 64

[15 lines not shown]	: AMDGPUMachineFunction(MF),
WorkGroupIDZ(false),		WorkGroupIDZ(false),
WorkGroupInfo(false),		WorkGroupInfo(false),
PrivateSegmentWaveByteOffset(false),		PrivateSegmentWaveByteOffset(false),
WorkItemIDX(false),		WorkItemIDX(false),
WorkItemIDY(false),		WorkItemIDY(false),
WorkItemIDZ(false),		WorkItemIDZ(false),
ImplicitBufferPtr(false),		ImplicitBufferPtr(false),
ImplicitArgPtr(false),		ImplicitArgPtr(false),
		MemoryBound(false),
		WaveLimiter(false),
GITPtrHigh(0xffffffff),		GITPtrHigh(0xffffffff),
HighBitsOf32BitAddress(0) {		HighBitsOf32BitAddress(0) {
const SISubtarget &ST = MF.getSubtarget<SISubtarget>();		const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
const Function &F = MF.getFunction();		const Function &F = MF.getFunction();
FlatWorkGroupSizes = ST.getFlatWorkGroupSizes(F);		FlatWorkGroupSizes = ST.getFlatWorkGroupSizes(F);
WavesPerEU = ST.getWavesPerEU(F);		WavesPerEU = ST.getWavesPerEU(F);

if (!isEntryFunction()) {		if (!isEntryFunction()) {
[107 lines not shown]	SIMachineFunctionInfo::SIMachineFunctionInfo(const MachineFunction &MF)
StringRef S = A.getValueAsString();		StringRef S = A.getValueAsString();
if (!S.empty())		if (!S.empty())
S.consumeInteger(0, GITPtrHigh);		S.consumeInteger(0, GITPtrHigh);

A = F.getFnAttribute("amdgpu-32bit-address-high-bits");		A = F.getFnAttribute("amdgpu-32bit-address-high-bits");
S = A.getValueAsString();		S = A.getValueAsString();
if (!S.empty())		if (!S.empty())
S.consumeInteger(0, HighBitsOf32BitAddress);		S.consumeInteger(0, HighBitsOf32BitAddress);

		if (auto Resolver = MF.getMMI().getResolver()) {
		vpykhtin: I would use "auto &Resolver" to emphasize Resolver is a reference not a temp object
		rampitec: Resolver is a pointer, not an object or reference.
		if (AMDGPUPerfHintAnalysis *PHA = static_cast<AMDGPUPerfHintAnalysis*>(
		Resolver->getAnalysisIfAvailable(&AMDGPUPerfHintAnalysisID, true))) {
		MemoryBound = PHA->isMemoryBound(&MF.getFunction());
		WaveLimiter= PHA->needsWaveLimiter(&MF.getFunction());
		arsenm: Missing space
		}
		}
		arsenm: It seems like there's no reason to actually put this code in SIMachineFunctionInfo. Can you just do this directly in the AsmPrinter where you emit this?
		rampitec: By the time it is needed function's IR is already destroyed. Note, it is not only needed from printer, it is also checked in the scheduler.
}		}

unsigned SIMachineFunctionInfo::addPrivateSegmentBuffer(		unsigned SIMachineFunctionInfo::addPrivateSegmentBuffer(
const SIRegisterInfo &TRI) {		const SIRegisterInfo &TRI) {
ArgInfo.PrivateSegmentBuffer =		ArgInfo.PrivateSegmentBuffer =
ArgDescriptor::createRegister(TRI.getMatchingSuperReg(		ArgDescriptor::createRegister(TRI.getMatchingSuperReg(
getNextUserSGPR(), AMDGPU::sub0, &AMDGPU::SReg_128RegClass));		getNextUserSGPR(), AMDGPU::sub0, &AMDGPU::SReg_128RegClass));
NumUserSGPRs += 4;		NumUserSGPRs += 4;
[120 lines not shown]

test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX700 --check-prefix=NOTES %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX700 --check-prefix=NOTES %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX803 --check-prefix=NOTES %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX803 --check-prefix=NOTES %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX900 --check-prefix=NOTES %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX900 --check-prefix=NOTES %s

	@var = addrspace(1) global float 0.0			@var = addrspace(1) global float 0.0

	; CHECK: ---			; CHECK: ---
	; CHECK: Version: [ 1, 0 ]			; CHECK: Version: [ 1, 0 ]
	; CHECK: Kernels:			; CHECK: Kernels:

	; CHECK: - Name: test			; CHECK: - Name: test
	; CHECK: SymbolName: 'test@kd'			; CHECK: SymbolName: 'test@kd'
	; CHECK: CodeProps:			; CHECK: CodeProps:
	; CHECK: KernargSegmentSize: 24			; CHECK: KernargSegmentSize: 24
	; CHECK: GroupSegmentFixedSize: 0			; CHECK: GroupSegmentFixedSize: 0
	; CHECK: PrivateSegmentFixedSize: 0			; CHECK: PrivateSegmentFixedSize: 0
	; CHECK: KernargSegmentAlign: 8			; CHECK: KernargSegmentAlign: 8
	; CHECK: WavefrontSize: 64			; CHECK: WavefrontSize: 64
	; CHECK: NumSGPRs: 6			; CHECK: NumSGPRs: 6
	; GFX700: NumVGPRs: 4			; CHECK: NumVGPRs: 3
	; GFX803: NumVGPRs: 6
	; GFX900: NumVGPRs: 6
	; CHECK: MaxFlatWorkGroupSize: 256			; CHECK: MaxFlatWorkGroupSize: 256
	define amdgpu_kernel void @test(			define amdgpu_kernel void @test(
	half addrspace(1)* %r,			half addrspace(1)* %r,
	half addrspace(1)* %a,			half addrspace(1)* %a,
	half addrspace(1)* %b) {			half addrspace(1)* %b) {
	entry:			entry:
	%a.val = load half, half addrspace(1)* %a			%a.val = load half, half addrspace(1)* %a
	%b.val = load half, half addrspace(1)* %b			%b.val = load half, half addrspace(1)* %b
	[112 lines not shown]

test/CodeGen/AMDGPU/perfhint.ll

This file was added.

				; RUN: llc -march=amdgcn < %s | FileCheck -check-prefix=GCN %s

				; GCN-LABEL: {{^}}test_membound:
				; MemoryBound: 1
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_membound(<4 x i32> addrspace(1)* nocapture readonly %arg, <4 x i32> addrspace(1)* nocapture %arg1) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp2 = zext i32 %tmp to i64
				%tmp3 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp2
				%tmp4 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp3, align 16
				%tmp5 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp2
				store <4 x i32> %tmp4, <4 x i32> addrspace(1)* %tmp5, align 16
				%tmp6 = add nuw nsw i64 %tmp2, 1
				%tmp7 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp6
				%tmp8 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp7, align 16
				%tmp9 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp6
				store <4 x i32> %tmp8, <4 x i32> addrspace(1)* %tmp9, align 16
				%tmp10 = add nuw nsw i64 %tmp2, 2
				%tmp11 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp10
				%tmp12 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp11, align 16
				%tmp13 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp10
				store <4 x i32> %tmp12, <4 x i32> addrspace(1)* %tmp13, align 16
				%tmp14 = add nuw nsw i64 %tmp2, 3
				%tmp15 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp14
				%tmp16 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp15, align 16
				%tmp17 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp14
				store <4 x i32> %tmp16, <4 x i32> addrspace(1)* %tmp17, align 16
				ret void
				}

				; GCN-LABEL: {{^}}test_large_stride:
				; MemoryBound: 0
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_large_stride(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 4096
				%tmp1 = load i32, i32 addrspace(1)* %tmp, align 4
				%tmp2 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 1
				store i32 %tmp1, i32 addrspace(1)* %tmp2, align 4
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 8192
				%tmp4 = load i32, i32 addrspace(1)* %tmp3, align 4
				%tmp5 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 2
				store i32 %tmp4, i32 addrspace(1)* %tmp5, align 4
				%tmp6 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 12288
				%tmp7 = load i32, i32 addrspace(1)* %tmp6, align 4
				%tmp8 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 3
				store i32 %tmp7, i32 addrspace(1)* %tmp8, align 4
				ret void
				}

				; GCN-LABEL: {{^}}test_indirect:
				; MemoryBound: 0
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_indirect(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 1
				%tmp1 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 2
				%tmp2 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 3
				%tmp3 = bitcast i32 addrspace(1)* %arg to <4 x i32> addrspace(1)*
				%tmp4 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp3, align 4
				%tmp5 = extractelement <4 x i32> %tmp4, i32 0
				%tmp6 = sext i32 %tmp5 to i64
				%tmp7 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp6
				%tmp8 = load i32, i32 addrspace(1)* %tmp7, align 4
				store i32 %tmp8, i32 addrspace(1)* %arg, align 4
				%tmp9 = extractelement <4 x i32> %tmp4, i32 1
				%tmp10 = sext i32 %tmp9 to i64
				%tmp11 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp10
				%tmp12 = load i32, i32 addrspace(1)* %tmp11, align 4
				store i32 %tmp12, i32 addrspace(1)* %tmp, align 4
				%tmp13 = extractelement <4 x i32> %tmp4, i32 2
				%tmp14 = sext i32 %tmp13 to i64
				%tmp15 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp14
				%tmp16 = load i32, i32 addrspace(1)* %tmp15, align 4
				store i32 %tmp16, i32 addrspace(1)* %tmp1, align 4
				%tmp17 = extractelement <4 x i32> %tmp4, i32 3
				%tmp18 = sext i32 %tmp17 to i64
				%tmp19 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp18
				%tmp20 = load i32, i32 addrspace(1)* %tmp19, align 4
				store i32 %tmp20, i32 addrspace(1)* %tmp2, align 4
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x()
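
To see why `test_large_stride` is expected to trigger the wave limiter hint, it may help to replay the large-stride check on the byte offsets of its six accesses. The sketch below mirrors the comparison in `MemAccessInfo::isLargeStride` and the per-block `LastAccess` update. `Access` and `countLargeStride` are hypothetical names, and the threshold is a placeholder because the default of `LargeStrideThresh` is not shown in this section.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for MemAccessInfo: a base pointer plus a constant
// byte offset from that base.
struct Access {
  const void *Base;
  int64_t Offset;
};

// Placeholder threshold in bytes (assumed value).
constexpr int64_t LargeStrideThresh = 64;

// An access is large stride when it shares a base with the previous access
// and the distance between the two offsets exceeds the threshold.
inline bool isLargeStride(const Access &Cur, const Access &Last) {
  if (!Cur.Base || !Last.Base || Cur.Base != Last.Base)
    return false;
  int64_t Diff = Cur.Offset > Last.Offset ? Cur.Offset - Last.Offset
                                          : Last.Offset - Cur.Offset;
  return Diff > LargeStrideThresh;
}

// Count large-stride accesses in one basic block. The last access starts
// out empty, as it does at the top of each block in visit().
inline unsigned countLargeStride(const std::vector<Access> &Block) {
  Access Last{nullptr, 0};
  unsigned Count = 0;
  for (const Access &A : Block) {
    if (isLargeStride(A, Last))
      ++Count;
    if (A.Base)
      Last = A;
  }
  return Count;
}
```

In `test_large_stride` the loads are at element indices 4096, 8192 and 12288 of an `i32` array and the stores at indices 1, 2 and 3, which gives byte offsets 16384, 4, 32768, 8, 49152 and 12 from `%arg`. Under the placeholder threshold, every access except the first is counted, since the first has no previous access to compare against.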

Revision Contents

Diff 147628

lib/Target/AMDGPU/AMDGPU.h

lib/Target/AMDGPU/AMDGPUAsmPrinter.h

lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp

lib/Target/AMDGPU/CMakeLists.txt

lib/Target/AMDGPU/GCNSchedStrategy.cpp

lib/Target/AMDGPU/SIMachineFunctionInfo.h

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp

test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll

test/CodeGen/AMDGPU/perfhint.ll
