This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add perf hints to functions
ClosedPublic

Authored by rampitec on May 16 2018, 6:12 PM.


Details

Reviewers

yaxunl
vpykhtin
arsenm

Commits

rG1c538423dc24: [AMDGPU] Add perf hints to functions
rL333289: [AMDGPU] Add perf hints to functions

Summary

This is an adoption of the HSAIL perfhint pass. Two types of hints are produced:

Function is memory bound.
Kernel can use wave limiter.

Currently these hints are used in the scheduler. If a function is suspected
to be memory bound we allow occupancy to decrease to 4 waves in the course
of scheduling.
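
The scheduler policy described above can be sketched roughly as follows. This is a minimal illustration and not the code from the patch: `minAcceptableOccupancy` and its parameters are hypothetical names, and only the 4-wave floor is taken from the summary.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: the lowest occupancy (waves per EU) the scheduler
// accepts before it would revert a schedule.
//   TargetOccupancy: occupancy the function had before scheduling.
//   MemoryBound:     hint produced by the analysis.
unsigned minAcceptableOccupancy(unsigned TargetOccupancy, bool MemoryBound) {
  const unsigned MemBoundFloor = 4; // from the summary: "decrease to 4 waves"
  if (!MemoryBound)
    return TargetOccupancy; // no hint: occupancy may not drop at all
  // With the hint, occupancy may drop, but never below the floor
  // (or below the starting occupancy if that was already lower).
  return std::min(TargetOccupancy, MemBoundFloor);
}
```

With this shape, a function that is not flagged keeps its original occupancy requirement, while a flagged function lets the scheduler trade occupancy for a better schedule.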

Diff Detail

Repository: rL LLVM

Event Timeline

rampitec created this revision.May 16 2018, 6:12 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others.May 16 2018, 6:12 PM

t-tye added inline comments.May 16 2018, 9:31 PM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Should this be done at the beginning of the visit to ensure it will terminate for mutually recursive functions?
388 ↗	(On Diff #147223)	indentation
392 ↗	(On Diff #147223)	Does having the std::move() prevent named return value optimization (which can happen for the return above)? Returning a prvalue (eg by directly returning a constructor) would get guaranteed copy elision in C++17.
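
The question about `std::move()` on a return value can be illustrated with a small standalone sketch. `Tracked` and the three factory functions are hypothetical and unrelated to the patch; they only make copies and moves observable.

```cpp
#include <cassert>
#include <utility>

// Counts copies and moves so the effect of each return style is observable.
struct Tracked {
  static int Moves;
  static int Copies;
  Tracked() = default;
  Tracked(const Tracked &) { ++Copies; }
  Tracked(Tracked &&) noexcept { ++Moves; }
};
int Tracked::Moves = 0;
int Tracked::Copies = 0;

// Returning a prvalue: copy elision is guaranteed in C++17.
Tracked makePrvalue() { return Tracked(); }

// Returning a named local: NRVO is permitted but not required; if the
// compiler does not apply it, the local is moved, not copied.
Tracked makeNamed() {
  Tracked T;
  return T;
}

// Wrapping the local in std::move makes the operand an xvalue, which rules
// out NRVO, so the move constructor is always called.
Tracked makeMoved() {
  Tracked T;
  return std::move(T);
}
```

Compilers typically warn about the last form (for example with a pessimizing-move diagnostic), which matches the reviewer's concern.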

How can UMDs disable this optimization?

Are there cases where this decreases performance?

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
13 ↗	(On Diff #147223)	Did you mean "cache thrashing"?

In D46992#1102499, @mareko wrote:

How can UMDs disable this optimization?

Are there cases where this decreases performance?

This is an analysis. The optimization itself must be done in the runtime. The OpenCL RT used to control it via the environment. The graphics RT never did it. At any rate, if you know your ideal occupancy it is better to set the amdgpu-waves-per-eu attribute.

The only optimization implemented here based on the analysis is in the scheduler. In practice there is no way for a memory-intensive program to benefit from an occupancy higher than 4; usually it is lower. However, the impact of the optimization is to let the scheduler work where previously it just reverted the schedule if occupancy had decreased. Therefore the natural way to return to the old behavior after this change is to disable the scheduler (-enable-misched=0), which will result in the same code as before if this condition is triggered.

arsenm added inline comments.May 17 2018, 3:17 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
336–337 ↗	(On Diff #147223)	You're not really supposed to use attributes to pass information through to others, although we do this in a few places to hack around isel limitations. Can you make this an analysis pass instead which returns yes / no at the point you actually need this?
382–383 ↗	(On Diff #147223)	Should handle other memory operations too, like atomics and intrinsics. There is already a wrapper somewhere which should find the pointer operand for all of these operations

Addressed review comments.
Pass is converted to analysis.

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Thank you!
392 ↗	(On Diff #147223)	After the port it is trivially copyable, so move is not required any more.

arsenm added inline comments.May 18 2018, 1:25 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
1 ↗	(On Diff #147223)	Update comment
11–13 ↗	(On Diff #147223)	Comment needs update. Maybe add a todo that this should be a machine analysis?
166 ↗	(On Diff #147223)	What does this mean exactly by indirect access? This seems to me like it's reimplementing something like GetUnderlyingObject
259 ↗	(On Diff #147223)	Probably should check for CallSite to cover the possible future case of InvokeInsts
260–261 ↗	(On Diff #147223)	!Callee
271 ↗	(On Diff #147223)	Extra space
277–278 ↗	(On Diff #147223)	isLegalAddressingMode (although at this point this should probably be a machine pass, but I understand that's more work to rewrite)
289 ↗	(On Diff #147223)	Can you add a test for this case
397 ↗	(On Diff #147223)	There's also CONSTANT_ADDRESS_32BIT
lib/Target/AMDGPU/SIDefines.h
538–539 ↗	(On Diff #147223)	Should be able to also remove this
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184 ↗	(On Diff #147421)	It seems like there's no reason to actually put this code in SIMachineFunctionInfo. Can you just do this directly in the AsmPrinter where you emit this?

yaxunl added inline comments.May 18 2018, 4:08 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
166 ↗	(On Diff #147223)	indirect access here means something like a[b[i]], i.e., the index of the array is loaded from memory. It usually results in random access instead of streaming access. It could probably have a better name.
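
The distinction between streaming and indirect access described in this comment can be shown with two small loops. This is an illustrative sketch only; `sumStream` and `sumIndirect` are hypothetical names and do not appear in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Streaming access: the address of each load depends only on the loop index,
// so consecutive iterations touch consecutive elements.
long sumStream(const std::vector<int> &A) {
  long Sum = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    Sum += A[I];
  return Sum;
}

// Indirect access: the index into A is itself loaded from memory (B[I]),
// so the addresses touched in A can be arbitrary, i.e. the a[b[i]] pattern.
long sumIndirect(const std::vector<int> &A, const std::vector<std::size_t> &B) {
  long Sum = 0;
  for (std::size_t I = 0; I < B.size(); ++I)
    Sum += A[B[I]];
  return Sum;
}
```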

rampitec added inline comments.May 18 2018, 9:39 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
11–13 ↗	(On Diff #147223)	I do not think it has to be a machine IR pass. It would be really difficult to perform this analysis on machine IR.
289 ↗	(On Diff #147223)	Only as opt test. BE will fail on recursion.
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184 ↗	(On Diff #147421)	By the time it is needed the function's IR is already destroyed. Note that it is needed not only by the printer; it is also checked in the scheduler.

rampitec added inline comments.May 18 2018, 9:40 AM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
260–261 ↗	(On Diff #147223)	This file does not exist in the patch. You seem to comment on the old version somehow.

rampitec updated this revision to Diff 147546.May 18 2018, 10:31 AM

rampitec marked 5 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗	(On Diff #147223)	Actually since it is now an on-demand analysis I cannot do it even with opt. We will need to generally fix recursion handling in the BE, it is not a problem specific to this patch.
lib/Target/AMDGPU/SIDefines.h
538–539 ↗	(On Diff #147223)	It was removed when pass was converted to analysis. Please check the current patch.

arsenm added inline comments.May 18 2018, 3:13 PM

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
257–261 ↗	(On Diff #147546)	isLegalAddressingMode. I'm not sure I understand why this pass is doing most of what it's doing. Why does the addressing mode match matter for determining if the function is probably memory bound? With a machine pass you would have a much more exact idea of the number of memory operations really being executed

rampitec added inline comments.May 18 2018, 3:35 PM

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
277–278 ↗	(On Diff #147223)	There is no TLI or subtarget here yet.
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
257–261 ↗	(On Diff #147546)	It matters because we are trying to estimate the memory-to-ALU instruction ratio. A foldable GEP does not result in an instruction. In fact this is a rough estimate; a completely correct answer is not needed. On machine IR, in turn, it would be very difficult to track pointers.
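
The counting idea in this reply can be sketched as follows. This is a simplified model under stated assumptions, not the pass itself: the `Kind` categories and `countInstructions` are hypothetical, and a real implementation would classify LLVM IR instructions instead of enum values.

```cpp
#include <cassert>
#include <vector>

// Hypothetical instruction categories, for illustration only.
enum class Kind { Memory, FoldableAddress, Other };

struct Counts {
  unsigned MemInstCount = 0; // loads, stores, atomics, memory intrinsics
  unsigned InstCount = 0;    // everything expected to become an instruction
};

// Rough estimate of how many instructions are memory operations. An address
// computation that folds into the addressing mode is skipped because it is
// not expected to produce an instruction of its own.
Counts countInstructions(const std::vector<Kind> &Insts) {
  Counts C;
  for (Kind K : Insts) {
    if (K == Kind::FoldableAddress)
      continue;
    ++C.InstCount;
    if (K == Kind::Memory)
      ++C.MemInstCount;
  }
  return C;
}
```

Skipping foldable address computations keeps them from diluting the ratio, which is why the addressing-mode check matters even for a rough estimate.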

rampitec updated this revision to Diff 147628.May 18 2018, 5:59 PM

rampitec marked 2 inline comments as done.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

In D46992#1105635, @mareko wrote:

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you can see, the pass processes anything that can be cast to a load, store, atomic or memory intrinsic. Everything else is considered an ordinary instruction. For example, if you have an image in mind, it is conservatively not considered a memory instruction.

In D46992#1105636, @rampitec wrote:

In D46992#1105635, @mareko wrote:

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you can see, the pass processes anything that can be cast to a load, store, atomic or memory intrinsic. Everything else is considered an ordinary instruction. For example, if you have an image in mind, it is conservatively not considered a memory instruction.

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

In D46992#1105803, @mareko wrote:

In D46992#1105636, @rampitec wrote:

In D46992#1105635, @mareko wrote:

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you can see, the pass processes anything that can be cast to a load, store, atomic or memory intrinsic. Everything else is considered an ordinary instruction. For example, if you have an image in mind, it is conservatively not considered a memory instruction.

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

Ok, that's what I meant: it is considered an ordinary instruction. It is just not covered by the pass and there is no impact, primarily because no measurements were done, no workloads analyzed and no statistics collected to try to perform any optimizations of that kind on the compute side. The gfx side may have data collected, but the pass will do nothing for that kind of load.

Switched to use isLegalAddressingMode in GEP processing.

arsenm added inline comments.May 21 2018, 2:50 PM

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
36 ↗	(On Diff #147856)	Naming convention for existing flags seems to use the full word threshold everywhere (same for the rest)
159 ↗	(On Diff #147856)	Seems like a SmallSet?
240–242 ↗	(On Diff #147856)	Run clang-format
293 ↗	(On Diff #147856)	Move this up to avoid calling the same find twice? visit can also return the inserted iterator
315 ↗	(On Diff #147856)	I think there's a policy of generally avoiding FP computations
320 ↗	(On Diff #147856)	Ditto
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h
1 ↗	(On Diff #147856)	c++ mode comment
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
182 ↗	(On Diff #147856)	Missing space
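
The comments above about avoiding floating-point computations suggest doing the percentage checks in integer arithmetic. The sketch below shows one way to do that. The field names mirror `FuncInfo` from the patch and the default values match the `cl::opt` defaults in the diff, but the helper names and the exact way the weighted counters are combined are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Field names follow FuncInfo in AMDGPUPerfHintAnalysis.h.
struct FuncCounts {
  unsigned MemInstCount;
  unsigned InstCount;
  unsigned IAMInstCount; // indirect access memory instructions
  unsigned LSMInstCount; // large stride memory instructions
};

// True when Part / Whole exceeds ThresholdPercent, computed without floating
// point by cross-multiplying. 64-bit operands keep the products from
// overflowing for 32-bit counters.
bool exceedsPercent(uint64_t Part, uint64_t Whole, unsigned ThresholdPercent) {
  if (Whole == 0)
    return false; // an empty function is never flagged
  return Part * 100 > Whole * ThresholdPercent;
}

// Hypothetical check: share of memory instructions above the threshold.
bool looksMemoryBound(const FuncCounts &F, unsigned ThresholdPercent = 50) {
  return exceedsPercent(F.MemInstCount, F.InstCount, ThresholdPercent);
}

// Hypothetical check: indirect and large-stride accesses count with extra
// weight (assumed combination, not confirmed by the visible diff).
bool mayNeedWaveLimiter(const FuncCounts &F, unsigned IAWeight = 1000,
                        unsigned LSWeight = 1000,
                        unsigned ThresholdPercent = 50) {
  uint64_t Weighted = uint64_t(F.MemInstCount) +
                      uint64_t(F.IAMInstCount) * IAWeight +
                      uint64_t(F.LSMInstCount) * LSWeight;
  return exceedsPercent(Weighted, F.InstCount, ThresholdPercent);
}
```

With these defaults a single indirect access in a function of a thousand instructions is enough to trip the wave limiter check, while the plain memory-bound check stays false.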

rampitec updated this revision to Diff 147886.May 21 2018, 3:10 PM

rampitec marked 8 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
293 ↗	(On Diff #147856)	It is not the same; it is the one after the visit() call. But yes, changing visit to return the iterator.
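
Two points from this review thread, inserting the map entry at the start of the visit so that mutually recursive functions terminate, and returning the iterator so callers avoid a second lookup, can be sketched together. This is a simplified stand-in: it uses `std::map` and string names instead of the patch's `ValueMap` and `Function`, and `Info`, `CallGraph` and the counting rule are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Info {
  unsigned InstCount = 0;
};
using CallGraph = std::map<std::string, std::vector<std::string>>;
using InfoMap = std::map<std::string, Info>;

// Visits Name and its callees, accumulating instruction counts.
// Returns the iterator to Name's entry so the caller needs no extra find().
InfoMap::iterator visit(const std::string &Name, const CallGraph &CG,
                        InfoMap &FIM) {
  // Insert first: if recursion reaches Name again, the entry already exists
  // and the visit returns immediately instead of looping forever.
  auto Inserted = FIM.emplace(Name, Info{});
  InfoMap::iterator It = Inserted.first;
  if (!Inserted.second)
    return It; // already visited, or currently being visited

  It->second.InstCount = 1; // pretend each function has one own instruction

  auto Callees = CG.find(Name);
  if (Callees == CG.end())
    return It;
  for (const std::string &Callee : Callees->second) {
    // std::map iterators stay valid across insertions, so It is still usable.
    InfoMap::iterator CalleeIt = visit(Callee, CG, FIM);
    It->second.InstCount += CalleeIt->second.InstCount;
  }
  return It;
}
```

Note that the iterator-stability assumption holds for `std::map` but not for hash-based containers in general, so a real implementation would need to check what its map type guarantees.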

Small cleanup after last changes.

rampitec added a reviewer: arsenm.May 23 2018, 1:37 PM

LGTM. Just a hint: whenever you use "auto X = ..." it's worth specifying explicitly whether X is a pointer or a reference. It not only saves you from an accidental temporary object created by copy but also makes the program easier to read.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178 ↗	(On Diff #147937)	I would use "auto &Resolver" to emphasize Resolver is a reference not a temp object

rampitec marked an inline comment as done.May 25 2018, 9:56 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178 ↗	(On Diff #147937)	Resolver is a pointer, not an object or reference.
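
The exchange about `auto` can be made concrete with a short standalone sketch. The `Resolver` struct and the accessor functions here are hypothetical stand-ins and are not the LLVM types discussed in the review.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

struct Resolver {
  int Id;
};

Resolver *getResolver(Resolver &R) { return &R; }
std::vector<int> &getCounts(std::vector<int> &V) { return V; }

// Plain auto drops the reference and silently copies the vector;
// auto & binds to the original object.
bool copyIsDetached() {
  std::vector<int> V{1, 2};
  auto Copy = getCounts(V); // copy of V
  auto &Ref = getCounts(V); // alias of V
  Ref.push_back(3);
  return Copy.size() == 2 && V.size() == 3;
}

// For a function returning a pointer, auto and auto * deduce the same type;
// the explicit * only documents that the variable is a pointer.
bool pointerIsDeduced() {
  Resolver R{7};
  auto P1 = getResolver(R);
  auto *P2 = getResolver(R);
  static_assert(std::is_same<decltype(P1), decltype(P2)>::value,
                "auto and auto * deduce the same pointer type");
  return P1 == P2 && P2->Id == 7;
}
```

So for a pointer-returning call such as the one in the patch, `auto *` is purely a readability aid, whereas for reference-returning calls the choice between `auto` and `auto &` changes behavior.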

Rebase to master.
Moved info from SIMachineFunctionInfo into its parent AMDGPUMachineFunction since SI/R600 is not clearly separated in AMDGPUAsmPrinter anymore.

vpykhtin accepted this revision.May 25 2018, 10:16 AM

This revision is now accepted and ready to land.May 25 2018, 10:16 AM

Closed by commit rL333289: [AMDGPU] Add perf hints to functions (authored by rampitec).May 25 2018, 10:29 AM

This revision was automatically updated to reflect the committed changes.

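Revision Contents

Path	Size
llvm/trunk/lib/Target/AMDGPU/AMDGPU.h	3 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.h	4 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp	12 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp	3 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.h	14 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp	14 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h	55 lines
llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp	404 lines
llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt	1 line
llvm/trunk/lib/Target/AMDGPU/GCNSchedStrategy.cpp	12 lines
llvm/trunk/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll	10 lines
llvm/trunk/test/CodeGen/AMDGPU/perfhint.ll	85 lines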

Diff 148626

llvm/trunk/lib/Target/AMDGPU/AMDGPU.h

	extern char &SIFixWWMLivenessID;			extern char &SIFixWWMLivenessID;

	void initializeAMDGPUSimplifyLibCallsPass(PassRegistry &);			void initializeAMDGPUSimplifyLibCallsPass(PassRegistry &);
	extern char &AMDGPUSimplifyLibCallsID;			extern char &AMDGPUSimplifyLibCallsID;

	void initializeAMDGPUUseNativeCallsPass(PassRegistry &);			void initializeAMDGPUUseNativeCallsPass(PassRegistry &);
	extern char &AMDGPUUseNativeCallsID;			extern char &AMDGPUUseNativeCallsID;

				void initializeAMDGPUPerfHintAnalysisPass(PassRegistry &);
				extern char &AMDGPUPerfHintAnalysisID;

	// Passes common to R600 and SI			// Passes common to R600 and SI
	FunctionPass *createAMDGPUPromoteAlloca();			FunctionPass *createAMDGPUPromoteAlloca();
	void initializeAMDGPUPromoteAllocaPass(PassRegistry&);			void initializeAMDGPUPromoteAllocaPass(PassRegistry&);
	extern char &AMDGPUPromoteAllocaID;			extern char &AMDGPUPromoteAllocaID;

	Pass *createAMDGPUStructurizeCFGPass();			Pass *createAMDGPUStructurizeCFGPass();
	FunctionPass *createAMDGPUISelDag(			FunctionPass *createAMDGPUISelDag(
	TargetMachine *TM = nullptr,			TargetMachine *TM = nullptr,

llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.h

#include <cstdint>		#include <cstdint>
#include <limits>		#include <limits>
#include <memory>		#include <memory>
#include <string>		#include <string>
#include <vector>		#include <vector>

namespace llvm {		namespace llvm {

		class AMDGPUMachineFunction;
class AMDGPUTargetStreamer;		class AMDGPUTargetStreamer;
class MCOperand;		class MCOperand;
class SISubtarget;		class SISubtarget;

class AMDGPUAsmPrinter final : public AsmPrinter {		class AMDGPUAsmPrinter final : public AsmPrinter {
private:		private:
// Track resource usage for callee functions.		// Track resource usage for callee functions.
struct SIFunctionResourceInfo {		struct SIFunctionResourceInfo {
[... 99 lines hidden ...]	private:
/// can correctly setup the GPU state.		/// can correctly setup the GPU state.
void EmitProgramInfoSI(const MachineFunction &MF,		void EmitProgramInfoSI(const MachineFunction &MF,
const SIProgramInfo &KernelInfo);		const SIProgramInfo &KernelInfo);
void EmitPALMetadata(const MachineFunction &MF,		void EmitPALMetadata(const MachineFunction &MF,
const SIProgramInfo &KernelInfo);		const SIProgramInfo &KernelInfo);
void emitCommonFunctionComments(uint32_t NumVGPR,		void emitCommonFunctionComments(uint32_t NumVGPR,
uint32_t NumSGPR,		uint32_t NumSGPR,
uint64_t ScratchSize,		uint64_t ScratchSize,
uint64_t CodeSize);		uint64_t CodeSize,
		const AMDGPUMachineFunction* MFI);

public:		public:
explicit AMDGPUAsmPrinter(TargetMachine &TM,		explicit AMDGPUAsmPrinter(TargetMachine &TM,
std::unique_ptr<MCStreamer> Streamer);		std::unique_ptr<MCStreamer> Streamer);

StringRef getPassName() const override;		StringRef getPassName() const override;

const MCSubtargetInfo* getSTI() const;		const MCSubtargetInfo* getSTI() const;

llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

[... 272 lines hidden ...]	void AMDGPUAsmPrinter::readPALMetadata(Module &M) {
}		}
}		}

// Print comments that apply to both callable functions and entry points.		// Print comments that apply to both callable functions and entry points.
void AMDGPUAsmPrinter::emitCommonFunctionComments(		void AMDGPUAsmPrinter::emitCommonFunctionComments(
uint32_t NumVGPR,		uint32_t NumVGPR,
uint32_t NumSGPR,		uint32_t NumSGPR,
uint64_t ScratchSize,		uint64_t ScratchSize,
uint64_t CodeSize) {		uint64_t CodeSize,
		const AMDGPUMachineFunction *MFI) {
OutStreamer->emitRawComment(" codeLenInByte = " + Twine(CodeSize), false);		OutStreamer->emitRawComment(" codeLenInByte = " + Twine(CodeSize), false);
OutStreamer->emitRawComment(" NumSgprs: " + Twine(NumSGPR), false);		OutStreamer->emitRawComment(" NumSgprs: " + Twine(NumSGPR), false);
OutStreamer->emitRawComment(" NumVgprs: " + Twine(NumVGPR), false);		OutStreamer->emitRawComment(" NumVgprs: " + Twine(NumVGPR), false);
OutStreamer->emitRawComment(" ScratchSize: " + Twine(ScratchSize), false);		OutStreamer->emitRawComment(" ScratchSize: " + Twine(ScratchSize), false);
		OutStreamer->emitRawComment(" MemoryBound: " + Twine(MFI->isMemoryBound()),
		false);
}		}

bool AMDGPUAsmPrinter::runOnMachineFunction(MachineFunction &MF) {		bool AMDGPUAsmPrinter::runOnMachineFunction(MachineFunction &MF) {
CurrentProgramInfo = SIProgramInfo();		CurrentProgramInfo = SIProgramInfo();

const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();		const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();

// The starting address of all shader programs must be 256 bytes aligned.		// The starting address of all shader programs must be 256 bytes aligned.
[... 40 lines hidden ...]	if (isVerbose()) {

if (!MFI->isEntryFunction()) {		if (!MFI->isEntryFunction()) {
OutStreamer->emitRawComment(" Function info:", false);		OutStreamer->emitRawComment(" Function info:", false);
SIFunctionResourceInfo &Info = CallGraphResourceInfo[&MF.getFunction()];		SIFunctionResourceInfo &Info = CallGraphResourceInfo[&MF.getFunction()];
emitCommonFunctionComments(		emitCommonFunctionComments(
Info.NumVGPR,		Info.NumVGPR,
Info.getTotalNumSGPRs(MF.getSubtarget<SISubtarget>()),		Info.getTotalNumSGPRs(MF.getSubtarget<SISubtarget>()),
Info.PrivateSegmentSize,		Info.PrivateSegmentSize,
getFunctionCodeSize(MF));		getFunctionCodeSize(MF), MFI);
return false;		return false;
}		}

OutStreamer->emitRawComment(" Kernel info:", false);		OutStreamer->emitRawComment(" Kernel info:", false);
emitCommonFunctionComments(CurrentProgramInfo.NumVGPR,		emitCommonFunctionComments(CurrentProgramInfo.NumVGPR,
CurrentProgramInfo.NumSGPR,		CurrentProgramInfo.NumSGPR,
CurrentProgramInfo.ScratchSize,		CurrentProgramInfo.ScratchSize,
getFunctionCodeSize(MF));		getFunctionCodeSize(MF), MFI);

OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" FloatMode: " + Twine(CurrentProgramInfo.FloatMode), false);		" FloatMode: " + Twine(CurrentProgramInfo.FloatMode), false);
OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" IeeeMode: " + Twine(CurrentProgramInfo.IEEEMode), false);		" IeeeMode: " + Twine(CurrentProgramInfo.IEEEMode), false);
OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" LDSByteSize: " + Twine(CurrentProgramInfo.LDSSize) +		" LDSByteSize: " + Twine(CurrentProgramInfo.LDSSize) +
" bytes/workgroup (compile time only)", false);		" bytes/workgroup (compile time only)", false);
[... 12 lines hidden ...]	if (isVerbose()) {

OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" ReservedVGPRFirst: " + Twine(CurrentProgramInfo.ReservedVGPRFirst),		" ReservedVGPRFirst: " + Twine(CurrentProgramInfo.ReservedVGPRFirst),
false);		false);
OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" ReservedVGPRCount: " + Twine(CurrentProgramInfo.ReservedVGPRCount),		" ReservedVGPRCount: " + Twine(CurrentProgramInfo.ReservedVGPRCount),
false);		false);

		OutStreamer->emitRawComment(
		" WaveLimiterHint : " + Twine(MFI->needsWaveLimiter()), false);

if (MF.getSubtarget<SISubtarget>().debuggerEmitPrologue()) {		if (MF.getSubtarget<SISubtarget>().debuggerEmitPrologue()) {
OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" DebuggerWavefrontPrivateSegmentOffsetSGPR: s" +		" DebuggerWavefrontPrivateSegmentOffsetSGPR: s" +
Twine(CurrentProgramInfo.DebuggerWavefrontPrivateSegmentOffsetSGPR), false);		Twine(CurrentProgramInfo.DebuggerWavefrontPrivateSegmentOffsetSGPR), false);
OutStreamer->emitRawComment(		OutStreamer->emitRawComment(
" DebuggerPrivateSegmentBufferSGPR: s" +		" DebuggerPrivateSegmentBufferSGPR: s" +
Twine(CurrentProgramInfo.DebuggerPrivateSegmentBufferSGPR), false);		Twine(CurrentProgramInfo.DebuggerPrivateSegmentBufferSGPR), false);
}		}

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

/// Defines an instruction selector for the AMDGPU target.		/// Defines an instruction selector for the AMDGPU target.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUArgumentUsageInfo.h"		#include "AMDGPUArgumentUsageInfo.h"
#include "AMDGPUISelLowering.h" // For AMDGPUISD		#include "AMDGPUISelLowering.h" // For AMDGPUISD
#include "AMDGPUInstrInfo.h"		#include "AMDGPUInstrInfo.h"
		#include "AMDGPUPerfHintAnalysis.h"
#include "AMDGPURegisterInfo.h"		#include "AMDGPURegisterInfo.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "AMDGPUTargetMachine.h"		#include "AMDGPUTargetMachine.h"
#include "SIDefines.h"		#include "SIDefines.h"
#include "SIISelLowering.h"		#include "SIISelLowering.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
[... 53 lines hidden ...]	explicit AMDGPUDAGToDAGISel(TargetMachine *TM = nullptr,
: SelectionDAGISel(*TM, OptLevel) {		: SelectionDAGISel(*TM, OptLevel) {
AMDGPUASI = AMDGPU::getAMDGPUAS(*TM);		AMDGPUASI = AMDGPU::getAMDGPUAS(*TM);
EnableLateStructurizeCFG = AMDGPUTargetMachine::EnableLateStructurizeCFG;		EnableLateStructurizeCFG = AMDGPUTargetMachine::EnableLateStructurizeCFG;
}		}
~AMDGPUDAGToDAGISel() override = default;		~AMDGPUDAGToDAGISel() override = default;

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AMDGPUArgumentUsageInfo>();		AU.addRequired<AMDGPUArgumentUsageInfo>();
		AU.addRequired<AMDGPUPerfHintAnalysis>();
AU.addRequired<DivergenceAnalysis>();		AU.addRequired<DivergenceAnalysis>();
SelectionDAGISel::getAnalysisUsage(AU);		SelectionDAGISel::getAnalysisUsage(AU);
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;
void Select(SDNode *N) override;		void Select(SDNode *N) override;
StringRef getPassName() const override;		StringRef getPassName() const override;
void PostprocessISelDAG() override;		void PostprocessISelDAG() override;
[... 141 lines hidden ...]	bool SelectADDRVTX_READ(SDValue Addr, SDValue &Base,
SDValue &Offset) override;		SDValue &Offset) override;
};		};

} // end anonymous namespace		} // end anonymous namespace

INITIALIZE_PASS_BEGIN(AMDGPUDAGToDAGISel, "isel",		INITIALIZE_PASS_BEGIN(AMDGPUDAGToDAGISel, "isel",
"AMDGPU DAG->DAG Pattern Instruction Selection", false, false)		"AMDGPU DAG->DAG Pattern Instruction Selection", false, false)
INITIALIZE_PASS_DEPENDENCY(AMDGPUArgumentUsageInfo)		INITIALIZE_PASS_DEPENDENCY(AMDGPUArgumentUsageInfo)
		INITIALIZE_PASS_DEPENDENCY(AMDGPUPerfHintAnalysis)
INITIALIZE_PASS_DEPENDENCY(DivergenceAnalysis)		INITIALIZE_PASS_DEPENDENCY(DivergenceAnalysis)
INITIALIZE_PASS_END(AMDGPUDAGToDAGISel, "isel",		INITIALIZE_PASS_END(AMDGPUDAGToDAGISel, "isel",
"AMDGPU DAG->DAG Pattern Instruction Selection", false, false)		"AMDGPU DAG->DAG Pattern Instruction Selection", false, false)

/// This pass converts a legalized DAG into a AMDGPU-specific		/// This pass converts a legalized DAG into a AMDGPU-specific
// DAG, ready for instruction scheduling.		// DAG, ready for instruction scheduling.
FunctionPass *llvm::createAMDGPUISelDag(TargetMachine *TM,		FunctionPass *llvm::createAMDGPUISelDag(TargetMachine *TM,
CodeGenOpt::Level OptLevel) {		CodeGenOpt::Level OptLevel) {

llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.h

[... 30 lines hidden ...]	class AMDGPUMachineFunction : public MachineFunctionInfo {
unsigned ABIArgOffset;		unsigned ABIArgOffset;

// Kernels + shaders. i.e. functions called by the driver and not called		// Kernels + shaders. i.e. functions called by the driver and not called
// by other functions.		// by other functions.
bool IsEntryFunction;		bool IsEntryFunction;

bool NoSignedZerosFPMath;		bool NoSignedZerosFPMath;

		// Function may be memory bound.
		bool MemoryBound;

		// Kernel may need limited waves per EU for better performance.
		bool WaveLimiter;

public:		public:
AMDGPUMachineFunction(const MachineFunction &MF);		AMDGPUMachineFunction(const MachineFunction &MF);

uint64_t allocateKernArg(uint64_t Size, unsigned Align) {		uint64_t allocateKernArg(uint64_t Size, unsigned Align) {
assert(isPowerOf2_32(Align));		assert(isPowerOf2_32(Align));
KernArgSize = alignTo(KernArgSize, Align);		KernArgSize = alignTo(KernArgSize, Align);

uint64_t Result = KernArgSize;		uint64_t Result = KernArgSize;
[... 26 lines hidden ...]	public:
bool isEntryFunction() const {		bool isEntryFunction() const {
return IsEntryFunction;		return IsEntryFunction;
}		}

bool hasNoSignedZerosFPMath() const {		bool hasNoSignedZerosFPMath() const {
return NoSignedZerosFPMath;		return NoSignedZerosFPMath;
}		}

		bool isMemoryBound() const {
		return MemoryBound;
		}

		bool needsWaveLimiter() const {
		return WaveLimiter;
		}

unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalValue &GV);		unsigned allocateLDSGlobal(const DataLayout &DL, const GlobalValue &GV);
};		};

}		}
#endif		#endif

llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp

	//===-- AMDGPUMachineFunctionInfo.cpp ---------------------------------------=//			//===-- AMDGPUMachineFunctionInfo.cpp ---------------------------------------=//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "AMDGPUMachineFunction.h"			#include "AMDGPUMachineFunction.h"
	#include "AMDGPUSubtarget.h"			#include "AMDGPUSubtarget.h"
				#include "AMDGPUPerfHintAnalysis.h"
				#include "llvm/CodeGen/MachineModuleInfo.h"

	using namespace llvm;			using namespace llvm;

	AMDGPUMachineFunction::AMDGPUMachineFunction(const MachineFunction &MF) :			AMDGPUMachineFunction::AMDGPUMachineFunction(const MachineFunction &MF) :
	MachineFunctionInfo(),			MachineFunctionInfo(),
	LocalMemoryObjects(),			LocalMemoryObjects(),
	KernArgSize(0),			KernArgSize(0),
	MaxKernArgAlign(0),			MaxKernArgAlign(0),
	LDSSize(0),			LDSSize(0),
	ABIArgOffset(0),			ABIArgOffset(0),
	IsEntryFunction(AMDGPU::isEntryFunctionCC(MF.getFunction().getCallingConv())),			IsEntryFunction(AMDGPU::isEntryFunctionCC(MF.getFunction().getCallingConv())),
	NoSignedZerosFPMath(MF.getTarget().Options.NoSignedZerosFPMath) {			NoSignedZerosFPMath(MF.getTarget().Options.NoSignedZerosFPMath),
				MemoryBound(false),
				WaveLimiter(false) {
	// FIXME: Should initialize KernArgSize based on ExplicitKernelArgOffset,			// FIXME: Should initialize KernArgSize based on ExplicitKernelArgOffset,
	// except reserved size is not correctly aligned.			// except reserved size is not correctly aligned.

				if (auto *Resolver = MF.getMMI().getResolver()) {
				if (AMDGPUPerfHintAnalysis *PHA = static_cast<AMDGPUPerfHintAnalysis*>(
				Resolver->getAnalysisIfAvailable(&AMDGPUPerfHintAnalysisID, true))) {
				MemoryBound = PHA->isMemoryBound(&MF.getFunction());
				WaveLimiter = PHA->needsWaveLimiter(&MF.getFunction());
				}
				}
	}			}

	unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,			unsigned AMDGPUMachineFunction::allocateLDSGlobal(const DataLayout &DL,
	const GlobalValue &GV) {			const GlobalValue &GV) {
	auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));			auto Entry = LocalMemoryObjects.insert(std::make_pair(&GV, 0));
	if (!Entry.second)			if (!Entry.second)
	return Entry.first->second;			return Entry.first->second;


llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h

				//===- AMDGPUPerfHintAnalysis.h - analysis of functions memory traffic ----===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief Analyzes if a function potentially memory bound and if a kernel
				/// kernel may benefit from limiting number of waves to reduce cache thrashing.
				///
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H
				#define LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H
				#include "llvm/IR/ValueMap.h"
				#include "llvm/Pass.h"

				namespace llvm {

				struct AMDGPUPerfHintAnalysis : public FunctionPass {
				static char ID;

				public:
				AMDGPUPerfHintAnalysis() : FunctionPass(ID) {}

				bool runOnFunction(Function &F) override;

				void getAnalysisUsage(AnalysisUsage &AU) const {
				AU.setPreservesAll();
				}

				bool isMemoryBound(const Function *F) const;

				bool needsWaveLimiter(const Function *F) const;

				struct FuncInfo {
				unsigned MemInstCount;
				unsigned InstCount;
				unsigned IAMInstCount; // Indirect access memory instruction count
				unsigned LSMInstCount; // Large stride memory instruction count
				FuncInfo() : MemInstCount(0), InstCount(0), IAMInstCount(0),
				LSMInstCount(0) {}
				};

				typedef ValueMap<const Function*, FuncInfo> FuncInfoMap;

				private:

				FuncInfoMap FIM;
				};
				} // namespace llvm
				#endif // LLVM_LIB_TARGET_AMDGPU_MDGPUPERFHINTANALYSIS_H

llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp

				//===- AMDGPUPerfHintAnalysis.cpp - analysis of functions memory traffic --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// \brief Analyzes if a function potentially memory bound and if a kernel
				/// kernel may benefit from limiting number of waves to reduce cache thrashing.
				///
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "AMDGPUPerfHintAnalysis.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "llvm/ADT/SmallSet.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/CodeGen/TargetLowering.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/CodeGen/TargetSubtargetInfo.h"
				#include "llvm/IR/Constants.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/ValueMap.h"
				#include "llvm/Support/CommandLine.h"

				using namespace llvm;

				#define DEBUG_TYPE "amdgpu-perf-hint"

				static cl::opt<unsigned>
				MemBoundThresh("amdgpu-membound-threshold", cl::init(50), cl::Hidden,
				cl::desc("Function mem bound threshold in %"));

				static cl::opt<unsigned>
				LimitWaveThresh("amdgpu-limit-wave-threshold", cl::init(50), cl::Hidden,
				cl::desc("Kernel limit wave threshold in %"));

				static cl::opt<unsigned>
				IAWeight("amdgpu-indirect-access-weight", cl::init(1000), cl::Hidden,
				cl::desc("Indirect access memory instruction weight"));

				static cl::opt<unsigned>
				LSWeight("amdgpu-large-stride-weight", cl::init(1000), cl::Hidden,
				cl::desc("Large stride memory access weight"));

				static cl::opt<unsigned>
				LargeStrideThresh("amdgpu-large-stride-threshold", cl::init(64), cl::Hidden,
				cl::desc("Large stride memory access threshold"));

				STATISTIC(NumMemBound, "Number of functions marked as memory bound");
				STATISTIC(NumLimitWave, "Number of functions marked as needing limit wave");

				char llvm::AMDGPUPerfHintAnalysis::ID = 0;
				char &llvm::AMDGPUPerfHintAnalysisID = AMDGPUPerfHintAnalysis::ID;

				INITIALIZE_PASS(AMDGPUPerfHintAnalysis, DEBUG_TYPE,
				"Analysis if a function is memory bound", true, true)

				namespace {

				struct AMDGPUPerfHint {
				friend AMDGPUPerfHintAnalysis;

				public:
				AMDGPUPerfHint(AMDGPUPerfHintAnalysis::FuncInfoMap &FIM_,
				const TargetLowering *TLI_)
				: FIM(FIM_), DL(nullptr), TLI(TLI_) {}

				void runOnFunction(Function &F);

				private:
				struct MemAccessInfo {
				const Value *V;
				const Value *Base;
				int64_t Offset;
				MemAccessInfo() : V(nullptr), Base(nullptr), Offset(0) {}
				bool isLargeStride(MemAccessInfo &Reference) const;
				#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
				Printable print() const {
				return Printable([this](raw_ostream &OS) {
				OS << "Value: " << *V << '\n'
				<< "Base: " << *Base << " Offset: " << Offset << '\n';
				});
				}
				#endif
				};

				MemAccessInfo makeMemAccessInfo(Instruction *) const;

				MemAccessInfo LastAccess; // Last memory access info

				AMDGPUPerfHintAnalysis::FuncInfoMap &FIM;

				const DataLayout *DL;

				AMDGPUAS AS;

				const TargetLowering *TLI;

				AMDGPUPerfHintAnalysis::FuncInfoMap::iterator visit(const Function &F);
				static bool isMemBound(const AMDGPUPerfHintAnalysis::FuncInfo &F);
				static bool needLimitWave(const AMDGPUPerfHintAnalysis::FuncInfo &F);

				bool isIndirectAccess(const Instruction *Inst) const;

				/// Check if the instruction is large stride.
				/// The purpose is to identify memory access pattern like:
				/// x = a[i];
				/// y = a[i+1000];
				/// z = a[i+2000];
				/// In the above example, the second and third memory access will be marked
				/// large stride memory access.
				bool isLargeStride(const Instruction *Inst);

				bool isGlobalAddr(const Value *V) const;
				bool isLocalAddr(const Value *V) const;
				bool isConstantAddr(const Value *V) const;
				};

				static const Value *getMemoryInstrPtr(const Instruction *Inst) {
				if (auto LI = dyn_cast<LoadInst>(Inst)) {
				return LI->getPointerOperand();
				}
				if (auto SI = dyn_cast<StoreInst>(Inst)) {
				return SI->getPointerOperand();
				}
				if (auto AI = dyn_cast<AtomicCmpXchgInst>(Inst)) {
				return AI->getPointerOperand();
				}
				if (auto AI = dyn_cast<AtomicRMWInst>(Inst)) {
				return AI->getPointerOperand();
				}
				if (auto MI = dyn_cast<AnyMemIntrinsic>(Inst)) {
				return MI->getRawDest();
				}

				return nullptr;
				}

				bool AMDGPUPerfHint::isIndirectAccess(const Instruction *Inst) const {
				LLVM_DEBUG(dbgs() << "[isIndirectAccess] " << *Inst << '\n');
				SmallSet<const Value *, 32> WorkSet;
				SmallSet<const Value *, 32> Visited;
				if (const Value *MO = getMemoryInstrPtr(Inst)) {
				if (isGlobalAddr(MO))
				WorkSet.insert(MO);
				}

				while (!WorkSet.empty()) {
				const Value *V = *WorkSet.begin();
				WorkSet.erase(*WorkSet.begin());
				if (!Visited.insert(V).second)
				continue;
				LLVM_DEBUG(dbgs() << " check: " << *V << '\n');

				if (auto LD = dyn_cast<LoadInst>(V)) {
				auto M = LD->getPointerOperand();
				if (isGlobalAddr(M) || isLocalAddr(M) || isConstantAddr(M)) {
				LLVM_DEBUG(dbgs() << " is IA\n");
				return true;
				}
				continue;
				}

				if (auto GEP = dyn_cast<GetElementPtrInst>(V)) {
				auto P = GEP->getPointerOperand();
				WorkSet.insert(P);
				for (unsigned I = 1, E = GEP->getNumIndices() + 1; I != E; ++I)
				WorkSet.insert(GEP->getOperand(I));
				continue;
				}

				if (auto U = dyn_cast<UnaryInstruction>(V)) {
				WorkSet.insert(U->getOperand(0));
				continue;
				}

				if (auto BO = dyn_cast<BinaryOperator>(V)) {
				WorkSet.insert(BO->getOperand(0));
				WorkSet.insert(BO->getOperand(1));
				continue;
				}

				if (auto S = dyn_cast<SelectInst>(V)) {
				WorkSet.insert(S->getFalseValue());
				WorkSet.insert(S->getTrueValue());
				continue;
				}

				if (auto E = dyn_cast<ExtractElementInst>(V)) {
				WorkSet.insert(E->getVectorOperand());
				continue;
				}

				if (auto Phi = dyn_cast<PHINode>(V)) {
				for (unsigned I = 0, E = Phi->getNumIncomingValues(); I != E; ++I)
				WorkSet.insert(Phi->getIncomingValue(I));
				continue;
				}

				LLVM_DEBUG(dbgs() << " dropped\n");
				}

				LLVM_DEBUG(dbgs() << " is not IA\n");
				return false;
				}

				AMDGPUPerfHintAnalysis::FuncInfoMap::iterator
				AMDGPUPerfHint::visit(const Function &F) {
				auto FIP = FIM.insert(std::make_pair(&F, AMDGPUPerfHintAnalysis::FuncInfo()));
				if (!FIP.second)
				return FIP.first;

				AMDGPUPerfHintAnalysis::FuncInfo &FI = FIP.first->second;

				LLVM_DEBUG(dbgs() << "[AMDGPUPerfHint] process " << F.getName() << '\n');

				for (auto &B : F) {
				LastAccess = MemAccessInfo();
				for (auto &I : B) {
				if (getMemoryInstrPtr(&I)) {
				if (isIndirectAccess(&I))
				++FI.IAMInstCount;
				if (isLargeStride(&I))
				++FI.LSMInstCount;
				++FI.MemInstCount;
				++FI.InstCount;
				continue;
				}
				CallSite CS(const_cast<Instruction *>(&I));
				if (CS) {
				Function *Callee = CS.getCalledFunction();
				if (!Callee || Callee->isDeclaration()) {
				++FI.InstCount;
				continue;
				}
				if (&F == Callee) // Handle immediate recursion
				continue;

				auto Loc = visit(*Callee);

				assert(Loc != FIM.end() && "No func info");
				FI.MemInstCount += Loc->second.MemInstCount;
				FI.InstCount += Loc->second.InstCount;
				FI.IAMInstCount += Loc->second.IAMInstCount;
				FI.LSMInstCount += Loc->second.LSMInstCount;
				} else if (auto *GEP = dyn_cast<GetElementPtrInst>(&I)) {
				TargetLoweringBase::AddrMode AM;
				auto *Ptr = GetPointerBaseWithConstantOffset(GEP, AM.BaseOffs, *DL);
				AM.BaseGV = dyn_cast_or_null<GlobalValue>(const_cast<Value *>(Ptr));
				AM.HasBaseReg = !AM.BaseGV;
				if (TLI->isLegalAddressingMode(*DL, AM, GEP->getResultElementType(),
				GEP->getPointerAddressSpace()))
				// Offset will likely be folded into load or store
				continue;
				++FI.InstCount;
				} else {
				++FI.InstCount;
				}
				}
				}

				return FIP.first;
				}

				void AMDGPUPerfHint::runOnFunction(Function &F) {
				if (FIM.find(&F) != FIM.end())
				return;

				const Module &M = *F.getParent();
				DL = &M.getDataLayout();
				AS = AMDGPU::getAMDGPUAS(M);

				auto Loc = visit(F);

				assert(Loc != FIM.end() && "No func info");
				LLVM_DEBUG(dbgs() << F.getName() << " MemInst: " << Loc->second.MemInstCount
				<< '\n'
				<< " IAMInst: " << Loc->second.IAMInstCount << '\n'
				<< " LSMInst: " << Loc->second.LSMInstCount << '\n'
				<< " TotalInst: " << Loc->second.InstCount << '\n');

				auto &FI = Loc->second;

				if (isMemBound(FI)) {
				LLVM_DEBUG(dbgs() << F.getName() << " is memory bound\n");
				NumMemBound++;
				}

				if (AMDGPU::isEntryFunctionCC(F.getCallingConv()) && needLimitWave(FI)) {
				LLVM_DEBUG(dbgs() << F.getName() << " needs limit wave\n");
				NumLimitWave++;
				}
				}

				bool AMDGPUPerfHint::isMemBound(const AMDGPUPerfHintAnalysis::FuncInfo &FI) {
				return FI.MemInstCount * 100 / FI.InstCount > MemBoundThresh;
				}

				bool AMDGPUPerfHint::needLimitWave(const AMDGPUPerfHintAnalysis::FuncInfo &FI) {
				return ((FI.MemInstCount + FI.IAMInstCount * IAWeight +
				FI.LSMInstCount * LSWeight) *
				100 / FI.InstCount) > LimitWaveThresh;
				}

				bool AMDGPUPerfHint::isGlobalAddr(const Value *V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType())) {
				unsigned As = PT->getAddressSpace();
				// Flat likely points to global too.
				return As == AS.GLOBAL_ADDRESS || As == AS.FLAT_ADDRESS;
				}
				return false;
				}

				bool AMDGPUPerfHint::isLocalAddr(const Value *V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType()))
				return PT->getAddressSpace() == AS.LOCAL_ADDRESS;
				return false;
				}

				bool AMDGPUPerfHint::isLargeStride(const Instruction *Inst) {
				LLVM_DEBUG(dbgs() << "[isLargeStride] " << *Inst << '\n');

				MemAccessInfo MAI = makeMemAccessInfo(const_cast<Instruction *>(Inst));
				bool IsLargeStride = MAI.isLargeStride(LastAccess);
				if (MAI.Base)
				LastAccess = std::move(MAI);

				return IsLargeStride;
				}

				AMDGPUPerfHint::MemAccessInfo
				AMDGPUPerfHint::makeMemAccessInfo(Instruction *Inst) const {
				MemAccessInfo MAI;
				const Value *MO = getMemoryInstrPtr(Inst);

				LLVM_DEBUG(dbgs() << "[isLargeStride] MO: " << *MO << '\n');
				// Do not treat local-addr memory access as large stride.
				if (isLocalAddr(MO))
				return MAI;

				MAI.V = MO;
				MAI.Base = GetPointerBaseWithConstantOffset(MO, MAI.Offset, *DL);
				return MAI;
				}

				bool AMDGPUPerfHint::isConstantAddr(const Value *V) const {
				if (auto PT = dyn_cast<PointerType>(V->getType())) {
				unsigned As = PT->getAddressSpace();
				return As == AS.CONSTANT_ADDRESS || As == AS.CONSTANT_ADDRESS_32BIT;
				}
				return false;
				}

				bool AMDGPUPerfHint::MemAccessInfo::isLargeStride(
				MemAccessInfo &Reference) const {

				if (!Base || !Reference.Base || Base != Reference.Base)
				return false;

				uint64_t Diff = Offset > Reference.Offset ? Offset - Reference.Offset
				: Reference.Offset - Offset;
				bool Result = Diff > LargeStrideThresh;
				LLVM_DEBUG(dbgs() << "[isLargeStride compare]\n"
				<< print() << "<=>\n"
				<< Reference.print() << "Result:" << Result << '\n');
				return Result;
				}
				} // namespace

				bool AMDGPUPerfHintAnalysis::runOnFunction(Function &F) {
				auto *TPC = getAnalysisIfAvailable<TargetPassConfig>();
				if (!TPC)
				return false;

				const TargetMachine &TM = TPC->getTM<TargetMachine>();
				const TargetSubtargetInfo *ST = TM.getSubtargetImpl(F);

				AMDGPUPerfHint Analyzer(FIM, ST->getTargetLowering());
				Analyzer.runOnFunction(F);
				return false;
				}

				bool AMDGPUPerfHintAnalysis::isMemoryBound(const Function *F) const {
				auto FI = FIM.find(F);
				if (FI == FIM.end())
				return false;

				return AMDGPUPerfHint::isMemBound(FI->second);
				}

				bool AMDGPUPerfHintAnalysis::needsWaveLimiter(const Function *F) const {
				auto FI = FIM.find(F);
				if (FI == FIM.end())
				return false;

				return AMDGPUPerfHint::needLimitWave(FI->second);
				}
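
The two hint predicates above reduce to integer percentage checks over the per-function counters. The following is a minimal standalone sketch of that arithmetic, not the pass itself: the `FuncInfo` struct here is a hypothetical stand-in that only mirrors the field names used in the listing, and the constants are the `cl::init` defaults of the corresponding options. Like the listing, it assumes `InstCount` is nonzero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for AMDGPUPerfHintAnalysis::FuncInfo; only the field
// names are taken from the listing.
struct FuncInfo {
  uint64_t MemInstCount; // all memory instructions
  uint64_t InstCount;    // all counted instructions (assumed nonzero here)
  uint64_t IAMInstCount; // indirect-access memory instructions
  uint64_t LSMInstCount; // large-stride memory instructions
};

// Default values of the hidden command-line options in the listing.
constexpr unsigned MemBoundThresh = 50;  // amdgpu-membound-threshold, in %
constexpr unsigned LimitWaveThresh = 50; // amdgpu-limit-wave-threshold, in %
constexpr unsigned IAWeight = 1000;      // amdgpu-indirect-access-weight
constexpr unsigned LSWeight = 1000;      // amdgpu-large-stride-weight

// Memory bound: memory instructions exceed the threshold share of all
// instructions. Integer division, strict comparison.
bool isMemBound(const FuncInfo &FI) {
  return FI.MemInstCount * 100 / FI.InstCount > MemBoundThresh;
}

// Wave limiter: same ratio, but indirect and large-stride accesses are
// weighted so heavily that a single one usually trips the threshold.
bool needLimitWave(const FuncInfo &FI) {
  return (FI.MemInstCount + FI.IAMInstCount * IAWeight +
          FI.LSMInstCount * LSWeight) *
             100 / FI.InstCount >
         LimitWaveThresh;
}
```

With these defaults, a function with 2 memory instructions out of 20 is not memory bound (10%), but one large-stride access among them gives (2 + 1000) * 100 / 20 = 5010, which is well above the wave-limiter threshold.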

llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt

[52 lines not shown]
  add_llvm_target(AMDGPUCodeGen
  AMDGPURewriteOutArguments.cpp
  AMDGPUSubtarget.cpp
  AMDGPUTargetMachine.cpp
  AMDGPUTargetObjectFile.cpp
  AMDGPUTargetTransformInfo.cpp
  AMDGPUUnifyDivergentExitNodes.cpp
  AMDGPUUnifyMetadata.cpp
  AMDGPUInline.cpp
+ AMDGPUPerfHintAnalysis.cpp
  AMDILCFGStructurizer.cpp
  GCNHazardRecognizer.cpp
  GCNIterativeScheduler.cpp
  GCNMinRegStrategy.cpp
  GCNRegPressure.cpp
  GCNSchedStrategy.cpp
  R600AsmPrinter.cpp
  R600ClauseMergePass.cpp
[44 lines not shown]

llvm/trunk/lib/Target/AMDGPU/GCNSchedStrategy.cpp

[366 lines not shown]
  void GCNScheduleDAGMILive::schedule() {
  WavesAfter = std::min(WavesAfter, MFI.getMaxWavesPerEU());
  WavesBefore = std::min(WavesBefore, MFI.getMaxWavesPerEU());
  LLVM_DEBUG(dbgs() << "Occupancy before scheduling: " << WavesBefore
  << ", after " << WavesAfter << ".\n");

  // We could not keep current target occupancy because of the just scheduled
  // region. Record new occupancy for next scheduling cycle.
  unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
+ // Allow memory bound functions to drop to 4 waves if not limited by an
+ // attribute.
+ unsigned MinMemBoundWaves = std::max(MFI.getMinWavesPerEU(), 4u);
+ if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
+ WavesAfter >= MinMemBoundWaves &&
+ (MFI.isMemoryBound() || MFI.needsWaveLimiter())) {
+ LLVM_DEBUG(dbgs() << "Function is memory bound, allow occupancy drop up to "
+ << MinMemBoundWaves << " waves\n");
+ NewOccupancy = WavesAfter;
+ }
  if (NewOccupancy < MinOccupancy) {
  MinOccupancy = NewOccupancy;
  LLVM_DEBUG(dbgs() << "Occupancy lowered for the function to "
  << MinOccupancy << ".\n");
  }

- if (WavesAfter >= WavesBefore) {
+ if (WavesAfter >= MinOccupancy) {
  Pressure[RegionIdx] = PressureAfter;
  return;
  }

  LLVM_DEBUG(dbgs() << "Attempting to revert scheduling.\n");
  RegionEnd = RegionBegin;
  for (MachineInstr *MI : Unsched) {
  if (MI->isDebugInstr())
[166 lines not shown]
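
The occupancy decision in this hunk can be read in isolation from the scheduler. Below is a rough sketch of just that decision as a pure function, under the assumption that the inputs shown are all it depends on. `Decision` and `decide` are hypothetical names introduced for illustration, and the `MemBoundOrWaveLimited` parameter stands in for `MFI.isMemoryBound() || MFI.needsWaveLimiter()`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: the updated function-wide minimum occupancy and
// whether the new schedule for the region is kept (true) or reverted (false).
struct Decision {
  unsigned MinOccupancy;
  bool KeepSchedule;
};

// Mirrors the patched logic: a memory-bound function may drop occupancy, but
// not below max(MinWavesPerEU, 4) waves.
Decision decide(unsigned WavesBefore, unsigned WavesAfter,
                unsigned MinOccupancy, unsigned MinWavesPerEU,
                bool MemBoundOrWaveLimited) {
  unsigned NewOccupancy = std::max(WavesAfter, WavesBefore);
  unsigned MinMemBoundWaves = std::max(MinWavesPerEU, 4u);
  if (WavesAfter < WavesBefore && WavesAfter < MinOccupancy &&
      WavesAfter >= MinMemBoundWaves && MemBoundOrWaveLimited)
    NewOccupancy = WavesAfter; // accept the drop
  if (NewOccupancy < MinOccupancy)
    MinOccupancy = NewOccupancy;
  // After the patch the schedule is kept when it meets the (possibly lowered)
  // minimum occupancy, rather than only when occupancy did not decrease.
  return {MinOccupancy, WavesAfter >= MinOccupancy};
}
```

For example, dropping from 10 to 6 waves is accepted for a memory-bound function, but the same drop is reverted if the function is not memory bound, or if a waves-per-EU attribute raises the floor to 8.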

llvm/trunk/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll

- ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX700 --check-prefix=NOTES %s
- ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX803 --check-prefix=NOTES %s
- ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX900 --check-prefix=NOTES %s
+ ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX700 --check-prefix=NOTES %s
+ ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX803 --check-prefix=NOTES %s
+ ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX900 --check-prefix=NOTES %s

  @var = addrspace(1) global float 0.0

  ; CHECK: ---
  ; CHECK: Version: [ 1, 0 ]
  ; CHECK: Kernels:

  ; CHECK: - Name: test
  ; CHECK: SymbolName: 'test@kd'
  ; CHECK: CodeProps:
  ; CHECK: KernargSegmentSize: 24
  ; CHECK: GroupSegmentFixedSize: 0
  ; CHECK: PrivateSegmentFixedSize: 0
  ; CHECK: KernargSegmentAlign: 8
  ; CHECK: WavefrontSize: 64
  ; CHECK: NumSGPRs: 6
- ; GFX700: NumVGPRs: 4
- ; GFX803: NumVGPRs: 6
- ; GFX900: NumVGPRs: 6
+ ; CHECK: NumVGPRs: 3
  ; CHECK: MaxFlatWorkGroupSize: 256
  define amdgpu_kernel void @test(
  half addrspace(1)* %r,
  half addrspace(1)* %a,
  half addrspace(1)* %b) {
  entry:
  %a.val = load half, half addrspace(1)* %a
  %b.val = load half, half addrspace(1)* %b
[112 lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/perfhint.ll

				; RUN: llc -march=amdgcn < %s | FileCheck -check-prefix=GCN %s

				; GCN-LABEL: {{^}}test_membound:
				; MemoryBound: 1
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_membound(<4 x i32> addrspace(1)* nocapture readonly %arg, <4 x i32> addrspace(1)* nocapture %arg1) {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp2 = zext i32 %tmp to i64
				%tmp3 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp2
				%tmp4 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp3, align 16
				%tmp5 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp2
				store <4 x i32> %tmp4, <4 x i32> addrspace(1)* %tmp5, align 16
				%tmp6 = add nuw nsw i64 %tmp2, 1
				%tmp7 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp6
				%tmp8 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp7, align 16
				%tmp9 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp6
				store <4 x i32> %tmp8, <4 x i32> addrspace(1)* %tmp9, align 16
				%tmp10 = add nuw nsw i64 %tmp2, 2
				%tmp11 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp10
				%tmp12 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp11, align 16
				%tmp13 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp10
				store <4 x i32> %tmp12, <4 x i32> addrspace(1)* %tmp13, align 16
				%tmp14 = add nuw nsw i64 %tmp2, 3
				%tmp15 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg, i64 %tmp14
				%tmp16 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp15, align 16
				%tmp17 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(1)* %arg1, i64 %tmp14
				store <4 x i32> %tmp16, <4 x i32> addrspace(1)* %tmp17, align 16
				ret void
				}

				; GCN-LABEL: {{^}}test_large_stride:
				; MemoryBound: 0
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_large_stride(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 4096
				%tmp1 = load i32, i32 addrspace(1)* %tmp, align 4
				%tmp2 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 1
				store i32 %tmp1, i32 addrspace(1)* %tmp2, align 4
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 8192
				%tmp4 = load i32, i32 addrspace(1)* %tmp3, align 4
				%tmp5 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 2
				store i32 %tmp4, i32 addrspace(1)* %tmp5, align 4
				%tmp6 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 12288
				%tmp7 = load i32, i32 addrspace(1)* %tmp6, align 4
				%tmp8 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 3
				store i32 %tmp7, i32 addrspace(1)* %tmp8, align 4
				ret void
				}

				; GCN-LABEL: {{^}}test_indirect:
				; MemoryBound: 0
				; WaveLimiterHint : 1
				define amdgpu_kernel void @test_indirect(i32 addrspace(1)* nocapture %arg) {
				bb:
				%tmp = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 1
				%tmp1 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 2
				%tmp2 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 3
				%tmp3 = bitcast i32 addrspace(1)* %arg to <4 x i32> addrspace(1)*
				%tmp4 = load <4 x i32>, <4 x i32> addrspace(1)* %tmp3, align 4
				%tmp5 = extractelement <4 x i32> %tmp4, i32 0
				%tmp6 = sext i32 %tmp5 to i64
				%tmp7 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp6
				%tmp8 = load i32, i32 addrspace(1)* %tmp7, align 4
				store i32 %tmp8, i32 addrspace(1)* %arg, align 4
				%tmp9 = extractelement <4 x i32> %tmp4, i32 1
				%tmp10 = sext i32 %tmp9 to i64
				%tmp11 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp10
				%tmp12 = load i32, i32 addrspace(1)* %tmp11, align 4
				store i32 %tmp12, i32 addrspace(1)* %tmp, align 4
				%tmp13 = extractelement <4 x i32> %tmp4, i32 2
				%tmp14 = sext i32 %tmp13 to i64
				%tmp15 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp14
				%tmp16 = load i32, i32 addrspace(1)* %tmp15, align 4
				store i32 %tmp16, i32 addrspace(1)* %tmp1, align 4
				%tmp17 = extractelement <4 x i32> %tmp4, i32 3
				%tmp18 = sext i32 %tmp17 to i64
				%tmp19 = getelementptr inbounds i32, i32 addrspace(1)* %arg, i64 %tmp18
				%tmp20 = load i32, i32 addrspace(1)* %tmp19, align 4
				store i32 %tmp20, i32 addrspace(1)* %tmp2, align 4
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x()
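
To see why `test_large_stride` is expected to trip the wave limiter, the large-stride check can be modelled outside LLVM. The sketch below is a toy model under stated assumptions: each access is reduced to a hypothetical integer base id plus a constant byte offset (in the pass these come from `GetPointerBaseWithConstantOffset`), a base id of 0 means "no base known", and the threshold is the default of `amdgpu-large-stride-threshold`. `Access` and `countLargeStride` are illustrative names, not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of MemAccessInfo: Base is a hypothetical id for the base pointer
// (0 = unknown), Offset is the constant byte offset from that base.
struct Access {
  int Base;
  int64_t Offset;
};

constexpr uint64_t LargeStrideThresh = 64; // default from the listing

// Counts accesses in one basic block whose offset differs from the previous
// access to the same base by more than the threshold.
unsigned countLargeStride(const std::vector<Access> &Block) {
  Access Last = {0, 0}; // reset at the start of every block, as in visit()
  unsigned Count = 0;
  for (const Access &A : Block) {
    if (A.Base != 0 && Last.Base != 0 && A.Base == Last.Base) {
      uint64_t Diff = A.Offset > Last.Offset ? A.Offset - Last.Offset
                                             : Last.Offset - A.Offset;
      if (Diff > LargeStrideThresh)
        ++Count;
    }
    if (A.Base != 0) // only accesses with a known base become the reference
      Last = A;
  }
  return Count;
}
```

Modelling the six accesses of `test_large_stride` as byte offsets from `%arg` (i32 elements, so index 4096 is byte 16384) gives the sequence 16384, 4, 32768, 8, 49152, 12. The first access has no reference, and each of the remaining five differs from its predecessor by far more than 64 bytes, so this model counts five large-stride accesses.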

Revision Contents

Diff 148626

llvm/trunk/lib/Target/AMDGPU/AMDGPU.h

llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.h

llvm/trunk/lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.h

llvm/trunk/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp

llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h

llvm/trunk/lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp

llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt

llvm/trunk/lib/Target/AMDGPU/GCNSchedStrategy.cpp

llvm/trunk/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll

llvm/trunk/test/CodeGen/AMDGPU/perfhint.ll
