In order to increase the probability of aligned accessing of LDS memory
operations, sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
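As a rough illustration of the sort-then-allocate idea described in the summary (a hedged sketch, not the patch itself; the helper name and the exact ordering criteria are assumptions for illustration):

```cpp
// Hypothetical sketch: sort LDS globals and assign offsets in that order.
// Names and the descending alignment-then-size order are illustrative only.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/Support/Alignment.h"
#include <algorithm>
#include <vector>
using namespace llvm;

static uint64_t allocateLDS(std::vector<GlobalVariable *> &LDSGlobals,
                            const DataLayout &DL) {
  // Place the most strictly aligned (and, within that, the largest) globals
  // first so that later, smaller globals fill the remaining space.
  std::stable_sort(LDSGlobals.begin(), LDSGlobals.end(),
                   [&DL](const GlobalVariable *A, const GlobalVariable *B) {
                     uint64_t AlignA = A->getAlign().valueOrOne().value();
                     uint64_t AlignB = B->getAlign().valueOrOne().value();
                     if (AlignA != AlignB)
                       return AlignA > AlignB;
                     uint64_t SizeA = DL.getTypeAllocSize(A->getValueType());
                     uint64_t SizeB = DL.getTypeAllocSize(B->getValueType());
                     return SizeA > SizeB;
                   });

  uint64_t Offset = 0;
  for (GlobalVariable *GV : LDSGlobals) {
    Offset = alignTo(Offset, GV->getAlign().valueOrOne());
    // In the real backend the offset would be recorded for this global here.
    Offset += DL.getTypeAllocSize(GV->getValueType());
  }
  return Offset; // total LDS size used
}
```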
Event Timeline
As far as I understand, all kernels in the module will use all the LDS globals irrespective of whether they actually need them? We may run out of LDS.
llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp:51
llvm::find() and you do not need to include algorithm then.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:41
I do not think you can keep per-module stuff in the TM.
Why do we need another pass if we always just produce llvm.amdgcn.module.lds such that codegen will always allocate one single global?
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:41
You absolutely cannot do this here
This seems to put all variables into one struct. I was suggesting a struct type and instance per kernel, populated by the subset of variables that can be accessed by the kernel and the functions called by it. Strictly speaking this does do that, but the subset can usually be smaller than the set of all variables in the module.
Right, I think we need a minimal (or as small as we can if not minimal) set per kernel.
As I have mentioned in the email:
(1) Which means, to further improve the pass "LowerModuleLDS" implemented by Jon, we need to run this pass twice: (a) before PromoteAlloca and (b) after PromoteAlloca, and I am not sure if we will face any hurdle doing this.
(2) Carefully implement logic to traverse the call graph to find all the LDS globals reachable from each kernel (a rough sketch of such a traversal appears below, after this list).
(3) Create a separate struct type per kernel based on its reachable LDS set.
(4) Create an instance of the respective struct type within each kernel.
(5) Replace the uses of LDS within non-kernel functions with their respective struct-member counterparts, which would be different for different kernels.
Item (2) is problematic in the presence of the call graph. But we can probably make a conservative attempt and see if it works.
Item (5) is the one I have no idea how to do. @JonChesterfield do you have any idea in mind w.r.t. this?
All in all, it is going to take time.
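As a minimal sketch of item (2), assuming a simple worklist walk over direct calls only; the helper name, the use of address space 3 to identify LDS, and the operand handling are all assumptions for illustration, and indirect calls (or uses buried in constant expressions with offsets) would need a conservative fallback:

```cpp
// Hypothetical sketch: collect LDS globals reachable from a kernel by walking
// its direct callees. Not the actual pass; indirect calls are not handled.
#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/InstrTypes.h"
using namespace llvm;

static void collectReachableLDS(Function &Kernel,
                                DenseSet<GlobalVariable *> &LDSSet) {
  SmallVector<Function *, 8> Worklist{&Kernel};
  DenseSet<Function *> Visited;
  while (!Worklist.empty()) {
    Function *F = Worklist.pop_back_val();
    if (!Visited.insert(F).second || F->isDeclaration())
      continue;
    for (Instruction &I : instructions(*F)) {
      // Record every LDS (address space 3) global referenced by this function.
      for (Value *Op : I.operand_values())
        if (auto *GV = dyn_cast<GlobalVariable>(Op->stripPointerCasts()))
          if (GV->getAddressSpace() == 3)
            LDSSet.insert(GV);
      // Follow direct calls to keep walking the call graph.
      if (auto *CB = dyn_cast<CallBase>(&I))
        if (Function *Callee = CB->getCalledFunction())
          Worklist.push_back(Callee);
    }
  }
}
```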
[AMDGPU] Sort LDS globals based on thier size and alignment.
Typo "their" :-)
In order to increase the probability of aligned accessing of LDS memory
operations,
Are you expecting this to affect instruction selection? Or just to increase the probability that the selected instructions will go fast because they will see an aligned address at run time?
sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.
Instead how about explicitly over-aligning all globals with something like:
```
for each lds global:
    if size > 8:
        // we might want to use a b96 or b128 load
        alignment = max(alignment, 16)
    else if size > 4:
        // we might want to use a b64 load
        alignment = max(alignment, 8)
    else if size > 2:
        // we might want to use a b32 load
        alignment = max(alignment, 4)
    else if size > 1:
        // we might want to use a b16 load
        alignment = max(alignment, 2)
```
(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)
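A minimal C++ sketch of that suggestion, assuming it runs as an IR-level loop over the module's globals and that address space 3 identifies LDS; the function name is illustrative, not an existing API:

```cpp
// Hypothetical sketch of the suggested over-alignment heuristic; not the
// actual patch. Bumps each LDS global's alignment so wider loads stay aligned.
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Alignment.h"
#include <algorithm>
using namespace llvm;

static void overAlignLDSGlobals(Module &M) {
  const DataLayout &DL = M.getDataLayout();
  for (GlobalVariable &GV : M.globals()) {
    if (GV.getAddressSpace() != 3) // 3 is the AMDGPU local (LDS) address space
      continue;
    uint64_t Size = DL.getTypeAllocSize(GV.getValueType());
    Align A = GV.getAlign().valueOrOne();
    if (Size > 8)
      A = std::max(A, Align(16)); // we might want to use a b96 or b128 load
    else if (Size > 4)
      A = std::max(A, Align(8));  // we might want to use a b64 load
    else if (Size > 2)
      A = std::max(A, Align(4));  // we might want to use a b32 load
    else if (Size > 1)
      A = std::max(A, Align(2));  // we might want to use a b16 load
    GV.setAlignment(A);
  }
}
```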
It is to increase the probability that the selected instructions will go fast because they will see an aligned address at run time.
sort LDS globals based on thier size and alignment, and then
allocate memory in that sorted order.
Sorting seems like an odd way to do this, because it *might* happen to over-align a global, but only by chance depending on what other globals get sorted before it.
Lately, I am actually of the same opinion - sorting will definitely help us reduce space overhead due to padding, but may not help with alignment.
Instead how about explicitly over-aligning all globals with something like:
```
for each lds global:
    if size > 8:
        // we might want to use a b96 or b128 load
        alignment = max(alignment, 16)
    else if size > 4:
        // we might want to use a b64 load
        alignment = max(alignment, 8)
    else if size > 2:
        // we might want to use a b32 load
        alignment = max(alignment, 4)
    else if size > 1:
        // we might want to use a b16 load
        alignment = max(alignment, 2)
```
(You could still sort them after doing that, but the only reason would be to try to minimise wasted space due to alignment gaps.)
Looks like a good suggestion to me.
I will abandon this patch. I am working on another patch which is implemented within llc (AMDGPUMachineFunction). Will post that new patch soon.