This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers.
AbandonedPublic

Authored by hsmhsm on Apr 26 2021, 10:25 AM.


Details

Reviewers

JonChesterfield
arsenm
b-sumner
t-tye
rampitec

Summary

The "Lower Module LDS" pass supports the use of LDS globals within non-kernel
functions by lowering them as follows: it packs the LDS globals used within
non-kernel functions into a struct type, and creates an instance of that struct
type within every kernel at address zero.

However, the "Lower Module LDS" pass sometimes wastes LDS memory, depending
on the pattern of LDS global use within the module.

Hence this pass attempts to help the "Lower Module LDS" pass use LDS memory
efficiently. The idea behind this pass is as follows:

* Instead of directly packing LDS globals into the struct as struct members, create global LDS pointers corresponding to those LDS globals.
* Initialize those global LDS pointers with their respective LDS globals.
* Replace all non-kernel function scope uses of the original LDS globals with their respective pointer counterparts.
* The "Lower Module LDS" pass then, by virtue of its implementation, ends up packing only LDS pointers as struct members, which substantially reduces unnecessary LDS memory usage.
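The size argument in the steps above can be illustrated with a small stand-alone model. This is only a sketch: `LDSGlobalModel`, `structBytesDirect` and `structBytesWithPointers` are hypothetical names that are not part of the patch, padding and alignment are ignored, and each pointer is assumed to be 16 bits wide. It compares only the size of the packed struct; the variables themselves still have to be allocated somewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one LDS global (not an LLVM type).
struct LDSGlobalModel {
  std::size_t SizeInBytes;
  bool UsedByNonKernelFunction;
};

// Size of the module struct when the variables themselves are the members.
// Padding and alignment are ignored to keep the arithmetic simple.
std::size_t structBytesDirect(const std::vector<LDSGlobalModel> &Globals) {
  std::size_t Bytes = 0;
  for (const LDSGlobalModel &G : Globals)
    if (G.UsedByNonKernelFunction)
      Bytes += G.SizeInBytes;
  return Bytes;
}

// Size of the module struct when each such variable is represented by a
// 16-bit pointer member instead.
std::size_t structBytesWithPointers(const std::vector<LDSGlobalModel> &Globals) {
  constexpr std::size_t PointerBytes = 2; // assumption: i16 offset per variable
  std::size_t Bytes = 0;
  for (const LDSGlobalModel &G : Globals)
    if (G.UsedByNonKernelFunction)
      Bytes += PointerBytes;
  return Bytes;
}

// Usage example with made-up sizes.
inline void structBytesExample() {
  const std::vector<LDSGlobalModel> Globals = {
      {1024, true}, {4096, true}, {256, false}};
  assert(structBytesDirect(Globals) == 5120);
  assert(structBytesWithPointers(Globals) == 4);
}
```

In this model a variable smaller than the pointer itself would make the struct larger, not smaller, which is one possible motivation for a size threshold.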

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision. Apr 26 2021, 10:25 AM

Herald added subscribers: kerbowa, jfb, hiraditya and 7 others. Apr 26 2021, 10:25 AM

hsmhsm requested review of this revision. Apr 26 2021, 10:25 AM

Herald added a project: Restricted Project. Apr 26 2021, 10:25 AM

Herald added subscribers: llvm-commits, wdng.

ping

Rebase to latest upstream main.

Harbormaster completed remote builds in B102167: Diff 342231. May 2 2021, 5:09 AM

I can't work out from the implementation, or the comments, what the condition is for deciding whether to transform an LDS variable. Exactly which cases is this intended to transform?

Emitting stores in the entry block is probably not sufficient to ensure the pointers have the right value when later read from. The stores will look dead and can be eliminated or moved across the function calls. One of the attractions of implementing LDS initializers in the back end is ensuring that such transforms cannot break the lowering. Alternatively, the lower module pass + backend coupling could be updated so that an object can be used instead of zero, at which point the IR passes will natively understand the connection.

In order to exercise this code at runtime, we need executable code that uses LDS from functions. I think that will be necessary to be confident that the transform proposed here works correctly. Possibly also a microbenchmark to determine which cases this makes faster as well.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
84	This is an optimisation. Instead of a fatal error, it can always leave the variable unchanged. Therefore it should not abort compilation.
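A minimal sketch of the behaviour suggested in this comment, in which an unsupported case is skipped and not treated as fatal. The types and names here (`VariableModel`, `tryTransform`, `runOnVariables`) are hypothetical and model the control flow only; they are not LLVM APIs and not code from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an LDS variable under consideration.
struct VariableModel {
  std::string Name;
  bool AllCalleesDefinedInModule;
  bool Transformed;
};

// Returns true if the variable was transformed, false if it was skipped.
// A skipped variable is left exactly as it was, so compilation continues.
bool tryTransform(VariableModel &V) {
  if (!V.AllCalleesDefinedInModule)
    return false; // cannot reason about this one: leave it unchanged
  V.Transformed = true;
  return true;
}

// Returns whether anything changed, mirroring the usual pass convention.
bool runOnVariables(std::vector<VariableModel> &Vars) {
  bool Changed = false;
  for (VariableModel &V : Vars)
    Changed |= tryTransform(V);
  return Changed;
}

// Usage example: one variable is transformed, the other is skipped.
inline void runOnVariablesExample() {
  std::vector<VariableModel> Vars = {{"a", true, false}, {"b", false, false}};
  assert(runOnVariables(Vars));
  assert(Vars[0].Transformed && !Vars[1].Transformed);
}
```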

In D101310#2743123, @JonChesterfield wrote:

I can't work out from the implementation, or the comments, what the condition is for deciding whether to transform an LDS variable. Exactly which cases is this intended to transform?

I in fact implemented what you advocated for in the abandoned patch, that is:

for each LDS variable:

if should-transform
  create 16 bit integer in LDS
  initialize that global with (constexpr) address of variable
  replace all uses of variable with a (constexpr) access through new pointer

where
should-transform:
if (sizeof) < 8ish return false
if used by instruction in indirectly called function return false
if only used by kernels return false
probably other exclusions
return true
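The should-transform pseudocode above could be written as a plain predicate along the following lines. This is a sketch under stated assumptions: `LDSUseSummary` is a hypothetical summary record, not an LLVM type, the threshold of 8 bytes stands in for "8ish", and the "probably other exclusions" are not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of how one LDS variable is used in the module.
struct LDSUseSummary {
  std::size_t SizeInBytes;
  bool UsedInIndirectlyCalledFunction;
  bool UsedByNonKernelFunction;
};

bool shouldTransform(const LDSUseSummary &V) {
  constexpr std::size_t MinBytes = 8; // assumption: stands in for "8ish"
  if (V.SizeInBytes < MinBytes)
    return false; // too small for a pointer indirection to pay off
  if (V.UsedInIndirectlyCalledFunction)
    return false; // used by an instruction in an indirectly called function
  if (!V.UsedByNonKernelFunction)
    return false; // only used by kernels
  return true;
}

// Usage example: a 64-byte variable used from a directly called function
// qualifies, a 4-byte one does not.
inline void shouldTransformExample() {
  assert(shouldTransform({64, false, true}));
  assert(!shouldTransform({4, false, true}));
}
```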

Emitting stores in the entry block is probably not sufficient to ensure the pointers have the right value when later read from. The stores will look dead and can be eliminated or moved across the function calls. One of the attractions of implementing LDS initializers in the back end is ensuring that such transforms cannot break the lowering. Alternatively, the lower module pass + backend coupling could be updated so that an object can be used instead of zero, at which point the IR passes will natively understand the connection.

In that case, I will abandon this patch and stop spending any time on it. I hope that you will continue to strengthen your patch accordingly.

In order to exercise this code at runtime, we need executable code that uses LDS from functions. I think that will be necessary to be confident that the transform proposed here works correctly. Possibly also a microbenchmark to determine which cases this makes faster as well.

When we still do not have a common agreement on the implementation itself, there is no point in microbenchmarking; and, by the same analogy, your original patch needs it in the first place. So I will stop working on this, on the assumption that you will take care of updating your original patch when you feel that optimal LDS usage is required.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
84	The call to reportReplaceLDSError() is made when the pass cannot proceed further for one reason or another (an internal error situation). And it is perfectly valid for any optimization pass to abandon and report an error when it faces an internal error of this kind.

arsenm added inline comments. May 10 2021, 2:46 PM

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp
270–272	This is a weird place for pass documentation, just put this in the file header. The wording is also weird. Current pass doesn't mean anything in this context
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
12	No quotes around address zero?
17	s/current pass/this pass/
26	Typo substentially
28	Add a pseudocode example?
55–59	Don't see much point in this. If it's worth reporting a specific error code, it's worth adding a DiagnosticInfo type for it (but I don't think it is)
69–72	Shouldn't just have a generic error
74–80	I think you shouldn't error here, and just handle these assuming the external call could access the variable, wherever it may end up
84	This should also go through DiagnosticInfo rather than report_fatal_error
88–91	There's already an is_contained
105	getInstructions is overly generic. How about convertConstExprsToInstructions?
125–126	It feels wrong that you would need to explicitly test this, and this case would be naturally handled by your walk through the constantexpr
135	I'd prefer to avoid recursion to analyze constantexprs
215	Don't see why you need to store this, can just loop through and handle each kernel as it appears
222	ValueMap? You do have value replacements occurring
226	DenseMap?
300–301	This is just what happens on first map access anyway
334–335	cast instead of assert on dyn_cast
363	Should not be using ptrtoint, keep everything as i16 and use GEP to index off the base
436	"Current threshold" part unnecessary
436–438	I don't understand why there would be a size threshold here
442–445	Directly return .empty()?
llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp
137–140	This function shouldn't exist, just use DL.getTypeAllocSize
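As an illustration of the remark on line 135 about avoiding recursion, a nested-operand search can be written with an explicit worklist. The following is a sketch only: `ExprNode` and `containsExpr` are hypothetical stand-ins and do not use LLVM's `ConstantExpr` API.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a constant expression: a node with operands.
struct ExprNode {
  std::vector<const ExprNode *> Operands;
};

// Returns true if `Needle` is `Root` itself or appears anywhere among its
// (possibly nested) operands. Uses an explicit worklist, not recursion.
bool containsExpr(const ExprNode *Root, const ExprNode *Needle) {
  std::vector<const ExprNode *> Worklist{Root};
  std::unordered_set<const ExprNode *> Visited;
  while (!Worklist.empty()) {
    const ExprNode *Cur = Worklist.back();
    Worklist.pop_back();
    if (Cur == Needle)
      return true;
    if (!Visited.insert(Cur).second)
      continue; // shared sub-expression already explored
    for (const ExprNode *Op : Cur->Operands)
      Worklist.push_back(Op);
  }
  return false;
}

// Usage example: Outer -> Inner -> Leaf, with Other unrelated.
inline void containsExprExample() {
  ExprNode Leaf, Other;
  ExprNode Inner{{&Leaf}};
  ExprNode Outer{{&Inner}};
  assert(containsExpr(&Outer, &Leaf));
  assert(!containsExpr(&Outer, &Other));
}
```

The visited set keeps the walk linear when the same sub-expression is shared by several operands.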

We have decided to implement the pointer replacement algorithm within the LowerModuleLDS pass itself instead of as a separate pass. Hence abandoning this patch. A new patch will be uploaded shortly.

Herald added a subscriber: foad. May 25 2021, 11:21 PM

Revision Contents

Path (Size)

llvm/
  lib/Target/AMDGPU/
    AMDGPU.h (10 lines)
    AMDGPULowerModuleLDSPass.cpp (13 lines)
    AMDGPUReplaceLDSUseWithPointer.cpp (598 lines)
    AMDGPUTargetMachine.cpp (14 lines)
    CMakeLists.txt (1 line)
    Utils/
      AMDGPULDSUtils.h (4 lines)
      AMDGPULDSUtils.cpp (18 lines)
  test/CodeGen/AMDGPU/
    lds_replace_by_pointer-call_diamond_shape.ll (74 lines)
    lds_replace_by_pointer-call_miscellaneous.ll (98 lines)
    lds_replace_by_pointer-indirect_call_diamond_shape.ll (78 lines)
    lds_replace_by_pointer-small_lds.ll (28 lines)
    lds_replace_by_pointer-use_both_within_kernel_and_func.ll (34 lines)
    lds_replace_by_pointer-use_only_within_func.ll (32 lines)
    lds_replace_by_pointer-use_only_within_kernel.ll (19 lines)
    lds_replace_by_pointer-use_within_both_global_and_func_scope.ll (36 lines)
    lds_replace_by_pointer-use_within_both_global_and_func_scope2.ll (30 lines)
    lds_replace_by_pointer-use_within_const_expr.ll (72 lines)
    lds_replace_by_pointer-use_within_const_expr2.ll (50 lines)
    lds_replace_by_pointer-use_within_const_expr3.ll (52 lines)
    lds_replace_by_pointer-use_within_not_rechable_func.ll (44 lines)
    promote-alloca-to-lds-constantexpr-use.ll (2 lines)

Diff 342231

llvm/lib/Target/AMDGPU/AMDGPU.h

[...]
 FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetMachine *);
 FunctionPass *createAMDGPUUseNativeCallsPass();
 FunctionPass *createAMDGPUCodeGenPreparePass();
 FunctionPass *createAMDGPULateCodeGenPreparePass();
 FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPUPropagateAttributesEarlyPass(const TargetMachine *);
 ModulePass *createAMDGPUPropagateAttributesLatePass(const TargetMachine *);
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
+ModulePass *createAMDGPUReplaceLDSUseWithPointerPass();
 ModulePass *createAMDGPULowerModuleLDSPass();
 FunctionPass *createSIModeRegisterPass();

 struct AMDGPUSimplifyLibCallsPass : PassInfoMixin<AMDGPUSimplifyLibCallsPass> {
   AMDGPUSimplifyLibCallsPass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

 private:
[...] struct AMDGPUPropagateAttributesLatePass
     : PassInfoMixin<AMDGPUPropagateAttributesLatePass> {
   AMDGPUPropagateAttributesLatePass(TargetMachine &TM) : TM(TM) {}
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);

 private:
   TargetMachine &TM;
 };

+void initializeAMDGPUReplaceLDSUseWithPointerPass(PassRegistry &);
+extern char &AMDGPUReplaceLDSUseWithPointerID;
+
+struct AMDGPUReplaceLDSUseWithPointerPass
+    : PassInfoMixin<AMDGPUReplaceLDSUseWithPointerPass> {
+  AMDGPUReplaceLDSUseWithPointerPass() {}
+  PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
+};
+
 void initializeAMDGPULowerModuleLDSPass(PassRegistry &);
 extern char &AMDGPULowerModuleLDSID;

 struct AMDGPULowerModuleLDSPass : PassInfoMixin<AMDGPULowerModuleLDSPass> {
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 };

 void initializeAMDGPURewriteOutArgumentsPass(PassRegistry &);
[...]

llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp

[...] public:
   }
 };

 } // namespace
 char AMDGPULowerModuleLDS::ID = 0;

 char &llvm::AMDGPULowerModuleLDSID = AMDGPULowerModuleLDS::ID;

-INITIALIZE_PASS(AMDGPULowerModuleLDS, DEBUG_TYPE,
-                "Lower uses of LDS variables from non-kernel functions", false,
-                false)
+INITIALIZE_PASS_BEGIN(AMDGPULowerModuleLDS, DEBUG_TYPE,
+                      "Lower uses of LDS variables from non-kernel functions",
+                      false, false)
+// Before running current LDS lower pass, replace LDS uses within non-kernel
+// functions by pointers so that the current pass minimizes the unnecessary per
+// kernel allocation of LDS memory.
[Inline comment] arsenm: This is a weird place for pass documentation, just put this in the file header. The wording is also weird. Current pass doesn't mean anything in this context
+INITIALIZE_PASS_DEPENDENCY(AMDGPUReplaceLDSUseWithPointer)
+INITIALIZE_PASS_END(AMDGPULowerModuleLDS, DEBUG_TYPE,
+                    "Lower uses of LDS variables from non-kernel functions",
+                    false, false)

 ModulePass *llvm::createAMDGPULowerModuleLDSPass() {
   return new AMDGPULowerModuleLDS();
 }

 PreservedAnalyses AMDGPULowerModuleLDSPass::run(Module &M,
                                                 ModuleAnalysisManager &) {
   return AMDGPULowerModuleLDS().runOnModule(M) ? PreservedAnalyses::none()
                                                : PreservedAnalyses::all();
 }

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

This file was added.

				//===-- AMDGPUReplaceLDSUseWithPointer.cpp --------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// The pass - "Lower Module LDS" supports use of LDS globals within non-kernel
				// functions by lowering LDS globals as follows. It packs within non-kernel used
				// LDS globals into a struct type, and creates an instance of that struct type
				// within every kernel at "address zero".
				[Inline comment] arsenm: No quotes around address zero?
				//
				// However, the pass - "Lower Module LDS" sometime wastes LDS memory depending
				// on the pattern of LDS globals use within the module.
				//
				// Hence the current pass makes an attempt to aid the pass - "Lower Module LDS"
				[Inline comment] arsenm: s/current pass/this pass/
				// for efficient LDS memory usage. The idea behind current pass is as follows:
				//
				// * Instead of directly packing LDS globals into the struct as struct members,
				// create global LDS pointers corresponding to those LDS globals.
				// * Initialize those global LDS pointers with their respective LDS globals.
				// * Replace all the non-kernel function scope use of those original LDS globals
				// by their respective pointer counter-parts.
				// * Then the pass "Lower Module LDS" by the virtue of its implementation idea,
				// lands-up packing only LDS pointers as struct members, which substentially
				[Inline comment] arsenm: Typo substentially
				// reduces unnecessary LDS memory usage.
				//
				[Inline comment] arsenm: Add a pseudocode example?
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "Utils/AMDGPUBaseInfo.h"
				#include "Utils/AMDGPULDSUtils.h"
				#include "llvm/ADT/SCCIterator.h"
				#include "llvm/ADT/SmallVector.h"
				#include "llvm/Analysis/CallGraph.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/MDBuilder.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/ValueMap.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Transforms/Utils/Cloning.h"
				#include <algorithm>
				#include <map>
				#include <set>

				#define DEBUG_TYPE "amdgpu-replace-lds-use-with-pointer"

				using namespace llvm;

				namespace {

				// Error kinds for handling the errors within the context of current pass.
				enum ReplaceLDSErrorKind : uint32_t {
				LLEK_EndOfList = 0u,
				LLEK_InternalError = 2u,
				LLEK_NoCalleeDefinitionError = 3u
				};
				[Inline comment] arsenm: Don't see much point in this. If it's worth reporting a specific error code, it's worth adding a DiagnosticInfo type for it (but I don't think it is)

				} // namespace

				// Report error within the context of current pass based on the error kind.
				static void reportReplaceLDSError(ReplaceLDSErrorKind EK, Value *V = nullptr) {
				std::string ErrStr("The pass \"Replace LDS Use With Pointer\" ");

				switch (EK) {
				default:
				case LLEK_InternalError: {
				ErrStr = ErrStr + std::string("has encountered an internal error.");
				break;
				}
				[Inline comment] arsenm: Shouldn't just have a generic error
				case LLEK_NoCalleeDefinitionError: {
				ErrStr =
				ErrStr +
				std::string("assumes that the definitions of both caller and callee "
				"appear within same module. But, definition for the "
				"callee \"") +
				V->getName().str() + std::string("\" not available.");
				break;
				[Inline comment] arsenm: I think you shouldn't error here, and just handle these assuming the external call could access the variable, wherever it may end up
				}
				}

				report_fatal_error(ErrStr);
				[Inline comment] JonChesterfield: This is an optimisation. Instead of a fatal error, it can always leave the variable unchanged. Therefore it should not abort compilation.
				[Inline comment] hsmhsm (author): The call to reportReplaceLDSError() is made when the pass cannot proceed further for one reason or another (an internal error situation). And it is perfectly valid for any optimization pass to abandon and report an error when it faces an internal error of this kind.
				[Inline comment] arsenm: This should also go through DiagnosticInfo rather than report_fatal_error
				}

				// Helper function around `ValueMap` to detect if an element exists within it.
				template <typename R, typename E>
				static bool contains(R &&VMap, const E &Element) {
				return VMap.find(Element) != VMap.end();
				}
				[Inline comment] arsenm: There's already an is_contained

				// Within User `U` replace the use(s) of `OldValue` by `NewValue`.
				static void updateUserOperand(User *U, Value *OldValue, Value *NewValue) {
				unsigned Ind = 0;
				for (Use &UU : U->operands()) {
				if (UU.get() == OldValue)
				U->setOperand(Ind, NewValue);
				++Ind;
				}
				}

				// The instruction `I` contains const expressions(possibly nested) as its
				// operands, convert those const expressions into corresponding instructions.
				static void getInstructions(Instruction *I, std::set<Value *> &Operands,
				[Inline comment] arsenm: getInstructions is overly generic. How about convertConstExprsToInstructions?
				std::set<Instruction *> &Insts) {
				for (auto *V : Operands) {
				auto *CE = dyn_cast<ConstantExpr>(V);
				if (!CE)
				continue;

				auto *NI = CE->getAsInstruction();
				NI->insertBefore(I);
				updateUserOperand(I, CE, NI);
				CE->removeDeadConstantUsers();
				Insts.insert(NI);

				std::set<Value *> Operands2;
				for (Use &UU : CE->operands())
				Operands2.insert(UU.get());
				getInstructions(NI, Operands2, Insts);
				}
				}

				// Check if const expression `CE2` holds const expression `CE`, and return true
				// if `CE` exists within `CE2`.
				[Inline comment] arsenm: It feels wrong that you would need to explicitly test this, and this case would be naturally handled by your walk through the constantexpr
				static bool isCEExist(ConstantExpr *CE, ConstantExpr *CE2) {
				if (CE == CE2)
				return true;

				bool CEExist = false;

				for (Use &UU : CE2->operands()) {
				if (auto *CE3 = dyn_cast<ConstantExpr>(UU.get()))
				CEExist = CEExist | isCEExist(CE, CE3);
				[Inline comment] arsenm: I'd prefer to avoid recursion to analyze constantexprs
				}

				return CEExist;
				}

				// Collect all const expression operands of `I` which use `CE`.
				static std::set<Value *> getCEOperands(Instruction *I, ConstantExpr *CE) {
				std::set<Value *> CEOperands;

				for (Use &UU : I->operands()) {
				auto *CE2 = dyn_cast<ConstantExpr>(UU.get());
				if (CE2 && isCEExist(CE, CE2))
				CEOperands.insert(UU.get());
				}

				return CEOperands;
				}

				// Collect all those non-kernel functions and all those instructions within
				// which `U` exist.
				static std::map<Function *, std::set<Instruction *>>
				getFunctionToInstsMap(User *U) {
				std::map<Function *, std::set<Instruction *>> FunctionToInsts;
				SmallVector<User *, 8> UserStack;
				SmallPtrSet<User *, 8> VisitedUsers;

				UserStack.push_back(U);

				while (!UserStack.empty()) {
				auto *UU = UserStack.pop_back_val();

				if (!VisitedUsers.insert(UU).second)
				continue;

				if (isa<GlobalVariable>(UU))
				continue;

				if (isa<Constant>(UU)) {
				append_range(UserStack, UU->users());
				continue;
				}

				if (auto *I = dyn_cast<Instruction>(UU)) {
				auto *F = I->getFunction();
				if (AMDGPU::isKernelCC(F))
				continue;
				if (!contains(FunctionToInsts, F))
				FunctionToInsts[F] = std::set<Instruction *>();
				FunctionToInsts[F].insert(I);
				}
				}

				return FunctionToInsts;
				}

				// Collect all call graph nodes which are reachable from the node `CGN`.
				static std::set<CallGraphNode *>
				collectReachableCallGraphNodes(CallGraphNode *CGN) {
				std::set<CallGraphNode *> ReachableCGNodes;

				for (scc_iterator<CallGraphNode *> I = scc_begin(CGN); !I.isAtEnd(); ++I) {
				const std::vector<CallGraphNode *> &SCC = *I;
				assert(!SCC.empty() && "SCC with no functions?");
				for (auto *CGNode : SCC)
				ReachableCGNodes.insert(CGNode);
				}

				return ReachableCGNodes;
				}

				namespace {

				class ReplaceLDSUseImpl {
				Module &M;
				LLVMContext &Ctx;
				const DataLayout &DL;
				Constant *LDSMemBaseAddr;

				// Holds all kernels defined within the module `M`.
				std::vector<Function *> Kernels;
				[Inline comment] arsenm: Don't see why you need to store this, can just loop through and handle each kernel as it appears

				// Holds all those LDS globals defined within the module `M` which require
				// pointer replacement.
				std::vector<GlobalVariable *> LDSGlobals;

				// Associates LDS global to a unique pointer which points to that LDS global.
				std::map<GlobalVariable *, GlobalVariable *> LDSToPointer;
				[Inline comment] arsenm: ValueMap? You do have value replacements occurring

				// Associates kernel K to LDS pointers which are initialized to point to
				// corresponding LDS globals within K.
				std::map<Function *, std::set<GlobalVariable *>> KernelToLDSPointers;
				[Inline comment] arsenm: DenseMap?

				// Associates non-kernel function to an LDS global to LDS global replacement
				// instruction.
				std::map<Function *, std::map<GlobalVariable *, Value *>> FunctionToLDSToInst;

				public:
				explicit ReplaceLDSUseImpl(Module &M)
				: M(M), Ctx(M.getContext()), DL(M.getDataLayout()) {
				// FIXME: At present, we are assuming that LDS memory virtual base address
				// starts from 0, though it is true, we should not make such assumptions at
				// IR level.
				LDSMemBaseAddr = Constant::getIntegerValue(
				PointerType::get(Type::getInt8Ty(M.getContext()),
				AMDGPUAS::LOCAL_ADDRESS),
				APInt(32, 0));
				}

				// Entry-point function.
				bool replace();

				private:
				// Create a set of replacement instructions which together replace `LDS`
				// within `F` by accessing `LDS` indirectly using `LDSPointer`.
				Value *getReplacementInst(Function *F, GlobalVariable *LDS,
				GlobalVariable *LDSPointer);

				// Replace all the uses of LDS global `LDS` with the associated pointer
				// `LDSPointer`.
				void replaceUsesOfLDSGlobalByPointer(GlobalVariable *LDS,
				GlobalVariable *LDSPointer);

				// Initialize `LDSPointer` to point to `LDS` within kernel `K`.
				void initializeLDSPointer(Function *K, GlobalVariable *LDS,
				GlobalVariable *LDSPointer);

				// Insert new global LDS pointer which points to `LDS`.
				GlobalVariable *createLDSPointer(GlobalVariable *LDS);

				// For the lds global `LDS`, recursively visit its user list and find all
				// those non-kernel functions within which the `LDS` is being accessed.
				std::set<Function *> collectNonKernelAccessorsOfLDS(GlobalVariable *LDS);

				// Check if the pointer replacement of `LDS` is not required irrespective of
				// if it is used within non-kernel function or not.
				bool ignoreLDS(GlobalVariable *LDS, std::set<Function *> &LDSAccessors);

				// Traverse `CallGraph` starting from the `CallGraphNode` associated with each
				// kernel `K` and collect all callees which are reachable from K.
				std::map<Function *, std::set<Function *>> collectReachableCallees();
				};

				// Create a set of replacement instructions which together replace `LDS` within
				// `F` by accessing `LDS` indirectly using `LDSPointer`.
				Value *ReplaceLDSUseImpl::getReplacementInst(Function *F, GlobalVariable *LDS,
				GlobalVariable *LDSPointer) {
				// The instruction which replaces `LDS` within `F` already created.
				if (contains(FunctionToLDSToInst, F) && contains(FunctionToLDSToInst[F], LDS))
				return FunctionToLDSToInst[F][LDS];

				// Get the instruction insertion point within the beginning of the entry
				// block of current non-kernel function.
				auto *EI = &(*(F->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);

				// Insert required set of instructions which replace `LDS` within `F`.
				auto *V = Builder.CreateBitCast(
				Builder.CreateGEP(
				LDSMemBaseAddr,
				Builder.CreateLoad(LDSPointer->getValueType(), LDSPointer)),
				LDS->getType());

				// Mark that the replacement instruction, which replace `LDS` within `F` is
				// created.
				if (!contains(FunctionToLDSToInst, F))
				FunctionToLDSToInst[F] = std::map<GlobalVariable *, Value *>();
				[Inline comment] arsenm: This is just what happens on first map access anyway
				FunctionToLDSToInst[F][LDS] = V;

				return V;
				}

				// Replace all the uses of LDS global `LDS` with the associated pointer
				// `LDSPointer`.
				void ReplaceLDSUseImpl::replaceUsesOfLDSGlobalByPointer(
				GlobalVariable *LDS, GlobalVariable *LDSPointer) {
				SmallVector<User *, 16> LDSUsers(LDS->users());
				for (auto *U : LDSUsers) {
				// When `U` is a const expression, it is possible that same const expression
				// exists within multiple instructions, and within multiple non-kernel
				// functions. Collect all those non-kernel functions and all those
				// instructions within which `U` exist.
				auto FunctionToInsts = getFunctionToInstsMap(U);

				for (auto FI = FunctionToInsts.begin(), FE = FunctionToInsts.end();
				FI != FE; ++FI) {
				for (auto *I : FI->second) {
				// If `U` is a const expression, then we need to break the associated
				// instruction into a set of separate instructions by converting const
				// expressions into instructions.
				std::set<Instruction *> Insts;

				if (I == U) {
				// `U` is an instruction, conversion from const expressions to
				// instructions is not required.
				Insts.insert(I);
				} else {
				// `U` is a const expression, convert all associated const expressions
				// (including U) to corresponding instructions.
				auto *CE = dyn_cast<ConstantExpr>(U);
				assert(CE && "Expected constant expression.");
				[Inline comment] arsenm: cast instead of assert on dyn_cast
				auto CEOperands = getCEOperands(I, CE);
				getInstructions(I, CEOperands, Insts);
				}

				// Go through all the instructions, if `LDS` exists within them as an
				// operand, then replace it by `V`.
				for (auto *II : Insts) {
				auto *V = getReplacementInst(FI->first, LDS, LDSPointer);
				updateUserOperand(II, LDS, V);
				}
				}
				}
				}
				}

				// Initialize `LDSPointer` to point to `LDS` within kernel `K`.
				void ReplaceLDSUseImpl::initializeLDSPointer(Function *K, GlobalVariable *LDS,
				GlobalVariable *LDSPointer) {
				// `LDSPointer` is already initialized within `K`.
				if (contains(KernelToLDSPointers, K) &&
				contains(KernelToLDSPointers[K], LDSPointer))
				return;

				// Insert instructions at `EI` which initialize `LDSPointer` to point-to `LDS`
				// within `K`.
				auto *EI = &(*(K->getEntryBlock().getFirstInsertionPt()));
				IRBuilder<> Builder(EI);
				Builder.CreateStore(Builder.CreatePtrToInt(LDS, Type::getInt16Ty(Ctx)),
				[Inline comment] arsenm: Should not be using ptrtoint, keep everything as i16 and use GEP to index off the base
				LDSPointer);

				// Mark that `LDSPointer` is initialized within `K`.
				if (!contains(KernelToLDSPointers, K))
				KernelToLDSPointers[K] = std::set<GlobalVariable *>();
				KernelToLDSPointers[K].insert(LDSPointer);
				}

				// Insert new global LDS pointer which points to `LDS`.
				GlobalVariable *ReplaceLDSUseImpl::createLDSPointer(GlobalVariable *LDS) {
				// LDS pointer which points to `LDS` is already created.
				if (contains(LDSToPointer, LDS))
				return LDSToPointer[LDS];

				// Create new LDS pointer which points to `LDS`.
				auto *I16Ty = Type::getInt16Ty(Ctx);
				GlobalVariable *LDSPointer = new GlobalVariable(
				M, I16Ty, false, GlobalValue::InternalLinkage, UndefValue::get(I16Ty),
				LDS->getName() + Twine(".offset"), nullptr,
				GlobalVariable::NotThreadLocal, AMDGPUAS::LOCAL_ADDRESS);
				LDSPointer->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
				LDSPointer->setAlignment(AMDGPU::getAlign(M.getDataLayout(), LDSPointer));

				// Mark that an associated LDS pointer is created for `LDS`.
				LDSToPointer[LDS] = LDSPointer;

				return LDSPointer;
				}

				// For the lds global `LDS`, recursively visit its user list and find all those
				// non-kernel functions within which the `LDS` is being accessed.
				std::set<Function *>
				ReplaceLDSUseImpl::collectNonKernelAccessorsOfLDS(GlobalVariable *LDS) {
				std::set<Function *> LDSAccessors;
				std::set<User *> VisitedUsers;
				SmallVector<User *, 16> UserStack(LDS->users());

				while (!UserStack.empty()) {
				auto *U = UserStack.pop_back_val();

				// `U` is already visited? continue to next one.
				if (!VisitedUsers.insert(U).second)
				continue;

				// `U` is a global variable which is initialized with `LDS`. Ignore LDS.
				if (isa<GlobalVariable>(U))
				return std::set<Function *>();

				// `U` is `Constant`. Push-back users of `U`, and continue further
				// exploring the stack until an `Instruction` is found.
				if (isa<Constant>(U)) {
				append_range(UserStack, U->users());
				continue;
				}

				// `U` should be an instruction, if it belongs to a non-kernel function F,
				// then collect F.
				if (auto *I = dyn_cast<Instruction>(U)) {
				auto *F = I->getFunction();
				if (!AMDGPU::isKernelCC(F))
				LDSAccessors.insert(F);
				} else
				reportReplaceLDSError(LLEK_InternalError);
				}

				return LDSAccessors;
				}

				// Return true if `LDS` should be skipped for pointer replacement, regardless
				// of whether it is used within non-kernel functions.
				bool ReplaceLDSUseImpl::ignoreLDS(GlobalVariable *LDS,
				std::set<Function *> &LDSAccessors) {
				// Ignore `LDS` if its size is too small. Current threshold is 8 bytes.
				arsenm: "Current threshold" part unnecessary
				if (AMDGPU::getLDSGlobalSizeInBytes(M, LDS) <= 8)
				return true;
				arsenm: I don't understand why there would be a size threshold here

				// There are no non-kernel functions which access `LDS` OR LDS is used
				// within global scope in addition to non-kernel function scope. Ignore `LDS`.
				if (LDSAccessors.empty())
				return true;

				return false;
				arsenm: Directly return .empty()?
				}

				// Traverse `CallGraph` starting from the `CallGraphNode` associated with each
				// kernel `K` and collect all the callees which are reachable from K (including
				// indirectly called callees).
				std::map<Function *, std::set<Function *>>
				ReplaceLDSUseImpl::collectReachableCallees() {
				// Associates kernel to a list of non-kernel functions which are reachable
				// from that kernel.
				std::map<Function *, std::set<Function *>> KernelToCallees;

				// Create the call graph `CG` of the module `M`, collect all the address taken
				// functions, and explore `CG` to collect all the reachable callees (including
				// indirectly called callees) from all kernels.
				CallGraph CG = CallGraph(M);

				for (auto *K : Kernels) {
				// Get `CallGraphNode` representing kernel `K`.
				auto *KernCGNode = CG[K];

				// Collect all call graph nodes which are reachable from `KernCGNode`.
				std::set<CallGraphNode *> ReachableCGNodes =
				collectReachableCallGraphNodes(KernCGNode);

				// Remove `CallGraphNode` representing kernel `K` from reachable node set.
				ReachableCGNodes.erase(KernCGNode);

				// Collect all reachable callees from K.
				std::set<Function *> ReachableCallees;
				for (auto *CGNode : ReachableCGNodes) {
				if (auto *Callee = CGNode->getFunction())
				ReachableCallees.insert(Callee);
				}

				KernelToCallees[K] = ReachableCallees;
				}

				return KernelToCallees;
				}

				// Entry-point function.
				bool ReplaceLDSUseImpl::replace() {
				// Track whether this pass updates the module.
				bool Changed = false;

				// If there are no kernels defined within the module, or if there are no
				// LDS globals which actually require pointer replacement, then nothing to do.
				Kernels = AMDGPU::collectKernels(M);
				LDSGlobals = AMDGPU::findVariablesToLower(M, AMDGPU::getUsedList(M));
				if (Kernels.empty() || LDSGlobals.empty())
				return false;

				// Traverse `CallGraph` starting from the `CallGraphNode` associated with each
				// kernel `K` and collect all callees which are reachable from K.
				std::map<Function *, std::set<Function *>> KernelToCallees =
				collectReachableCallees();

				// If there are no non-kernel functions which are reachable from any of the
				// kernels, then nothing to do.
				if (KernelToCallees.empty())
				return false;

				// For each collected LDS global, if required, create an associated global LDS
				// pointer, initialize it within all relevant kernels, and finally replace all
				// uses of original LDS globals by their pointer counterparts.
				for (auto *LDS : LDSGlobals) {
				// For the lds global `LDS`, recursively visit its user list and find all
				// those non-kernel functions within which the `LDS` is being accessed.
				std::set<Function *> LDSAccessors = collectNonKernelAccessorsOfLDS(LDS);

				// Skip `LDS` if pointer replacement is not required for it, regardless of
				// whether it is used within non-kernel functions.
				if (ignoreLDS(LDS, LDSAccessors))
				continue;

				// The global LDS pointer which points to `LDS` and replaces all the uses of
				// `LDS`.
				GlobalVariable *LDSPointer = nullptr;

				// Traverse through each kernel `K`, check and if required, initialize the
				// `LDSPointer` to point to `LDS` within `K`.
				for (auto KI = KernelToCallees.begin(), KE = KernelToCallees.end();
				KI != KE; ++KI) {
				Function *K = KI->first;
				std::set<Function *> ReachableCallees = KI->second;

				std::set<Function *> ReachableAndLDSUsedCallees;
				std::set_intersection(LDSAccessors.begin(), LDSAccessors.end(),
				ReachableCallees.begin(), ReachableCallees.end(),
				std::inserter(ReachableAndLDSUsedCallees,
				ReachableAndLDSUsedCallees.begin()));

				// None of the LDS accessing non-kernel functions are reachable from
				// kernel `K`. Hence, no need to initialize `LDSPointer` within `K`.
				if (ReachableAndLDSUsedCallees.empty())
				continue;

				// Create a new global LDS pointer which points to `LDS`, unless one was
				// already created for an earlier kernel.
				LDSPointer = createLDSPointer(LDS);

				// Initialize `LDSPointer` to point to `LDS` within kernel `K`.
				initializeLDSPointer(K, LDS, LDSPointer);
				}

				// Replace all the uses of LDS global `LDS` with the associated pointer
				// `LDSPointer`.
				if (LDSPointer) {
				replaceUsesOfLDSGlobalByPointer(LDS, LDSPointer);
				Changed = true;
				}
				}

				return Changed;
				}

				class AMDGPUReplaceLDSUseWithPointer : public ModulePass {
				public:
				static char ID;

				AMDGPUReplaceLDSUseWithPointer() : ModulePass(ID) {
				initializeAMDGPUReplaceLDSUseWithPointerPass(
				*PassRegistry::getPassRegistry());
				}

				bool runOnModule(Module &M) override;
				};

				} // namespace

				char AMDGPUReplaceLDSUseWithPointer::ID = 0;
				char &llvm::AMDGPUReplaceLDSUseWithPointerID =
				AMDGPUReplaceLDSUseWithPointer::ID;

				INITIALIZE_PASS(AMDGPUReplaceLDSUseWithPointer, DEBUG_TYPE,
				"Replace within non-kernel function use of LDS with pointer",
				false /*only look at the cfg*/, false /*analysis pass*/)

				bool AMDGPUReplaceLDSUseWithPointer::runOnModule(Module &M) {
				ReplaceLDSUseImpl LDSReplacer{M};
				return LDSReplacer.replace();
				}

				ModulePass *llvm::createAMDGPUReplaceLDSUseWithPointerPass() {
				return new AMDGPUReplaceLDSUseWithPointer();
				}

				PreservedAnalyses
				AMDGPUReplaceLDSUseWithPointerPass::run(Module &M, ModuleAnalysisManager &AM) {
				ReplaceLDSUseImpl LDSReplacer{M};
				LDSReplacer.replace();
				return PreservedAnalyses::all();
				}
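
The condition applied in `replace()` above can be hard to see through the LLVM plumbing. The following is a minimal standalone sketch of it, not the patch's code: function names (strings) stand in for `Function *`, a hand-written adjacency map stands in for `CallGraph`, and the helper names `reachableFrom` and `kernelsNeedingInit` are hypothetical. Under those assumptions, a kernel needs the pointer initialized when at least one function reachable from it through direct calls accesses the LDS global.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins: a function is a name, the call graph is an
// adjacency map from caller to its directly called callees.
using Fn = std::string;
using Graph = std::map<Fn, std::set<Fn>>;

// Collect every function reachable from `Root` through direct calls.
// `Root` itself is excluded, mirroring the erase of the kernel node.
std::set<Fn> reachableFrom(const Graph &CG, const Fn &Root) {
  std::set<Fn> Seen;
  std::vector<Fn> Stack{Root};
  while (!Stack.empty()) {
    Fn F = Stack.back();
    Stack.pop_back();
    auto It = CG.find(F);
    if (It == CG.end())
      continue; // `F` calls nothing.
    for (const Fn &Callee : It->second)
      if (Seen.insert(Callee).second)
        Stack.push_back(Callee);
  }
  Seen.erase(Root);
  return Seen;
}

// Return the kernels from which at least one LDS-accessing non-kernel
// function is reachable. Only these would initialize the pointer.
std::set<Fn> kernelsNeedingInit(const Graph &CG, const std::set<Fn> &Kernels,
                                const std::set<Fn> &Accessors) {
  std::set<Fn> Result;
  for (const Fn &K : Kernels) {
    std::set<Fn> Reach = reachableFrom(CG, K);
    std::set<Fn> Both;
    std::set_intersection(Accessors.begin(), Accessors.end(), Reach.begin(),
                          Reach.end(), std::inserter(Both, Both.begin()));
    if (!Both.empty())
      Result.insert(K);
  }
  return Result;
}
```

With a diamond-shaped graph such as kernel -> f1 -> {f2, f3} -> accessor, only the kernel at the top of the diamond is returned; a kernel that calls nothing is left out. Indirect calls are not modelled here.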

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Show First 20 Lines • Show All 187 Lines • ▼ Show 20 Lines	static cl::opt<bool> EnableScalarIRPasses(
cl::init(true),		cl::init(true),
cl::Hidden);		cl::Hidden);

static cl::opt<bool> EnableStructurizerWorkarounds(		static cl::opt<bool> EnableStructurizerWorkarounds(
"amdgpu-enable-structurizer-workarounds",		"amdgpu-enable-structurizer-workarounds",
cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),		cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),
cl::Hidden);		cl::Hidden);

		static cl::opt<bool> EnableLDSReplaceWithPointer(
		"amdgpu-enable-lds-replace-with-pointer",
		cl::desc("Enable LDS replace with pointer pass"), cl::init(true),
		cl::Hidden);

static cl::opt<bool, true> EnableLowerModuleLDS(		static cl::opt<bool, true> EnableLowerModuleLDS(
"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),		"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),
cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),		cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),
cl::Hidden);		cl::Hidden);

extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {		extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
// Register the target		// Register the target
RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());		RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
Show All 31 Lines	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeAMDGPUPreLegalizerCombinerPass(*PR);		initializeAMDGPUPreLegalizerCombinerPass(*PR);
initializeAMDGPURegBankCombinerPass(*PR);		initializeAMDGPURegBankCombinerPass(*PR);
initializeAMDGPUPromoteAllocaPass(*PR);		initializeAMDGPUPromoteAllocaPass(*PR);
initializeAMDGPUPromoteAllocaToVectorPass(*PR);		initializeAMDGPUPromoteAllocaToVectorPass(*PR);
initializeAMDGPUCodeGenPreparePass(*PR);		initializeAMDGPUCodeGenPreparePass(*PR);
initializeAMDGPULateCodeGenPreparePass(*PR);		initializeAMDGPULateCodeGenPreparePass(*PR);
initializeAMDGPUPropagateAttributesEarlyPass(*PR);		initializeAMDGPUPropagateAttributesEarlyPass(*PR);
initializeAMDGPUPropagateAttributesLatePass(*PR);		initializeAMDGPUPropagateAttributesLatePass(*PR);
		initializeAMDGPUReplaceLDSUseWithPointerPass(*PR);
initializeAMDGPULowerModuleLDSPass(*PR);		initializeAMDGPULowerModuleLDSPass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);		initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);		initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowPass(*PR);		initializeSIAnnotateControlFlowPass(*PR);
initializeSIInsertHardClausesPass(*PR);		initializeSIInsertHardClausesPass(*PR);
initializeSIInsertWaitcntsPass(*PR);		initializeSIInsertWaitcntsPass(*PR);
initializeSIModeRegisterPass(*PR);		initializeSIModeRegisterPass(*PR);
initializeSIWholeQuadModePass(*PR);		initializeSIWholeQuadModePass(*PR);
▲ Show 20 Lines • Show All 250 Lines • ▼ Show 20 Lines	PB.registerPipelineParsingCallback(
if (PassName == "amdgpu-printf-runtime-binding") {		if (PassName == "amdgpu-printf-runtime-binding") {
PM.addPass(AMDGPUPrintfRuntimeBindingPass());		PM.addPass(AMDGPUPrintfRuntimeBindingPass());
return true;		return true;
}		}
if (PassName == "amdgpu-always-inline") {		if (PassName == "amdgpu-always-inline") {
PM.addPass(AMDGPUAlwaysInlinePass());		PM.addPass(AMDGPUAlwaysInlinePass());
return true;		return true;
}		}
		if (PassName == "amdgpu-replace-lds-use-with-pointer") {
		PM.addPass(AMDGPUReplaceLDSUseWithPointerPass());
		return true;
		}
if (PassName == "amdgpu-lower-module-lds") {		if (PassName == "amdgpu-lower-module-lds") {
PM.addPass(AMDGPULowerModuleLDSPass());		PM.addPass(AMDGPULowerModuleLDSPass());
return true;		return true;
}		}
return false;		return false;
});		});
PB.registerPipelineParsingCallback(		PB.registerPipelineParsingCallback(
[this](StringRef PassName, FunctionPassManager &PM,		[this](StringRef PassName, FunctionPassManager &PM,
▲ Show 20 Lines • Show All 368 Lines • ▼ Show 20 Lines	void AMDGPUPassConfig::addIRPasses() {

// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.		// Handle uses of OpenCL image2d_t, image3d_t and sampler_t arguments.
if (TM.getTargetTriple().getArch() == Triple::r600)		if (TM.getTargetTriple().getArch() == Triple::r600)
addPass(createR600OpenCLImageTypeLoweringPass());		addPass(createR600OpenCLImageTypeLoweringPass());

// Replace OpenCL enqueued block function pointers with global variables.		// Replace OpenCL enqueued block function pointers with global variables.
addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());		addPass(createAMDGPUOpenCLEnqueuedBlockLoweringPass());

		// This pass needs to run before the "amdgpu-lower-module-lds" pass.
		if (EnableLDSReplaceWithPointer)
		addPass(createAMDGPUReplaceLDSUseWithPointerPass());

// Can increase LDS used by kernel so runs before PromoteAlloca		// Can increase LDS used by kernel so runs before PromoteAlloca
if (EnableLowerModuleLDS)		if (EnableLowerModuleLDS)
addPass(createAMDGPULowerModuleLDSPass());		addPass(createAMDGPULowerModuleLDSPass());

if (TM.getOptLevel() > CodeGenOpt::None) {		if (TM.getOptLevel() > CodeGenOpt::None) {
addPass(createInferAddressSpacesPass());		addPass(createInferAddressSpacesPass());
addPass(createAMDGPUPromoteAlloca());		addPass(createAMDGPUPromoteAlloca());

▲ Show 20 Lines • Show All 485 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/CMakeLists.txt

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	add_llvm_target(AMDGPUCodeGen
AMDGPUMIRFormatter.cpp		AMDGPUMIRFormatter.cpp
AMDGPUOpenCLEnqueuedBlockLowering.cpp		AMDGPUOpenCLEnqueuedBlockLowering.cpp
AMDGPUPostLegalizerCombiner.cpp		AMDGPUPostLegalizerCombiner.cpp
AMDGPUPreLegalizerCombiner.cpp		AMDGPUPreLegalizerCombiner.cpp
AMDGPUPromoteAlloca.cpp		AMDGPUPromoteAlloca.cpp
AMDGPUPropagateAttributes.cpp		AMDGPUPropagateAttributes.cpp
AMDGPURegBankCombiner.cpp		AMDGPURegBankCombiner.cpp
AMDGPURegisterBankInfo.cpp		AMDGPURegisterBankInfo.cpp
		AMDGPUReplaceLDSUseWithPointer.cpp
AMDGPURewriteOutArguments.cpp		AMDGPURewriteOutArguments.cpp
AMDGPUSubtarget.cpp		AMDGPUSubtarget.cpp
AMDGPUTargetMachine.cpp		AMDGPUTargetMachine.cpp
AMDGPUTargetObjectFile.cpp		AMDGPUTargetObjectFile.cpp
AMDGPUTargetTransformInfo.cpp		AMDGPUTargetTransformInfo.cpp
AMDGPUUnifyDivergentExitNodes.cpp		AMDGPUUnifyDivergentExitNodes.cpp
AMDGPUUnifyMetadata.cpp		AMDGPUUnifyMetadata.cpp
AMDGPUPerfHintAnalysis.cpp		AMDGPUPerfHintAnalysis.cpp
▲ Show 20 Lines • Show All 84 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.h

	Show All 23 Lines
	Align getAlign(DataLayout const &DL, const GlobalVariable *GV);			Align getAlign(DataLayout const &DL, const GlobalVariable *GV);

	bool userRequiresLowering(const SmallPtrSetImpl<GlobalValue *> &UsedList,			bool userRequiresLowering(const SmallPtrSetImpl<GlobalValue *> &UsedList,
	User *InitialUser);			User *InitialUser);

	std::vector<GlobalVariable *>			std::vector<GlobalVariable *>
	findVariablesToLower(Module &M, const SmallPtrSetImpl<GlobalValue *> &UsedList);			findVariablesToLower(Module &M, const SmallPtrSetImpl<GlobalValue *> &UsedList);

				std::vector<Function *> collectKernels(Module &M);

	SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M);			SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M);

				unsigned getLDSGlobalSizeInBytes(Module &M, const GlobalVariable *LDS);

	} // end namespace AMDGPU			} // end namespace AMDGPU

	} // end namespace llvm			} // end namespace llvm

	#endif // LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H			#endif // LLVM_LIB_TARGET_AMDGPU_UTILS_AMDGPULDSUTILS_H

llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp

Show First 20 Lines • Show All 102 Lines • ▼ Show 20 Lines	if (std::none_of(GV.user_begin(), GV.user_end(), [&](User *U) {
})) {		})) {
continue;		continue;
}		}
LocalVars.push_back(&GV);		LocalVars.push_back(&GV);
}		}
return LocalVars;		return LocalVars;
}		}

		std::vector<Function *> collectKernels(Module &M) {
		std::vector<Function *> Kernels;
		for (auto &F : M.functions()) {
		// Collect `F` if it is a definition of an entry point function.
		if (!F.isDeclaration() && AMDGPU::isKernelCC(&F))
		Kernels.push_back(&F);
		}

		return Kernels;
		}

SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M) {		SmallPtrSet<GlobalValue *, 32> getUsedList(Module &M) {
SmallPtrSet<GlobalValue *, 32> UsedList;		SmallPtrSet<GlobalValue *, 32> UsedList;

SmallVector<GlobalValue *, 32> TmpVec;		SmallVector<GlobalValue *, 32> TmpVec;
collectUsedGlobalVariables(M, TmpVec, true);		collectUsedGlobalVariables(M, TmpVec, true);
UsedList.insert(TmpVec.begin(), TmpVec.end());		UsedList.insert(TmpVec.begin(), TmpVec.end());

TmpVec.clear();		TmpVec.clear();
collectUsedGlobalVariables(M, TmpVec, false);		collectUsedGlobalVariables(M, TmpVec, false);
UsedList.insert(TmpVec.begin(), TmpVec.end());		UsedList.insert(TmpVec.begin(), TmpVec.end());

return UsedList;		return UsedList;
}		}

		unsigned getLDSGlobalSizeInBytes(Module &M, const GlobalVariable *LDS) {
		auto *Ty = LDS->getValueType();
		auto SizeInBits = M.getDataLayout().getTypeSizeInBits(Ty).getFixedSize();
		auto SizeInBytes = SizeInBits / 8;
		return SizeInBytes;
		arsenm: This function shouldn't exist, just use DL.getTypeAllocSize
		}

} // end namespace AMDGPU		} // end namespace AMDGPU

} // end namespace llvm		} // end namespace llvm
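
The inline comment above suggests `DL.getTypeAllocSize` in place of dividing the bit size by 8. The two can differ when the bit width is not a multiple of 8, or when the ABI alignment is larger than the store size. The following is a rough standalone model of that difference, not LLVM code: the helper names are hypothetical, and the rounding rules are a simplified restatement of how `DataLayout` describes store size (bits rounded up to whole bytes) and alloc size (store size rounded up to the ABI alignment).

```cpp
#include <cstdint>

// Hypothetical helper: what the patch computes, using truncating division.
std::uint64_t sizeByTruncation(std::uint64_t SizeInBits) {
  return SizeInBits / 8;
}

// Store size: the bit size rounded up to a whole number of bytes.
std::uint64_t storeSize(std::uint64_t SizeInBits) {
  return (SizeInBits + 7) / 8;
}

// Alloc size: the store size rounded up to a multiple of the ABI alignment.
// `AbiAlign` is in bytes and is assumed to be nonzero.
std::uint64_t allocSize(std::uint64_t SizeInBits, std::uint64_t AbiAlign) {
  std::uint64_t Store = storeSize(SizeInBits);
  return (Store + AbiAlign - 1) / AbiAlign * AbiAlign;
}
```

In this model a 1-bit value has a truncated size of 0 bytes but an alloc size of 1 byte, and a 96-bit value with 16-byte alignment occupies 16 bytes rather than 12.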

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-call_diamond_shape.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4

				; New global LDS pointers which point to original LDS globals must have been created.
				; CHECK: @lds_used_within_func.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Uses of LDS globals within this function must have been replaced by their pointer counterparts.
				define internal void @func_uses_lds() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_func.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_func, i32 0, i32 0
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_3() {
				; CHECK: entry:
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_2() {
				; CHECK: entry:
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_1() {
				; CHECK: entry:
				; CHECK: call void @func_does_not_use_lds_2()
				; CHECK: call void @func_does_not_use_lds_3()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_2()
				call void @func_does_not_use_lds_3()
				ret void
				}

				; There is a call graph path from this kernel to `@func_uses_lds`, where LDS is being accessed, hence this kernel
				; must do LDS pointer initialization.
				define protected amdgpu_kernel void @reachable_kernel() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_func to i16), i16 addrspace(3)* @lds_used_within_func.offset, align 2
				; CHECK: call void @func_does_not_use_lds_1()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_1()
				ret void
				}

				; There is NO call graph path from this kernel to `@func_uses_lds` where LDS is being accessed, hence this kernel
				; remains unchanged.
				define protected amdgpu_kernel void @not_reachable_kernel() {
				; CHECK: entry:
				; CHECK: ret void
				entry:
				ret void
				}
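
The CHECK lines above encode the access pattern the pass is meant to produce: the kernel stores the 16-bit address of the LDS global into the `.offset` slot, and the non-kernel function loads that offset and indexes from address zero. The following toy model in standalone C++ treats LDS as a byte array. The layout, the addresses, and every name in it are invented for illustration and do not reflect what the backend actually allocates.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Toy LDS: 64 bytes of kernel-local memory, addressed from zero.
struct ToyLDS {
  std::array<std::uint8_t, 64> Bytes{};
};

// Hypothetical addresses chosen for this sketch only.
constexpr std::uint16_t OffsetSlotAddr = 0; // where the `.offset` slot lives
constexpr std::uint16_t ArrayAddr = 16;     // where the [4 x i32] global lives

// Kernel entry: store the address of the array into the offset slot
// (models the `store i16 ptrtoint (...)` in the kernel).
void kernelInit(ToyLDS &LDS) {
  std::uint16_t Addr = ArrayAddr;
  std::memcpy(&LDS.Bytes[OffsetSlotAddr], &Addr, sizeof(Addr));
}

// Non-kernel function: load the offset, index off base zero, then write
// one 32-bit element (models the load + GEP-from-null sequence).
void funcStoreElement(ToyLDS &LDS, unsigned Index, std::uint32_t Value) {
  std::uint16_t Offset;
  std::memcpy(&Offset, &LDS.Bytes[OffsetSlotAddr], sizeof(Offset));
  std::memcpy(&LDS.Bytes[Offset + Index * sizeof(Value)], &Value,
              sizeof(Value));
}

// Read an element through the array's known address, as a kernel that
// still refers to the original global would.
std::uint32_t readElementDirect(const ToyLDS &LDS, unsigned Index) {
  std::uint32_t Value;
  std::memcpy(&Value, &LDS.Bytes[ArrayAddr + Index * sizeof(Value)],
              sizeof(Value));
  return Value;
}
```

After `kernelInit`, a write made through the loaded offset is visible through the direct address, which is the equivalence the transformed IR relies on.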

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-call_miscellaneous.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_1 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_2 = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function_3 = internal addrspace(3) global [4 x i32] undef, align 4

				; New global LDS pointers which point to original LDS globals must have been created.
				; CHECK: @lds_used_within_function_1.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_2.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @lds_used_within_function_3.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Uses of LDS globals within this function must have been replaced by their pointer counterparts.
				define internal void @function_3() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_3.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_3, i32 0, i32 0
				ret void
				}

				; Uses of LDS globals within this function must have been replaced by their pointer counterparts.
				define internal void @function_2() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_2.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_2, i32 0, i32 0
				ret void
				}

				; Uses of LDS globals within this function must have been replaced by their pointer counterparts.
				define internal void @function_1() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function_1.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function_1, i32 0, i32 0
				ret void
				}

				; This kernel calls functions 3 and 1, hence only lds globals which are used within functions 3 and 1 are
				; considered here for corresponding pointer initialization.
				define protected amdgpu_kernel void @kernel_calls_function_3_and_1() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.offset, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.offset, align 2
				; CHECK: call void @function_3()
				; CHECK: call void @function_1()
				; CHECK: ret void
				entry:
				call void @function_3()
				call void @function_1()
				ret void
				}

				; This kernel calls functions 2 and 3, hence only lds globals which are used within functions 2 and 3 are
				; considered here for corresponding pointer initialization.
				define protected amdgpu_kernel void @kernel_calls_function_2_and_3() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_3 to i16), i16 addrspace(3)* @lds_used_within_function_3.offset, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.offset, align 2
				; CHECK: call void @function_2()
				; CHECK: call void @function_3()
				; CHECK: ret void
				entry:
				call void @function_2()
				call void @function_3()
				ret void
				}

				; This kernel calls functions 1 and 2, hence only lds globals which are used within functions 1 and 2 are
				; considered here for corresponding pointer initialization.
				define protected amdgpu_kernel void @kernel_calls_function_1_and_2() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_2 to i16), i16 addrspace(3)* @lds_used_within_function_2.offset, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function_1 to i16), i16 addrspace(3)* @lds_used_within_function_1.offset, align 2
				; CHECK: call void @function_1()
				; CHECK: call void @function_2()
				; CHECK: ret void
				entry:
				call void @function_1()
				call void @function_2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-indirect_call_diamond_shape.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_func = internal addrspace(3) global [4 x i32] undef, align 4

				; New global LDS pointer should NOT have been created.
				; CHECK-NOT: @lds_used_within_func.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Other global variables should remain as they are.
				; CHECK: @ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds, align 8
				@ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds, align 8

				; Uses of LDS globals within this non-kernel function remain unchanged since this function is INDIRECTLY called.
				define internal void @func_uses_lds() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_func, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_func, i32 0, i32 0
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_3() {
				; CHECK: entry:
				; CHECK: %fptr = load void ()*, void ()** @ptr_to_func, align 8
				; CHECK: call void %fptr()
				; CHECK: ret void
				entry:
				%fptr = load void ()*, void ()** @ptr_to_func, align 8
				call void %fptr()
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_2() {
				; CHECK: entry:
				; CHECK: %fptr = load void ()*, void ()** @ptr_to_func, align 8
				; CHECK: call void %fptr()
				; CHECK: ret void
				entry:
				%fptr = load void ()*, void ()** @ptr_to_func, align 8
				call void %fptr()
				ret void
				}

				; This function remains unchanged
				define internal void @func_does_not_use_lds_1() {
				; CHECK: entry:
				; CHECK: call void @func_does_not_use_lds_2()
				; CHECK: call void @func_does_not_use_lds_3()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_2()
				call void @func_does_not_use_lds_3()
				ret void
				}

				; There is a call graph path from this kernel to `@func_uses_lds`, where LDS is being accessed, but `@func_uses_lds`
				; is called INDIRECTLY and hence it is not reachable, and hence pointer replacement does not take place.
				define protected amdgpu_kernel void @reachable_kernel() {
				; CHECK: entry:
				; CHECK: call void @func_does_not_use_lds_1()
				; CHECK: ret void
				entry:
				call void @func_does_not_use_lds_1()
				ret void
				}

				; There is NO call graph path from this kernel to `@func_uses_lds` where LDS is being accessed, hence this kernel
				; remains unchanged.
				define protected amdgpu_kernel void @not_reachable_kernel() {
				; CHECK: entry:
				; CHECK: ret void
				entry:
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-small_lds.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @small_lds = addrspace(3) global float undef, align 8
				@small_lds = addrspace(3) global float undef, align 8

				; A new global LDS pointer is not expected to be created.
				; CHECK-NOT: @small_lds.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; This function uses LDS global `@small_lds`, and is reachable from `@kern`, but since `@small_lds` is too small
				; for pointer replacement, it is ignored.
				define void @func() {
				; CHECK: entry:
				; CHECK: %dec = atomicrmw fsub float addrspace(3)* @small_lds, float 1.000000e+00 monotonic, align 4
				; CHECK: ret void
				entry:
				%dec = atomicrmw fsub float addrspace(3)* @small_lds, float 1.0 monotonic
				ret void
				}

				define amdgpu_kernel void @kern() {
				; CHECK: entry:
				; CHECK: call void @func()
				; CHECK: ret void
				entry:
				call void @func()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_both_within_kernel_and_func.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_both_within_kernel_and_function = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_both_within_kernel_and_function = internal addrspace(3) global [4 x i32] undef, align 4

				; New global LDS pointers which point to original LDS globals must have been created.
				; CHECK: @lds_used_both_within_kernel_and_function.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Uses of LDS globals within this non-kernel function should be replaced by pointers.
				define internal void @func_uses_lds() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_both_within_kernel_and_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_both_within_kernel_and_function, i32 0, i32 0
				ret void
				}

				; Pointers should be initialized within this kernel, but uses of the original LDS within this kernel need NOT be replaced.
				define protected amdgpu_kernel void @kernel_uses_lds() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_both_within_kernel_and_function to i16), i16 addrspace(3)* @lds_used_both_within_kernel_and_function.offset, align 2
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_both_within_kernel_and_function, i32 0, i32 0
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_both_within_kernel_and_function, i32 0, i32 0
				call void @func_uses_lds()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_only_within_func.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4

				; New global LDS pointers which point to original LDS globals must have been created.
				; CHECK: @lds_used_within_function.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Uses of LDS globals within this non-kernel function should be replaced by pointers.
				define internal void @func_uses_lds() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0
				ret void
				}

				; Pointers should be initialized within this kernel.
				define protected amdgpu_kernel void @kernel_uses_lds() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function to i16), i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: call void @func_uses_lds()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_only_within_kernel.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_kernel = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_kernel = internal addrspace(3) global [4 x i32] undef, align 4

				; Since the LDS global is used only within the kernel, no pointer replacement is required, hence
				; the global pointer should NOT have been created.
				; CHECK-NOT: @lds_used_within_kernel.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Kernel remains unchanged.
				define protected amdgpu_kernel void @kernel_uses_lds() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_kernel, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_kernel, i32 0, i32 0
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_both_global_and_func_scope.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original globals should remain as they are.
				; CHECK: @ignored1 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @ignored2 = addrspace(1) global i64 0
				; CHECK: @llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @ignored1 to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored2 to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
				; CHECK: @llvm.compiler.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @ignored1 to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored2 to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
				@ignored1 = internal addrspace(3) global [4 x i32] undef, align 4
				@ignored2 = addrspace(1) global i64 0
				@llvm.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @ignored1 to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored2 to i8 addrspace(1)*) to i8*)], section "llvm.metadata"
				@llvm.compiler.used = appending global [2 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @ignored1 to i8 addrspace(3)*) to i8*), i8* addrspacecast (i8 addrspace(1)* bitcast (i64 addrspace(1)* @ignored2 to i8 addrspace(1)*) to i8*)], section "llvm.metadata"

				; New global LDS pointer is not expected to be created.
				; CHECK-NOT: @ignored1.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; This function uses LDS global `@ignored1`, and is reachable from `@kernel`, but since `@ignored1` is
				; also used within global scope, pointer replacement is ignored.
				define void @func() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @ignored1, i32 0, i32 0
				; CHECK: %unused0 = atomicrmw add i64 addrspace(1)* @ignored2, i64 1 monotonic
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @ignored1, i32 0, i32 0
				%unused0 = atomicrmw add i64 addrspace(1)* @ignored2, i64 1 monotonic
				ret void
				}

				define protected amdgpu_kernel void @kernel() {
				; CHECK: entry:
				; CHECK: call void @func()
				; CHECK: ret void
				entry:
				call void @func()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_both_global_and_func_scope2.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original globals should remain as they are.
				; CHECK: @ignored1 = internal addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @ignored2 = addrspace(1) global float* addrspacecast (float addrspace(3)* bitcast ([4 x i32] addrspace(3)* @ignored1 to float addrspace(3)*) to float*), align 8
				@ignored1 = internal addrspace(3) global [4 x i32] undef, align 4
				@ignored2 = addrspace(1) global float* addrspacecast ([4 x i32] addrspace(3)* @ignored1 to float*), align 8

				; New global LDS pointer is not expected to be created.
				; CHECK-NOT: @ignored1.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; This function uses LDS global `@ignored1`, and is reachable from `@kernel`, but since `@ignored1` is
				; also used within global scope, pointer replacement is ignored.
				define void @func() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @ignored1, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @ignored1, i32 0, i32 0
				ret void
				}

				define protected amdgpu_kernel void @kernel() {
				; CHECK: entry:
				; CHECK: call void @func()
				; CHECK: ret void
				entry:
				call void @func()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_const_expr.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				@used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4
				@used_within_both_func_and_kern = addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4
				; CHECK: @used_within_both_func_and_kern = addrspace(3) global [4 x i32] undef, align 4

				; LDS pointers should be created for vars `@used_only_within_func` and `@used_within_both_func_and_kern`
				; since both of them are used within non-kernel functions, but not for var `@used_only_within_kern` since
				; it is used only within kernel.
				; CHECK: @used_only_within_func.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK: @used_within_both_func_and_kern.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2
				; CHECK-NOT: @used_only_within_kern.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; pointer replacement is required for `@used_only_within_func`
				define i32 @get() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @used_only_within_func.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i64
				; CHECK: %6 = add i64 %5, %5
				; CHECK: %7 = inttoptr i64 %6 to i32*
				; CHECK: %8 = load i32, i32* %7, align 4
				; CHECK: ret i32 %8
				entry:
				%0 = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret i32 %0
				}

				; pointer replacement is required for `@used_within_both_func_and_kern`
				define void @set(i32 %x) {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @used_within_both_func_and_kern.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 0
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i64
				; CHECK: %6 = add i64 %5, %5
				; CHECK: %7 = inttoptr i64 %6 to i32*
				; CHECK: store i32 %x, i32* %7, align 4
				; CHECK: ret void
				entry:
				store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_within_both_func_and_kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_within_both_func_and_kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				ret void
				}

				; pointer replacement is not required for `@used_only_within_kern`
				define amdgpu_kernel void @timestwo() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_within_both_func_and_kern to i16), i16 addrspace(3)* @used_within_both_func_and_kern.offset, align 2
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.offset, align 2
				; CHECK: %ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_within_both_func_and_kern, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: %mul = mul i32 %ld, 2
				; CHECK: store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_only_within_kern, i32 0, i32 0) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @used_within_both_func_and_kern, i32 0, i32 0) to i32*) to i64)) to i32*), align 4
				; CHECK: call void @set(i32 0)
				; CHECK: %0 = call i32 @get()
				; CHECK: ret void
				entry:
				%ld = load i32, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_within_both_func_and_kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				%mul = mul i32 %ld, 2
				store i32 %mul, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_kern to i32 addrspace(3)*) to i32*) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_within_both_func_and_kern to i32 addrspace(3)*) to i32*) to i64)) to i32*), align 4
				call void @set(i32 0)
				%0 = call i32 @get()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_const_expr2.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				@lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4
				@global_pointer = addrspace(1) global i32 undef, align 4

				; LDS pointer should be created for `@lds_used_within_function`
				; CHECK: @lds_used_within_function.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; pointer replacement for `@lds_used_within_function` required
				define internal void @func_uses_lds2() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = ptrtoint [4 x i32] addrspace(3)* %2 to i32
				; CHECK: %4 = add i32 %3, %3
				; CHECK: store i32 %4, i32 addrspace(1)* @global_pointer, align 4
				; CHECK: ret void
				entry:
				store i32 add (i32 ptrtoint (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0) to i32), i32 ptrtoint (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0) to i32)), i32 addrspace(1)* @global_pointer, align 4
				ret void
				}

				; pointer replacement for `@lds_used_within_function` required
				define internal void @func_uses_lds() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = ptrtoint [4 x i32] addrspace(3)* %2 to i32
				; CHECK: %4 = add i32 %3, %3
				; CHECK: store i32 %4, i32 addrspace(1)* @global_pointer, align 4
				; CHECK: ret void
				entry:
				store i32 add (i32 ptrtoint (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0) to i32), i32 ptrtoint (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0) to i32)), i32 addrspace(1)* @global_pointer, align 4
				ret void
				}

				define protected amdgpu_kernel void @kernel_uses_lds() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function to i16), i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: call void @func_uses_lds()
				; CHECK: call void @func_uses_lds2()
				; CHECK: ret void
				entry:
				call void @func_uses_lds()
				call void @func_uses_lds2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_const_expr3.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				@lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4
				@global_var = internal addrspace(1) global [4 x i32] undef, align 4

				; LDS pointer should be created for `@lds_used_within_function`
				; CHECK: @lds_used_within_function.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; pointer replacement for `@lds_used_within_function` required
				define internal void @func_uses_lds2() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 2
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i32
				; CHECK: %6 = add i32 %5, ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				; CHECK: ret void
				entry:
				%0 = add i32 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 2) to i32*) to i32), ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				ret void
				}

				; pointer replacement for `@lds_used_within_function` required
				define internal void @func_uses_lds1() {
				; CHECK: entry:
				; CHECK: %0 = load i16, i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
				; CHECK: %2 = bitcast i8 addrspace(3)* %1 to [4 x i32] addrspace(3)*
				; CHECK: %3 = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* %2, i32 0, i32 2
				; CHECK: %4 = addrspacecast i32 addrspace(3)* %3 to i32*
				; CHECK: %5 = ptrtoint i32* %4 to i32
				; CHECK: %6 = add i32 %5, ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				; CHECK: ret void
				entry:
				%0 = add i32 ptrtoint (i32* addrspacecast (i32 addrspace(3)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 2) to i32*) to i32), ptrtoint (i32 addrspace(1)* getelementptr inbounds ([4 x i32], [4 x i32] addrspace(1)* @global_var, i32 0, i32 2) to i32)
				ret void
				}

				define protected amdgpu_kernel void @kernel_uses_lds() {
				; CHECK: entry:
				; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @lds_used_within_function to i16), i16 addrspace(3)* @lds_used_within_function.offset, align 2
				; CHECK: call void @func_uses_lds1()
				; CHECK: call void @func_uses_lds2()
				; CHECK: ret void
				entry:
				call void @func_uses_lds1()
				call void @func_uses_lds2()
				ret void
				}

llvm/test/CodeGen/AMDGPU/lds_replace_by_pointer-use_within_not_rechable_func.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-replace-lds-use-with-pointer -S < %s | FileCheck %s

				; Original LDS globals should remain as they are.
				; CHECK: @lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4
				@lds_used_within_function = internal addrspace(3) global [4 x i32] undef, align 4

				; Other global variables should remain as they are.
				; CHECK: @ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds_2, align 8
				@ptr_to_func = internal local_unnamed_addr externally_initialized global void ()* @func_uses_lds_2, align 8

				; New global LDS pointer is not expected to be created.
				; CHECK-NOT: @lds_used_within_function.offset = internal unnamed_addr addrspace(3) global i16 undef, align 2

				; Uses of LDS globals within this non-kernel function remain unchanged since this function is not called.
				define internal void @func_uses_lds_1() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0
				ret void
				}

				; Uses of LDS globals within this non-kernel function remain unchanged since this function is only called INDIRECTLY.
				define internal void @func_uses_lds_2() {
				; CHECK: entry:
				; CHECK: %gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0
				; CHECK: ret void
				entry:
				%gep = getelementptr inbounds [4 x i32], [4 x i32] addrspace(3)* @lds_used_within_function, i32 0, i32 0
				ret void
				}

				; Kernel remains unchanged.
				define protected amdgpu_kernel void @kernel() {
				; CHECK: entry:
				; CHECK: %fptr = load void ()*, void ()** @ptr_to_func, align 8
				; CHECK: call void %fptr()
				; CHECK: ret void
				entry:
				%fptr = load void ()*, void ()** @ptr_to_func, align 8
				call void %fptr()
				ret void
				}

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-constantexpr-use.ll

	; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s			; RUN: opt -S -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
	; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=ASM %s			; RUN: llc -disable-promote-alloca-to-vector -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-lds-replace-with-pointer=false -amdgpu-enable-lower-module-lds=false < %s | FileCheck -check-prefix=ASM %s

	target datalayout = "A5"			target datalayout = "A5"

	@all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4			@all_lds = internal unnamed_addr addrspace(3) global [16384 x i32] undef, align 4
	@some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4			@some_lds = internal unnamed_addr addrspace(3) global [32 x i32] undef, align 4

	@initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4			@initializer_user_some = addrspace(1) global i32 ptrtoint ([32 x i32] addrspace(3)* @some_lds to i32), align 4
	@initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4			@initializer_user_all = addrspace(1) global i32 ptrtoint ([16384 x i32] addrspace(3)* @all_lds to i32), align 4