This is an archive of the discontinued LLVM Phabricator instance.

Scalarization for global uniform loads
ClosedPublic

Authored by alex-t on Nov 21 2016, 8:55 AM.


Details

Reviewers

rampitec
nhaustov
• tstellarAMD
vpykhtin
arsenm

Commits

rG18009560c59d: [AMDGPU] Scalarization of global uniform loads.
rL289076: [AMDGPU] Scalarization of global uniform loads.

Summary

LC can currently select a scalar load for a uniform memory access based on the read-only memory address space only. This restriction originates from the fact that in HW prior to VI the vector and scalar caches are not coherent. With MemoryDependenceAnalysis we can check that the memory location corresponding to the memory operand of the LOAD is not clobbered along any path from the function entry.
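As a rough illustration of the selection rule this summary describes, the sketch below models the decision as a plain predicate. It is not code from the patch: ToyLoad, AddrSpace and canSelectScalarLoad are hypothetical names, and the ScalarizeGlobal parameter only mirrors the opt-in option mentioned later in this thread.

```cpp
#include <cassert>

// Hypothetical stand-ins for the address spaces discussed in this review.
enum class AddrSpace { Global, Constant, Other };

// Toy view of a load at instruction-selection time (all names are illustrative).
struct ToyLoad {
  AddrSpace AS;
  bool IsUniform;    // every lane uses the same address
  bool IsNoClobber;  // no write to the location on any path from the entry
};

// Returns true when this sketch would allow a scalar load.
// ScalarizeGlobal models the opt-in behaviour of the new mode.
bool canSelectScalarLoad(const ToyLoad &L, bool ScalarizeGlobal) {
  if (!L.IsUniform)
    return false;  // divergent addresses need a vector load
  if (L.AS == AddrSpace::Constant)
    return true;   // read-only address space: the pre-existing rule
  return ScalarizeGlobal && L.AS == AddrSpace::Global && L.IsNoClobber;
}

// Small usage example.
void exampleScalarLoadSelection() {
  assert(canSelectScalarLoad({AddrSpace::Constant, true, false}, false));
  assert(canSelectScalarLoad({AddrSpace::Global, true, true}, true));
  assert(!canSelectScalarLoad({AddrSpace::Global, true, false}, true));
}
```

The global case needs all three conditions at once: the mode is enabled, the address is uniform, and the location is known not to be clobbered.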

Diff Detail

Event Timeline

alex-t updated this revision to Diff 78729. Nov 21 2016, 8:55 AM

alex-t retitled this revision from to Scalarization for global uniform loads.

alex-t updated this object.

Herald added a reviewer: • tstellarAMD. Nov 21 2016, 8:55 AM

Herald added subscribers: nhaehnle, arsenm.

alex-t added reviewers: rampitec, nhaustov, vpykhtin, arsenm. Nov 21 2016, 8:56 AM

Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493

This is not valid to run on just any function. A value shall not be written to memory starting from the kernel. We currently inline everything, but when we have calls that will be an error.

rampitec added inline comments. Nov 21 2016, 10:50 AM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
173	What is wrong with Def?
lib/Target/AMDGPU/SIISelLowering.cpp
535	I see it already exists in SITargetLowering::isMemOpUniform(), but we need a better way to identify a kernarg than UndefValue. Especially because later the very same logic will be used for user's non-inlined functions.
536	I do not follow the logic. Why are GlobalValue and Constant pointers always no-clobber?
lib/Target/AMDGPU/SMInstructions.td
227	It also has to be uniform.
test/CodeGen/AMDGPU/global_smrd.ll
2	Please add -verify-machineinstrs

Fixes according to the reviewers' requests.

Herald added a subscriber: wdng. Nov 22 2016, 7:33 AM

alex-t added inline comments. Nov 22 2016, 7:33 AM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
173	Since I could not come up with an example where the instruction that defines the pointer is of the vector (non-uniform) kind, I just deleted "isDef()" as you suggested.
lib/Target/AMDGPU/SIISelLowering.cpp
535	I don't understand this either. From my observation "kernarg" is always "llvm::Argument". The code itself was obviously copied from SITargetLowering::isMemOpUniform(). BTW, what check would you suggest specifically for kernel arguments?
536	This is not a "logic" - just "copy-paste" )

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

Also, this needs more tests. You can borrow the ones from the patch I mentioned earlier.

In D26917#601398, @tstellarAMD wrote:

Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493

Tom, your patch is cool. The only thing I don't like about it is that you have to change the address space of "not-clobberable" pointers. I cannot take into account all possible passes that may (or may not) rely on the correct (unchanged) address space. Imagine that somebody later invents a very cool optimization that is legal for strictly read-only memory but not legal for global memory (even if it is not clobberable).

The problem with this patch is that I have to change a huge amount of tests.
I looked into the several failed lit tests.

The reason is as follows:

Most of the tests are intended to be as simple as possible, which is why they don't use divergent intrinsics unless the test needs them. As a result they use uniform loads to retrieve the data.
Any arithmetic instructions taking this uniform data as operands become uniform as well. Since we use ISel to deduce the scalar/vector form of an operation, most of the instruction flow ends up scalar.

For example this simple input

%b_ptr = getelementptr <2 x i32>, <2 x i32> addrspace(1)* %in, i32 1
%a = load <2 x i32>, <2 x i32> addrspace(1) * %in
%b = load <2 x i32>, <2 x i32> addrspace(1) * %b_ptr
%result = and <2 x i32> %a, %b
store <2 x i32> %result, <2 x i32> addrspace(1)* %out

will produce mostly scalar flow:

s_load_dwordx2 s[2:3], s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
s_nop 0
s_waitcnt lgkmcnt(0)
s_load_dwordx2 s[4:5], s[2:3], 0x0
v_mov_b32_e32 v3, s1
s_load_dwordx2 s[2:3], s[2:3], 0x8
v_mov_b32_e32 v2, s0
s_waitcnt lgkmcnt(0)
s_and_b32 s3, s5, s3
s_and_b32 s2, s4, s2
v_mov_b32_e32 v0, s2
v_mov_b32_e32 v1, s3
flat_store_dwordx2 v[2:3], v[0:1]
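The propagation effect described above can be sketched with a toy model: a value counts as uniform when it is not itself a divergent source and all of its operands are uniform. This is only an illustration under simplifying assumptions (straight-line code, no control dependence); ToyOp and computeUniform are hypothetical names and not part of LLVM's DivergenceAnalysis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy operation in a straight-line program (names are illustrative).
struct ToyOp {
  bool IsDivergentSource;     // e.g. a per-lane work-item id query
  std::vector<int> Operands;  // indices of earlier operations
};

// Marks an operation uniform when it is not a divergent source and
// every operand it reads is uniform.
std::vector<bool> computeUniform(const std::vector<ToyOp> &Ops) {
  std::vector<bool> Uniform(Ops.size(), true);
  for (std::size_t I = 0; I < Ops.size(); ++I) {
    bool AllUniform = !Ops[I].IsDivergentSource;
    for (int Op : Ops[I].Operands) {
      assert(Op >= 0 && static_cast<std::size_t>(Op) < I &&
             "operands must be defined earlier");
      AllUniform = AllUniform && Uniform[Op];
    }
    Uniform[I] = AllUniform;
  }
  return Uniform;
}
```

With the IR above modelled as five operations (pointer argument, GEP, two loads, and), every entry comes out uniform, which matches the mostly scalar instruction flow. Adding one divergent source as an operand makes the dependent result divergent.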

approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

I just meant that the latter adds information to IR but the former loses information from IR.

In D26917#602618, @tstellarAMD wrote:

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

This seems right to me, but it shall only run on kernel functions.

lib/Target/AMDGPU/SIISelLowering.cpp
535	Since it already exists let's keep it for now. I was thinking about a PseudoSourceValue.

In D26917#602851, @rampitec wrote:

In D26917#602618, @tstellarAMD wrote:

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

Ok, then I don't have any objections to this approach.

lib/Target/AMDGPU/SIISelLowering.cpp
536	Do we actually need all this code here? Isn't it enough just to check for the metadata?

In D26917#602882, @tstellarAMD wrote:

In D26917#602851, @rampitec wrote:

In D26917#602618, @tstellarAMD wrote:

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

Ok, then I don't have any objections to this approach.

There is one serious drawback in my approach: metadata cannot be set on Argument. So even trivial example like this "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need Instruction (i.e. GEP). So I'd have to transform

load i32, i32 addrspace(1)* %arg

to

%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep

to set "noclobber" on GEP.
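A minimal sketch of that workaround, using a toy IR rather than LLVM classes: arguments cannot hold metadata, so a pointer argument is wrapped once in a zero-index GEP stand-in and that instruction carries the marker. ToyValue, ToyFunction and getNoClobberCarrier are hypothetical names; only the "amdgpu.noclobber" string and the idea of caching one clone per pointer come from the patch under review.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Toy IR value: only instructions may carry metadata (names are illustrative).
struct ToyValue {
  std::string Name;
  bool IsInstruction;
  std::set<std::string> Metadata;  // stays empty for arguments
};

class ToyFunction {
  std::vector<std::unique_ptr<ToyValue>> Values;
  // One carrier instruction per wrapped pointer, so repeated loads share it.
  std::map<const ToyValue *, ToyValue *> NoClobberClones;

  ToyValue *create(const std::string &Name, bool IsInstruction) {
    Values.emplace_back(new ToyValue{Name, IsInstruction, {}});
    return Values.back().get();
  }

public:
  ToyValue *addArgument(const std::string &Name) { return create(Name, false); }

  // Returns an instruction that is equivalent to Ptr and carries the marker.
  ToyValue *getNoClobberCarrier(ToyValue *Ptr) {
    assert(Ptr && "pointer operand must exist");
    if (!Ptr->IsInstruction) {
      auto It = NoClobberClones.find(Ptr);
      if (It != NoClobberClones.end())
        return It->second;  // reuse the clone created earlier
      // Stands in for "getelementptr ..., %arg, i32 0".
      ToyValue *Gep = create(Ptr->Name + ".gep", true);
      NoClobberClones[Ptr] = Gep;
      Ptr = Gep;
    }
    Ptr->Metadata.insert("amdgpu.noclobber");
    return Ptr;
  }
};
```

Calling getNoClobberCarrier twice on the same argument returns the same carrier, so several loads from one pointer do not create several wrapper instructions.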

In D26917#609035, @alex-t wrote:

There is one serious drawback in my approach: metadata cannot be set on Argument. So even trivial example like this "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need Instruction (i.e. GEP). So I'd have to transform

load i32, i32 addrspace(1)* %arg

to

%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep

to set "noclobber" on GEP.

I do not think this is an issue. We used this with HSAIL for a long time and seen no problems. Moreover, with call support the same will be needed for the uniformness metadata as well.

This is an improved implementation of the global memory scalarization. It checks whether the memory location is clobbered along the CFG up to the Function boundary. This approach is restricted by the FunctionPass capabilities and is not allowed to go beyond the current Function, so we cannot check accesses to Module-level variables outside the Function. That is why the analysis is restricted to kernels only, given that any function calls (when implemented) will be considered clobbers for any memory location.
This implementation relies on the existing Divergence analysis and does not attempt to improve its results.
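The search described here can be sketched on a toy CFG: collect every block that can reach the load by walking predecessor edges, then report a clobber if any collected block writes the location. This is a simplification under stated assumptions, not the patch's code: ToyBlock, collectPreds and isClobberedBefore are hypothetical names, a single boolean replaces the MemoryDependenceAnalysis query, and the load's own block is treated as a whole instead of only the instructions before the load.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy CFG block (names are illustrative).
struct ToyBlock {
  std::vector<int> Preds;  // indices of predecessor blocks
  bool WritesLocation;     // stands in for "contains a clobbering access"
};

// Depth-first walk over predecessor edges; Seen doubles as the visited set.
void collectPreds(const std::vector<ToyBlock> &CFG, int Root,
                  std::set<int> &Seen) {
  assert(Root >= 0 && static_cast<std::size_t>(Root) < CFG.size());
  for (int P : CFG[Root].Preds)
    if (Seen.insert(P).second)
      collectPreds(CFG, P, Seen);
}

// True if some block on a path from the entry to LoadBlock may write
// the location that the load reads.
bool isClobberedBefore(const std::vector<ToyBlock> &CFG, int LoadBlock) {
  std::set<int> Checklist{LoadBlock};
  collectPreds(CFG, LoadBlock, Checklist);
  for (int B : Checklist)
    if (CFG[B].WritesLocation)
      return true;
  return false;
}
```

In a diamond-shaped CFG where only one arm stores to the location, a load in the join block is reported as clobbered, while a load in the other arm is not.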

Global loads scalarization is SWITCHED OFF by default.
To enable it, use the "-amdgpu-scalarize-global-loads=true" LLC option.

Further work is planned to improve the current implementation.
Namely:
Support for a constant expression as the pointer operand.
Caching the results of the DFS along the CFG for clobbering memory accesses, to shorten the search path and improve compile time on large CFGs.
There is currently no DFS depth limit, since I had relevant experience in the HSAIL backend and did not observe a serious compile-time impact even on large source files. If somebody feels it is necessary, I can add a depth limit.
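One possible shape for the planned caching, sketched on a toy CFG and for a single memory location: compute once, by forward propagation from every writing block, whether a write may already have happened when control enters each block, then answer later queries with a table lookup. This is an assumption about how such a cache could look, not the author's design; FwdBlock and computeClobberedOnEntry are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy CFG block with successor edges (names are illustrative).
struct FwdBlock {
  std::vector<int> Succs;  // indices of successor blocks
  bool WritesLocation;     // block may write the tracked location
};

// Computes, once per function, whether a write to the location may have
// happened before control enters each block.
std::vector<bool> computeClobberedOnEntry(const std::vector<FwdBlock> &CFG) {
  std::vector<bool> OnEntry(CFG.size(), false);
  std::deque<int> Worklist;
  for (std::size_t B = 0; B < CFG.size(); ++B)
    if (CFG[B].WritesLocation)
      Worklist.push_back(static_cast<int>(B));
  while (!Worklist.empty()) {
    int B = Worklist.front();
    Worklist.pop_front();
    for (int S : CFG[B].Succs) {
      assert(S >= 0 && static_cast<std::size_t>(S) < CFG.size());
      if (!OnEntry[S]) {  // first time a write is known to reach S
        OnEntry[S] = true;
        Worklist.push_back(S);
      }
    }
  }
  return OnEntry;
}
```

Each block is marked at most once, so the table costs one pass over the CFG edges regardless of how many loads are queried afterwards.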

rampitec added inline comments. Dec 2 2016, 2:27 PM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26	Includes should be alphabetically sorted.
82	Could you please move brace in line with the for loop or remove braces?
96	Could you please capitalize names "checklist" and "load"? Same for other variables.
100	Would be nice to have spacing around "*" consistent.
113	Can you avoid const_cast and use const_iterator?
116	Inconsistent indent.
141	Inconsistent indent.
147	I would suggest checking for kernel only once in runOnFunction.
152	I suppose this condition can never happen, as you have replaced all uses of this Ptr below in the else block.

rampitec added inline comments. Dec 2 2016, 2:33 PM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
152	Please disregard this comment, the code is indeed reachable.

Style fixed

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26	Which symbol of the full file path should be used as a sorting key?
113	As you probably know, there is no consistent strategy regarding "const" in LLVM. MemoryDependenceAnalysis::getSimplePointerDependencyFrom, as well as other MDA interface methods, accepts a non-const iterator even though it does not change anything. That's why the only way is to make all the parameters and methods in the whole call stack non-const. I removed the "const" modifier everywhere along the code. No const_casts any longer. Also, I changed getSimplePointerDependencyFrom to getPointerDependencyFrom, which queries the invariant.load metadata and thus potentially provides better alias granularity.

rampitec added inline comments. Dec 5 2016, 9:49 AM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26	You did it right now ;) The key is a full string.
183	That is really better just to return right here if it is not kernel.

alex-t added inline comments. Dec 5 2016, 11:01 AM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183	Really? What about the loads from read-only memory? Aren't they still valid even in a non-kernel?

rampitec added inline comments. Dec 5 2016, 11:08 AM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183	OK, I see your point. But then check for isKernelFunc first in the condition before doing the expensive DFS.
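The ordering suggested here relies on the short-circuit evaluation of &&: with the cheap flag on the left, the expensive query never runs for non-kernel functions. A minimal sketch, with a counter standing in for the cost of the DFS; expensiveClobberSearch, isNotClobbered and SearchCalls are hypothetical names.

```cpp
#include <cassert>

// Counts how often the stand-in for the expensive search runs.
static int SearchCalls = 0;

// Hypothetical stand-in for the DFS-based clobber query.
bool expensiveClobberSearch() {
  ++SearchCalls;
  return false;  // pretend nothing clobbers the location
}

// Cheap flag first: && short-circuits, so the search is skipped
// whenever IsKernelFunc is false.
bool isNotClobbered(bool IsKernelFunc) {
  return IsKernelFunc && !expensiveClobberSearch();
}

// Small usage example.
void exampleShortCircuit() {
  SearchCalls = 0;
  assert(!isNotClobbered(false));
  assert(SearchCalls == 0);  // the search did not run for a non-kernel
}
```

With the operands in the opposite order the result is the same, but the search would run for every uniform load in every function.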

alex-t updated this revision to Diff 80302. Dec 5 2016, 11:46 AM

alex-t marked an inline comment as done.

LGTM

This revision is now accepted and ready to land. Dec 5 2016, 12:02 PM

arsenm added inline comments. Dec 5 2016, 3:31 PM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
20–21	Alphabetize
37	*LI
88–93	C++ style comments
lib/Target/AMDGPU/SIISelLowering.cpp
2620	Previous line
test/CodeGen/AMDGPU/global_smrd_cfg.ll
2	You don't need the -O2 since that's the default. Can you also change the check prefixes to GCN, and also run instnamer (same for the other tests)

arsenm added inline comments. Dec 5 2016, 3:32 PM

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
136–137	Weird formatting
150–151	else on same line as previous }
153–154	Asterisks to right

alex-t updated this revision to Diff 80390. Dec 6 2016, 2:23 AM

alex-t edited edge metadata.

Closed by commit rL289076: [AMDGPU] Scalarization of global uniform loads. (authored by alex-t). Dec 8 2016, 9:39 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                   Size
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp      98 lines
lib/Target/AMDGPU/AMDGPUSubtarget.h                    4 lines
lib/Target/AMDGPU/AMDGPUSubtarget.cpp                  1 line
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp              10 lines
lib/Target/AMDGPU/SIISelLowering.h                     1 line
lib/Target/AMDGPU/SIISelLowering.cpp                   18 lines
lib/Target/AMDGPU/SMInstructions.td                    8 lines
test/CodeGen/AMDGPU/global_smrd.ll                     126 lines
test/CodeGen/AMDGPU/global_smrd_cfg.ll                 91 lines

Diff 80260

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp

	[...]
	/// \file			/// \file
	/// This pass adds amdgpu.uniform metadata to IR values so this information			/// This pass adds amdgpu.uniform metadata to IR values so this information
	/// can be used during instruction selection.			/// can be used during instruction selection.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "AMDGPU.h"			#include "AMDGPU.h"
	#include "AMDGPUIntrinsicInfo.h"			#include "AMDGPUIntrinsicInfo.h"
				#include "llvm/ADT/SetVector.h"
	#include "llvm/Analysis/DivergenceAnalysis.h"			#include "llvm/Analysis/DivergenceAnalysis.h"
				#include "llvm/Analysis/MemoryDependenceAnalysis.h"
				#include "llvm/Analysis/LoopInfo.h"
	#include "llvm/IR/InstVisitor.h"			#include "llvm/IR/InstVisitor.h"
	#include "llvm/IR/IRBuilder.h"			#include "llvm/IR/IRBuilder.h"
	#include "llvm/Support/Debug.h"			#include "llvm/Support/Debug.h"
	#include "llvm/Support/raw_ostream.h"			#include "llvm/Support/raw_ostream.h"

	#define DEBUG_TYPE "amdgpu-annotate-uniform"			#define DEBUG_TYPE "amdgpu-annotate-uniform"

	using namespace llvm;			using namespace llvm;

	namespace {			namespace {

	class AMDGPUAnnotateUniformValues : public FunctionPass,			class AMDGPUAnnotateUniformValues : public FunctionPass,
	public InstVisitor<AMDGPUAnnotateUniformValues> {			public InstVisitor<AMDGPUAnnotateUniformValues> {
	DivergenceAnalysis *DA;			DivergenceAnalysis *DA;
				MemoryDependenceResults * MDR;
				LoopInfo * LI;
				DenseMap<Value*, GetElementPtrInst*> noClobberClones;
				bool isKernelFunc;

	public:			public:
	static char ID;			static char ID;
	AMDGPUAnnotateUniformValues() :			AMDGPUAnnotateUniformValues() :
	FunctionPass(ID) { }			FunctionPass(ID) { }
	bool doInitialization(Module &M) override;			bool doInitialization(Module &M) override;
	bool runOnFunction(Function &F) override;			bool runOnFunction(Function &F) override;
	StringRef getPassName() const override {			StringRef getPassName() const override {
	return "AMDGPU Annotate Uniform Values";			return "AMDGPU Annotate Uniform Values";
	}			}
	void getAnalysisUsage(AnalysisUsage &AU) const override {			void getAnalysisUsage(AnalysisUsage &AU) const override {
	AU.addRequired<DivergenceAnalysis>();			AU.addRequired<DivergenceAnalysis>();
				AU.addRequired<MemoryDependenceWrapperPass>();
				AU.addRequired<LoopInfoWrapperPass>();
	AU.setPreservesAll();			AU.setPreservesAll();
	}			}

	void visitBranchInst(BranchInst &I);			void visitBranchInst(BranchInst &I);
	void visitLoadInst(LoadInst &I);			void visitLoadInst(LoadInst &I);
				bool isClobberedInFunction(LoadInst * Load);
	};			};

	} // End anonymous namespace			} // End anonymous namespace

	INITIALIZE_PASS_BEGIN(AMDGPUAnnotateUniformValues, DEBUG_TYPE,			INITIALIZE_PASS_BEGIN(AMDGPUAnnotateUniformValues, DEBUG_TYPE,
	"Add AMDGPU uniform metadata", false, false)			"Add AMDGPU uniform metadata", false, false)
	INITIALIZE_PASS_DEPENDENCY(DivergenceAnalysis)			INITIALIZE_PASS_DEPENDENCY(DivergenceAnalysis)
				INITIALIZE_PASS_DEPENDENCY(MemoryDependenceWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
	INITIALIZE_PASS_END(AMDGPUAnnotateUniformValues, DEBUG_TYPE,			INITIALIZE_PASS_END(AMDGPUAnnotateUniformValues, DEBUG_TYPE,
	"Add AMDGPU uniform metadata", false, false)			"Add AMDGPU uniform metadata", false, false)

	char AMDGPUAnnotateUniformValues::ID = 0;			char AMDGPUAnnotateUniformValues::ID = 0;

	static void setUniformMetadata(Instruction *I) {			static void setUniformMetadata(Instruction *I) {
	I->setMetadata("amdgpu.uniform", MDNode::get(I->getContext(), {}));			I->setMetadata("amdgpu.uniform", MDNode::get(I->getContext(), {}));
	}			}
				static void setNoClobberMetadata(Instruction *I) {
				I->setMetadata("amdgpu.noclobber", MDNode::get(I->getContext(), {}));
				}

				static void DFS(BasicBlock * Root, SetVector<BasicBlock*> & Set) {
				for (auto I : predecessors(Root))
				if (Set.insert(I))
				DFS(I, Set);
				}

				bool AMDGPUAnnotateUniformValues::isClobberedInFunction(LoadInst * Load) {
				/*
				1. get Loop for the Load->getparent();
				2. if it exists, collect all the BBs from the most outer
				loop and check for the writes. If NOT - start DFS over all preds.
				3. Start DFS over all preds from the most outer loop header.
				*/
				SetVector<BasicBlock *> Checklist;
				BasicBlock * Start = Load->getParent();
				Checklist.insert(Start);
				const Value * Ptr = Load->getPointerOperand();
				const Loop * L = LI->getLoopFor(Start);
				if (L) {
				const Loop * P = L;
				do {
				L = P;
				P = P->getParentLoop();
				} while (P);
				Checklist.insert(L->block_begin(), L->block_end());
				Start = L->getHeader();
				}

				DFS(Start, Checklist);
				for (auto BB : Checklist) {
				BasicBlock::iterator StartIt = (BB == Load->getParent()) ?
				BasicBlock::iterator(Load) : BB->end();
				if (MDR->getPointerDependencyFrom(MemoryLocation(Ptr),
				true, StartIt, BB, Load).isClobber())
				return true;
				}
				return false;
				}

	void AMDGPUAnnotateUniformValues::visitBranchInst(BranchInst &I) {			void AMDGPUAnnotateUniformValues::visitBranchInst(BranchInst &I) {
	if (I.isUnconditional())			if (I.isUnconditional())
	return;			return;

	Value *Cond = I.getCondition();			Value *Cond = I.getCondition();
	if (!DA->isUniform(Cond))			if (!DA->isUniform(Cond))
	return;			return;

	setUniformMetadata(I.getParent()->getTerminator());			setUniformMetadata(I.getParent()->getTerminator());
	}			}

	void AMDGPUAnnotateUniformValues::visitLoadInst(LoadInst &I) {			void AMDGPUAnnotateUniformValues::visitLoadInst(LoadInst &I) {
	Value *Ptr = I.getPointerOperand();			Value *Ptr = I.getPointerOperand();
	if (!DA->isUniform(Ptr))			if (!DA->isUniform(Ptr))
	return;			return;
				auto isGlobalLoad = [](LoadInst & Load)->bool {
				return Load.getPointerAddressSpace()
				== AMDGPUAS::GLOBAL_ADDRESS;
				};
				// We're tracking up to the Function boundaries
				// We cannot go beyond because of FunctionPass restrictions
				// Thus we can ensure that memory not clobbered for memory
				// operations that live in kernel only.
				bool NotClobbered = !isClobberedInFunction(&I) && isKernelFunc;
				Instruction *PtrI = dyn_cast<Instruction>(Ptr);
				if (!PtrI && NotClobbered && isGlobalLoad(I)) {
				if (isa<Argument>(Ptr) || isa<GlobalValue>(Ptr)) {
				// Lookup for the existing GEP
				if (noClobberClones.count(Ptr)) {
				PtrI = noClobberClones[Ptr];
				}
				else {
				// Create GEP of the Value
				Function * F = I.getParent()->getParent();
				Value * Idx = Constant::getIntegerValue(
				Type::getInt32Ty(Ptr->getContext()), APInt(64, 0));
				// Insert GEP at the entry to make it dominate all uses
				PtrI = GetElementPtrInst::Create(
				Ptr->getType()->getPointerElementType(), Ptr,
				ArrayRef<Value*>(Idx), Twine(""), F->getEntryBlock().getFirstNonPHI());
				}
				I.replaceUsesOfWith(Ptr, PtrI);
				}
				}

	if (Instruction *PtrI = dyn_cast<Instruction>(Ptr))			if (PtrI) {
	setUniformMetadata(PtrI);			setUniformMetadata(PtrI);
				if (NotClobbered)
				setNoClobberMetadata(PtrI);
				}
	}			}

	bool AMDGPUAnnotateUniformValues::doInitialization(Module &M) {			bool AMDGPUAnnotateUniformValues::doInitialization(Module &M) {
	return false;			return false;
	}			}

	bool AMDGPUAnnotateUniformValues::runOnFunction(Function &F) {			bool AMDGPUAnnotateUniformValues::runOnFunction(Function &F) {
	if (skipFunction(F))			if (skipFunction(F))
	return false;			return false;

	DA = &getAnalysis<DivergenceAnalysis>();			DA = &getAnalysis<DivergenceAnalysis>();
	visit(F);			MDR = &getAnalysis<MemoryDependenceWrapperPass>().getMemDep();
				LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
				isKernelFunc = F.getCallingConv() == CallingConv::AMDGPU_KERNEL;

				visit(F);
				noClobberClones.clear();
	return true;			return true;
	}			}

	FunctionPass *			FunctionPass *
	llvm::createAMDGPUAnnotateUniformValues() {			llvm::createAMDGPUAnnotateUniformValues() {
	return new AMDGPUAnnotateUniformValues();			return new AMDGPUAnnotateUniformValues();
	}			}

lib/Target/AMDGPU/AMDGPUSubtarget.h

[...]
protected:
bool HasScalarStores;		bool HasScalarStores;
bool HasInv2PiInlineImm;		bool HasInv2PiInlineImm;
bool FlatAddressSpace;		bool FlatAddressSpace;
bool R600ALUInst;		bool R600ALUInst;
bool CaymanISA;		bool CaymanISA;
bool CFALUBug;		bool CFALUBug;
bool HasVertexCache;		bool HasVertexCache;
short TexVTXClauseSize;		short TexVTXClauseSize;
		bool ScalarizeGlobal;

// Dummy feature to use for assembler in tablegen.		// Dummy feature to use for assembler in tablegen.
bool FeatureDisable;		bool FeatureDisable;

InstrItineraryData InstrItins;		InstrItineraryData InstrItins;
SelectionDAGTargetInfo TSInfo;		SelectionDAGTargetInfo TSInfo;

public:		public:
[...]
unsigned getMaxFlatWorkGroupSize() const {
return 2048;		return 2048;
}		}

/// \returns Number of waves per work group given the flat work group size.		/// \returns Number of waves per work group given the flat work group size.
unsigned getWavesPerWorkGroup(unsigned FlatWorkGroupSize) const {		unsigned getWavesPerWorkGroup(unsigned FlatWorkGroupSize) const {
return alignTo(FlatWorkGroupSize, getWavefrontSize()) / getWavefrontSize();		return alignTo(FlatWorkGroupSize, getWavefrontSize()) / getWavefrontSize();
}		}

		void setScalarizeGlobalBehavior(bool b) { ScalarizeGlobal = b;}
		bool getScalarizeGlobalBehavior() const { return ScalarizeGlobal;}

/// \returns Subtarget's default pair of minimum/maximum flat work group sizes		/// \returns Subtarget's default pair of minimum/maximum flat work group sizes
/// for function \p F, or minimum/maximum flat work group sizes explicitly		/// for function \p F, or minimum/maximum flat work group sizes explicitly
/// requested using "amdgpu-flat-work-group-size" attribute attached to		/// requested using "amdgpu-flat-work-group-size" attribute attached to
/// function \p F.		/// function \p F.
///		///
/// \returns Subtarget's default values if explicitly requested values cannot		/// \returns Subtarget's default values if explicitly requested values cannot
/// be converted to integer, or violate subtarget's specifications.		/// be converted to integer, or violate subtarget's specifications.
std::pair<unsigned, unsigned> getFlatWorkGroupSizes(const Function &F) const;		std::pair<unsigned, unsigned> getFlatWorkGroupSizes(const Function &F) const;
[...]

lib/Target/AMDGPU/AMDGPUSubtarget.cpp

[...]
: AMDGPUGenSubtargetInfo(TT, GPU, FS),
HasInv2PiInlineImm(false),		HasInv2PiInlineImm(false),
FlatAddressSpace(false),		FlatAddressSpace(false),

R600ALUInst(false),		R600ALUInst(false),
CaymanISA(false),		CaymanISA(false),
CFALUBug(false),		CFALUBug(false),
HasVertexCache(false),		HasVertexCache(false),
TexVTXClauseSize(0),		TexVTXClauseSize(0),
		ScalarizeGlobal(false),

FeatureDisable(false),		FeatureDisable(false),
InstrItins(getInstrItineraryForCPU(GPU)),		InstrItins(getInstrItineraryForCPU(GPU)),
TSInfo() {		TSInfo() {
initializeSubtargetDependencies(TT, GPU, FS);		initializeSubtargetDependencies(TT, GPU, FS);
}		}

// FIXME: These limits are for SI. Did they change with the larger maximum LDS		// FIXME: These limits are for SI. Did they change with the larger maximum LDS
[...]

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

[...]

// Option to disable vectorizer for tests.		// Option to disable vectorizer for tests.
static cl::opt<bool> EnableLoadStoreVectorizer(		static cl::opt<bool> EnableLoadStoreVectorizer(
"amdgpu-load-store-vectorizer",		"amdgpu-load-store-vectorizer",
cl::desc("Enable load store vectorizer"),		cl::desc("Enable load store vectorizer"),
cl::init(true),		cl::init(true),
cl::Hidden);		cl::Hidden);

		// Option to to control global loads scalarization
		static cl::opt<bool> ScalarizeGlobal(
		"amdgpu-scalarize-global-loads",
		cl::desc("Enable global load scalarization"),
		cl::init(false),
		cl::Hidden);


extern "C" void LLVMInitializeAMDGPUTarget() {		extern "C" void LLVMInitializeAMDGPUTarget() {
// Register the target		// Register the target
RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());		RegisterTargetMachine<R600TargetMachine> X(getTheAMDGPUTarget());
RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());		RegisterTargetMachine<GCNTargetMachine> Y(getTheGCNTarget());

PassRegistry *PR = PassRegistry::getPassRegistry();		PassRegistry *PR = PassRegistry::getPassRegistry();
initializeSILowerI1CopiesPass(*PR);		initializeSILowerI1CopiesPass(*PR);
initializeSIFixSGPRCopiesPass(*PR);		initializeSIFixSGPRCopiesPass(*PR);
[...]
#else
SIGISelActualAccessor *GISel = new SIGISelActualAccessor();		SIGISelActualAccessor *GISel = new SIGISelActualAccessor();
GISel->CallLoweringInfo.reset(		GISel->CallLoweringInfo.reset(
new AMDGPUCallLowering(*I->getTargetLowering()));		new AMDGPUCallLowering(*I->getTargetLowering()));
#endif		#endif

I->setGISelAccessor(*GISel);		I->setGISelAccessor(*GISel);
}		}

		I->setScalarizeGlobalBehavior(ScalarizeGlobal);

return I.get();		return I.get();
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// AMDGPU Pass Setup		// AMDGPU Pass Setup
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
[...]

lib/Target/AMDGPU/SIISelLowering.h

[...]
public:

EVT getOptimalMemOpType(uint64_t Size, unsigned DstAlign,		EVT getOptimalMemOpType(uint64_t Size, unsigned DstAlign,
unsigned SrcAlign, bool IsMemset,		unsigned SrcAlign, bool IsMemset,
bool ZeroMemset,		bool ZeroMemset,
bool MemcpyStrSrc,		bool MemcpyStrSrc,
MachineFunction &MF) const override;		MachineFunction &MF) const override;

bool isMemOpUniform(const SDNode *N) const;		bool isMemOpUniform(const SDNode *N) const;
		bool isMemOpHasNoClobberedMemOperand(const SDNode *N) const;
bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;		bool isNoopAddrSpaceCast(unsigned SrcAS, unsigned DestAS) const override;

TargetLoweringBase::LegalizeTypeAction		TargetLoweringBase::LegalizeTypeAction
getPreferredVectorAction(EVT VT) const override;		getPreferredVectorAction(EVT VT) const override;

bool shouldConvertConstantLoadToIntImm(const APInt &Imm,		bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
Type *Ty) const override;		Type *Ty) const override;


lib/Target/AMDGPU/SIISelLowering.cpp

Show First 20 Lines • Show All 518 Lines • ▼ Show 20 Lines	return AS == AMDGPUAS::GLOBAL_ADDRESS ||
AS == AMDGPUAS::CONSTANT_ADDRESS;		AS == AMDGPUAS::CONSTANT_ADDRESS;
}		}

bool SITargetLowering::isNoopAddrSpaceCast(unsigned SrcAS,		bool SITargetLowering::isNoopAddrSpaceCast(unsigned SrcAS,
unsigned DestAS) const {		unsigned DestAS) const {
return isFlatGlobalAddrSpace(SrcAS) && isFlatGlobalAddrSpace(DestAS);		return isFlatGlobalAddrSpace(SrcAS) && isFlatGlobalAddrSpace(DestAS);
}		}

		bool SITargetLowering::isMemOpHasNoClobberedMemOperand(const SDNode *N) const {
		const MemSDNode *MemNode = cast<MemSDNode>(N);
		const Value *Ptr = MemNode->getMemOperand()->getValue();
		const Instruction *I = dyn_cast<Instruction>(Ptr);
		return I && I->getMetadata("amdgpu.noclobber");
		}

bool SITargetLowering::isMemOpUniform(const SDNode *N) const {		bool SITargetLowering::isMemOpUniform(const SDNode *N) const {
const MemSDNode *MemNode = cast<MemSDNode>(N);		const MemSDNode *MemNode = cast<MemSDNode>(N);
		rampitec: I see it already exists in SITargetLowering::isMemOpUniform(), but we need a better way to identify a kernarg than UndefValue. Especially because later the very same logic will be used for user's non-inlined functions.
		alex-t (Author): I don't understand this either. From my observation "kernarg" is always "llvm::Argument". The code itself is obviously copied from SITargetLowering::isMemOpUniform(). BTW, what check would you suggest specifically for kernel arguments?
		rampitec: Since it already exists let's keep it for now. I was thinking about a PseudoSourceValue.
const Value *Ptr = MemNode->getMemOperand()->getValue();		const Value *Ptr = MemNode->getMemOperand()->getValue();
		rampitec: I do not follow the logic. Why are GlobalValue and Constant pointers always no-clobber?
		alex-t (Author): This is not a "logic" - just "copy-paste" )
		tstellarAMD: Do we actually need all this code here? Isn't it enough just to check for the metadata?

// UndefValue means this is a load of a kernel input. These are uniform.		// UndefValue means this is a load of a kernel input. These are uniform.
// Sometimes LDS instructions have constant pointers.		// Sometimes LDS instructions have constant pointers.
// If Ptr is null, then that means this mem operand contains a		// If Ptr is null, then that means this mem operand contains a
// PseudoSourceValue like GOT.		// PseudoSourceValue like GOT.
if (!Ptr || isa<UndefValue>(Ptr) || isa<Argument>(Ptr) ||		if (!Ptr || isa<UndefValue>(Ptr) || isa<Argument>(Ptr) ||
isa<Constant>(Ptr) || isa<GlobalValue>(Ptr))		isa<Constant>(Ptr) || isa<GlobalValue>(Ptr))
return true;		return true;
▲ Show 20 Lines • Show All 2,062 Lines • ▼ Show 20 Lines	AS = MFI->hasFlatScratchInit() ?
AMDGPUAS::PRIVATE_ADDRESS : AMDGPUAS::GLOBAL_ADDRESS;		AMDGPUAS::PRIVATE_ADDRESS : AMDGPUAS::GLOBAL_ADDRESS;

unsigned NumElements = MemVT.getVectorNumElements();		unsigned NumElements = MemVT.getVectorNumElements();
switch (AS) {		switch (AS) {
case AMDGPUAS::CONSTANT_ADDRESS:		case AMDGPUAS::CONSTANT_ADDRESS:
if (isMemOpUniform(Load))		if (isMemOpUniform(Load))
return SDValue();		return SDValue();
// Non-uniform loads will be selected to MUBUF instructions, so they		// Non-uniform loads will be selected to MUBUF instructions, so they
// have the same legalization requires ments as global and private		// have the same legalization requirements as global and private
// loads.		// loads.
//		//
LLVM_FALLTHROUGH;		LLVM_FALLTHROUGH;
case AMDGPUAS::GLOBAL_ADDRESS:		case AMDGPUAS::GLOBAL_ADDRESS:
		{
		arsenm: Previous line
		if (isMemOpUniform(Load) && isMemOpHasNoClobberedMemOperand(Load))
		return SDValue();
		// Non-uniform loads will be selected to MUBUF instructions, so they
		// have the same legalization requirements as global and private
		// loads.
		//
		}
		LLVM_FALLTHROUGH;
case AMDGPUAS::FLAT_ADDRESS:		case AMDGPUAS::FLAT_ADDRESS:
if (NumElements > 4)		if (NumElements > 4)
return SplitVectorLoad(Op, DAG);		return SplitVectorLoad(Op, DAG);
// v4 loads are supported for private and global memory.		// v4 loads are supported for private and global memory.
return SDValue();		return SDValue();
case AMDGPUAS::PRIVATE_ADDRESS: {		case AMDGPUAS::PRIVATE_ADDRESS: {
// Depending on the setting of the private_element_size field in the		// Depending on the setting of the private_element_size field in the
// resource descriptor, we can only make private accesses up to a certain		// resource descriptor, we can only make private accesses up to a certain

lib/Target/AMDGPU/SMInstructions.td

	} // SubtargetPredicate = isVI			} // SubtargetPredicate = isVI



	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Scalar Memory Patterns			// Scalar Memory Patterns
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//


	def smrd_load : PatFrag <(ops node:$ptr), (load node:$ptr), [{			def smrd_load : PatFrag <(ops node:$ptr), (load node:$ptr), [{
	auto Ld = cast<LoadSDNode>(N);			auto Ld = cast<LoadSDNode>(N);
	return Ld->getAlignment() >= 4 &&			return Ld->getAlignment() >= 4 &&
	Ld->getAddressSpace() == AMDGPUAS::CONSTANT_ADDRESS &&			((Ld->getAddressSpace() == AMDGPUAS::CONSTANT_ADDRESS &&
	static_cast<const SITargetLowering *>(getTargetLowering())->isMemOpUniform(N);			static_cast<const SITargetLowering *>(getTargetLowering())->isMemOpUniform(N)) ||
				(Subtarget->getScalarizeGlobalBehavior() && Ld->getAddressSpace() == AMDGPUAS::GLOBAL_ADDRESS &&
				static_cast<const SITargetLowering *>(getTargetLowering())->isMemOpUniform(N) &&
				rampitec: It also has to be uniform.
				static_cast<const SITargetLowering *>(getTargetLowering())->isMemOpHasNoClobberedMemOperand(N)));
	}]>;			}]>;

	def SMRDImm : ComplexPattern<i64, 2, "SelectSMRDImm">;			def SMRDImm : ComplexPattern<i64, 2, "SelectSMRDImm">;
	def SMRDImm32 : ComplexPattern<i64, 2, "SelectSMRDImm32">;			def SMRDImm32 : ComplexPattern<i64, 2, "SelectSMRDImm32">;
	def SMRDSgpr : ComplexPattern<i64, 2, "SelectSMRDSgpr">;			def SMRDSgpr : ComplexPattern<i64, 2, "SelectSMRDSgpr">;
	def SMRDBufferImm : ComplexPattern<i32, 1, "SelectSMRDBufferImm">;			def SMRDBufferImm : ComplexPattern<i32, 1, "SelectSMRDBufferImm">;
	def SMRDBufferImm32 : ComplexPattern<i32, 1, "SelectSMRDBufferImm32">;			def SMRDBufferImm32 : ComplexPattern<i32, 1, "SelectSMRDBufferImm32">;
	def SMRDBufferSgpr : ComplexPattern<i32, 1, "SelectSMRDBufferSgpr">;			def SMRDBufferSgpr : ComplexPattern<i32, 1, "SelectSMRDBufferSgpr">;
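
Note on the updated smrd_load predicate: its boolean structure can be easier to follow outside TableGen. The following is a minimal standalone sketch, not the patch's code. `AddrSpace`, `LoadInfo` and `canSelectScalarLoad` are hypothetical stand-ins introduced only for illustration; the fields model the results of `Ld->getAlignment()`, `Ld->getAddressSpace()`, `isMemOpUniform(N)` and `isMemOpHasNoClobberedMemOperand(N)`, and the `ScalarizeGlobal` parameter models `Subtarget->getScalarizeGlobalBehavior()`.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real LLVM types; only the values the
// predicate reads are modeled here.
enum class AddrSpace { Global, Constant, Other };

struct LoadInfo {
  unsigned Alignment; // models Ld->getAlignment()
  AddrSpace AS;       // models Ld->getAddressSpace()
  bool Uniform;       // models isMemOpUniform(N)
  bool NoClobber;     // models isMemOpHasNoClobberedMemOperand(N)
};

// Same truth table as the new smrd_load condition:
//   align >= 4 &&
//   ((constant && uniform) ||
//    (ScalarizeGlobal && global && uniform && noclobber))
bool canSelectScalarLoad(const LoadInfo &Ld, bool ScalarizeGlobal) {
  if (Ld.Alignment < 4)
    return false;
  if (!Ld.Uniform)
    return false; // both branches of the disjunction require uniformity
  if (Ld.AS == AddrSpace::Constant)
    return true;
  // Global loads additionally need the option enabled and the
  // no-clobber property on the memory operand.
  return ScalarizeGlobal && Ld.AS == AddrSpace::Global && Ld.NoClobber;
}
```

Under these assumptions, a uniform global load with the no-clobber property is accepted only when the flag is on, while constant-address loads behave as they did before the change.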

test/CodeGen/AMDGPU/global_smrd.ll

This file was added.

				; RUN: llc -O2 -mtriple amdgcn--amdhsa -mcpu=fiji -amdgpu-scalarize-global-loads=true -verify-machineinstrs < %s | FileCheck %s

				rampitec: Please add -verify-machineinstrs
				; uniform loads
				; CHECK-LABEL: @uniform_load
				; CHECK: s_load_dwordx4
				; CHECK-NOT: flat_load_dword

				define amdgpu_kernel void @uniform_load(float addrspace(1)* %arg, float addrspace(1)* %arg1) {
				bb:
				%tmp2 = load float, float addrspace(1)* %arg, align 4, !tbaa !8
				%tmp3 = fadd float %tmp2, 0.000000e+00
				%tmp4 = getelementptr inbounds float, float addrspace(1)* %arg, i64 1
				%tmp5 = load float, float addrspace(1)* %tmp4, align 4, !tbaa !8
				%tmp6 = fadd float %tmp3, %tmp5
				%tmp7 = getelementptr inbounds float, float addrspace(1)* %arg, i64 2
				%tmp8 = load float, float addrspace(1)* %tmp7, align 4, !tbaa !8
				%tmp9 = fadd float %tmp6, %tmp8
				%tmp10 = getelementptr inbounds float, float addrspace(1)* %arg, i64 3
				%tmp11 = load float, float addrspace(1)* %tmp10, align 4, !tbaa !8
				%tmp12 = fadd float %tmp9, %tmp11
				%tmp13 = getelementptr inbounds float, float addrspace(1)* %arg1
				store float %tmp12, float addrspace(1)* %tmp13, align 4, !tbaa !8
				ret void
				}

				; non-uniform loads
				; CHECK-LABEL: @non-uniform_load
				; CHECK: flat_load_dword
				; CHECK-NOT: s_load_dwordx4

				define amdgpu_kernel void @non-uniform_load(float addrspace(1)* %arg, float addrspace(1)* %arg1) #0 {
				bb:
				%tmp = call i32 @llvm.amdgcn.workitem.id.x() #1
				%tmp2 = getelementptr inbounds float, float addrspace(1)* %arg, i32 %tmp
				%tmp3 = load float, float addrspace(1)* %tmp2, align 4, !tbaa !8
				%tmp4 = fadd float %tmp3, 0.000000e+00
				%tmp5 = add i32 %tmp, 1
				%tmp6 = getelementptr inbounds float, float addrspace(1)* %arg, i32 %tmp5
				%tmp7 = load float, float addrspace(1)* %tmp6, align 4, !tbaa !8
				%tmp8 = fadd float %tmp4, %tmp7
				%tmp9 = add i32 %tmp, 2
				%tmp10 = getelementptr inbounds float, float addrspace(1)* %arg, i32 %tmp9
				%tmp11 = load float, float addrspace(1)* %tmp10, align 4, !tbaa !8
				%tmp12 = fadd float %tmp8, %tmp11
				%tmp13 = add i32 %tmp, 3
				%tmp14 = getelementptr inbounds float, float addrspace(1)* %arg, i32 %tmp13
				%tmp15 = load float, float addrspace(1)* %tmp14, align 4, !tbaa !8
				%tmp16 = fadd float %tmp12, %tmp15
				%tmp17 = getelementptr inbounds float, float addrspace(1)* %arg1, i32 %tmp
				store float %tmp16, float addrspace(1)* %tmp17, align 4, !tbaa !8
				ret void
				}


				; uniform load dominated by no-alias store - scalarize
				; CHECK-LABEL: @no_memdep_alias_arg
				; CHECK: flat_store_dword
				; CHECK: s_load_dword [[SVAL:s[0-9]+]]
				; CHECK: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[SVAL]]
				; CHECK: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[VVAL]]

				define amdgpu_kernel void @no_memdep_alias_arg(i32 addrspace(1)* noalias %in, i32 addrspace(1)* %out0, i32 addrspace(1)* %out1) {
				store i32 0, i32 addrspace(1)* %out0
				%val = load i32, i32 addrspace(1)* %in
				store i32 %val, i32 addrspace(1)* %out1
				ret void
				}

				; uniform load dominated by alias store - vector
				; CHECK-LABEL: {{^}}memdep:
				; CHECK: flat_store_dword
				; CHECK: flat_load_dword [[VVAL:v[0-9]+]]
				; CHECK: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[VVAL]]
				define amdgpu_kernel void @memdep(i32 addrspace(1)* %in, i32 addrspace(1)* %out0, i32 addrspace(1)* %out1) {
				store i32 0, i32 addrspace(1)* %out0
				%val = load i32, i32 addrspace(1)* %in
				store i32 %val, i32 addrspace(1)* %out1
				ret void
				}

				; uniform load from global array
				; CHECK-LABEL: @global_array
				; CHECK: s_load_dwordx2 [[A_ADDR:s\[[0-9]+:[0-9]+\]]]
				; CHECK: s_load_dwordx2 [[A_ADDR1:s\[[0-9]+:[0-9]+\]]], [[A_ADDR]], 0x0
				; CHECK: s_load_dword [[SVAL:s[0-9]+]], [[A_ADDR1]], 0x0
				; CHECK: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[SVAL]]
				; CHECK: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[VVAL]]

				@A = common local_unnamed_addr addrspace(1) global i32 addrspace(1)* null, align 4

				define amdgpu_kernel void @global_array(i32 addrspace(1)* nocapture %out) {
				entry:
				%0 = load i32 addrspace(1)*, i32 addrspace(1)* addrspace(1)* @A, align 4
				%1 = load i32, i32 addrspace(1)* %0, align 4
				store i32 %1, i32 addrspace(1)* %out, align 4
				ret void
				}


				; uniform load from global array dominated by alias store
				; CHECK-LABEL: @global_array_alias_store
				; CHECK: flat_store_dword
				; CHECK: v_mov_b32_e32 v[[ADDR_LO:[0-9]+]], s{{[0-9]+}}
				; CHECK: v_mov_b32_e32 v[[ADDR_HI:[0-9]+]], s{{[0-9]+}}
				; CHECK: flat_load_dwordx2 [[A_ADDR:v\[[0-9]+:[0-9]+\]]], v{{\[}}[[ADDR_LO]]:[[ADDR_HI]]{{\]}}
				; CHECK: flat_load_dword [[VVAL:v[0-9]+]], [[A_ADDR]]
				; CHECK: flat_store_dword v[{{[0-9]+:[0-9]+}}], [[VVAL]]
				define amdgpu_kernel void @global_array_alias_store(i32 addrspace(1)* nocapture %out, i32 %n) {
				entry:
				%gep = getelementptr i32, i32 addrspace(1) * %out, i32 %n
				store i32 12, i32 addrspace(1) * %gep
				%0 = load i32 addrspace(1)*, i32 addrspace(1)* addrspace(1)* @A, align 4
				%1 = load i32, i32 addrspace(1)* %0, align 4
				store i32 %1, i32 addrspace(1)* %out, align 4
				ret void
				}


				declare i32 @llvm.amdgcn.workitem.id.x() #1

				attributes #1 = { nounwind readnone }

				!8 = !{!9, !9, i64 0}
				!9 = !{!"float", !10, i64 0}
				!10 = !{!"omnipotent char", !11, i64 0}
				!11 = !{!"Simple C/C++ TBAA"}

test/CodeGen/AMDGPU/global_smrd_cfg.ll

This file was added.

				; RUN: llc -O2 -mtriple amdgcn--amdhsa -mcpu=fiji -amdgpu-scalarize-global-loads=true -verify-machineinstrs < %s | FileCheck %s
				; CHECK-LABEL: %entry
				arsenm: You don't need the -O2 since that's the default. Can you also change the check prefixes to GCN, and also run instnamer (same for the other tests)
				; CHECK: s_load_dwordx2 s{{\[}}[[REG_IN_LO:[0-9]+]]:[[REG_IN_HI:[0-9]+]]{{\]}}, s[4:5], 0x0
				; CHECK: s_load_dwordx2 s{{\[}}[[REG_OUT_LO:[0-9]+]]:[[REG_OUT_HI:[0-9]+]]{{\]}}, s[4:5], 0x8
				; CHECK-LABEL: %for.body.preheader
				; CHECK-DAG: v_mov_b32_e32 v[[ADDR_IN_LO:[0-9]+]], s[[REG_IN_LO]]
				; CHECK-DAG: v_mov_b32_e32 v[[ADDR_IN_HI:[0-9]+]], s[[REG_IN_HI]]
				; CHECK-DAG: v_mov_b32_e32 v[[ADDR_OUT_LO:[0-9]+]], s[[REG_OUT_LO]]
				; CHECK-DAG: v_mov_b32_e32 v[[ADDR_OUT_HI:[0-9]+]], s[[REG_OUT_HI]]

				; #####################################################################

				; CHECK-LABEL: %for.body

				; Load from %in in a Loop body has alias store

				; CHECK: flat_load_dword

				; CHECK-LABEL: %if.then
				; CHECK: flat_store_dword v{{\[}}[[ADDR_OUT_LO]]:[[ADDR_OUT_HI]]{{\]}}

				; #####################################################################

				; CHECK-LABEL: %if.end

				; Load from %in has alias store in Loop

				; CHECK: flat_load_dword v{{[0-9]+}}, v{{\[}}[[ADDR_IN_LO]]:[[ADDR_IN_HI]]{{\]}}

				; #####################################################################

				; CHECK: v_readfirstlane_b32 s[[SREG_LO:[0-9]+]], v[[ADDR_OUT_LO]]
				; CHECK: v_readfirstlane_b32 s[[SREG_HI:[0-9]+]], v[[ADDR_OUT_HI]]

				; Load from %out has no-alias store in Loop - out[i+1] never alias out[i]

				; CHECK: s_load_dword s{{[0-9]+}}, s{{\[}}[[SREG_LO]]:[[SREG_HI]]{{\]}}, 0x4

				define amdgpu_kernel void @cfg(i32 addrspace(1)* nocapture readonly %in, i32 addrspace(1)* nocapture %out, i32 %n) {
				entry:
				%idxprom = sext i32 %n to i64
				%arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %idxprom
				%0 = load i32, i32 addrspace(1)* %arrayidx, align 4, !tbaa !7
				%cmp30 = icmp sgt i32 %0, 0
				br i1 %cmp30, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %if.end
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				%sum.0.lcssa = phi i32 [ 0, %entry ], [ %add11, %for.cond.cleanup.loopexit ]
				%arrayidx13 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 %idxprom
				store i32 %sum.0.lcssa, i32 addrspace(1)* %arrayidx13, align 4, !tbaa !7
				ret void

				for.body: ; preds = %if.end, %for.body.preheader
				%sum.032 = phi i32 [ %add11, %if.end ], [ 0, %for.body.preheader ]
				%i.031 = phi i32 [ %add, %if.end ], [ 0, %for.body.preheader ]
				%rem = srem i32 %i.031, %n
				%idxprom1 = sext i32 %rem to i64
				%arrayidx2 = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %idxprom1
				%1 = load i32, i32 addrspace(1)* %arrayidx2, align 4, !tbaa !7
				%cmp3 = icmp sgt i32 %1, 100
				%idxprom4 = sext i32 %i.031 to i64
				br i1 %cmp3, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%arrayidx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 %idxprom4
				store i32 0, i32 addrspace(1)* %arrayidx5, align 4, !tbaa !7
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%arrayidx7 = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %idxprom4
				%2 = load i32, i32 addrspace(1)* %arrayidx7, align 4, !tbaa !7
				%add = add nuw nsw i32 %i.031, 1
				%idxprom8 = sext i32 %add to i64
				%arrayidx9 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 %idxprom8
				%3 = load i32, i32 addrspace(1)* %arrayidx9, align 4, !tbaa !7
				%add10 = add i32 %2, %sum.032
				%add11 = add i32 %add10, %3
				%exitcond = icmp eq i32 %add, %0
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}

				!7 = !{!8, !8, i64 0}
				!8 = !{!"int", !9, i64 0}
				!9 = !{!"omnipotent char", !10, i64 0}
				!10 = !{!"Simple C/C++ TBAA"}