LLC can currently select a scalar load for uniform memory access based only on the read-only memory address space. This restriction originated from the fact that in hardware prior to VI the vector and scalar caches are not coherent. With MemoryDependenceAnalysis we can check that the memory location corresponding to the memory operand of the load is not clobbered along any path from the function entry.
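As a sketch of the kind of case this enables (my own example, not taken from the patch's tests): a uniform load through a global (addrspace(1)) kernel-argument pointer that is never stored to inside the kernel could be selected as a scalar load (s_load_dword) instead of a flat/global vector load:

```llvm
define amdgpu_kernel void @uniform_load(i32 addrspace(1)* %in, i32 addrspace(1)* %out) {
  ; %in is uniform across the wave, and no store in this kernel clobbers it,
  ; so the load may be selected as s_load_dword rather than flat_load_dword.
  %v = load i32, i32 addrspace(1)* %in
  store i32 %v, i32 addrspace(1)* %out
  ret void
}
```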
Diff Detail
Repository: rL LLVM
Event Timeline
Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493
This is not valid to run on just any function. The value must not be written to memory anywhere starting from the kernel entry. We currently inline everything, but once we have calls this will be an error.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
94 ↗ (On Diff #78729): What is wrong with Def?

lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): I see it already exists in SITargetLowering::isMemOpUniform(), but we need a better way to identify a kernarg than UndefValue, especially because later the very same logic will be used for users' non-inlined functions.
536 ↗ (On Diff #78729): I do not follow the logic. Why are GlobalValue and Constant pointers always considered no-clobber?

lib/Target/AMDGPU/SMInstructions.td
226 ↗ (On Diff #78729): It also has to be uniform.

test/CodeGen/AMDGPU/global_smrd.ll
1 ↗ (On Diff #78729): Please add -verify-machineinstrs
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
94 ↗ (On Diff #78729): Since I could not come up with an example where the instruction that defines the pointer may be of vector (non-uniform) kind, I just deleted isDef() as you suggested.

lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): I don't understand this either. From my observation a kernarg is always an llvm::Argument. The code itself is obviously copied from SITargetLowering::isMemOpUniform(). BTW, what check would you suggest for specifically kernel arguments?
536 ↗ (On Diff #78729): This is not "logic", just copy-paste. :)
I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.
Also, this needs more tests. You can borrow the ones from the patch I mentioned earlier.
Tom, your patch is cool. The only thing I don't like about it is that you have to change the address space of "not-clobberable" pointers. I cannot take into account all possible passes that may (or may not) rely on the correct (unchanged) address space. Let's imagine that somebody later invents a very cool optimization that is legal for distinctly read-only memory but not legal for global memory (even not-clobberable global memory).
The problem with this patch is that I have to change a huge amount of tests.
I looked into several of the failed lit tests.
The reason is as follows:
- Most of the tests are intended to be as simple as possible; that is why they don't use a divergent intrinsic unless it is necessary for the test. As a result they use uniform loads to retrieve the data.
- Any arithmetic instructions taking this uniform data as operands become uniform as well. Since we use ISel to deduce the scalar/vector form of an operation, we end up with most of the instruction flow scalar.
For example, this simple input:

%b_ptr = getelementptr <2 x i32>, <2 x i32> addrspace(1)* %in, i32 1
%a = load <2 x i32>, <2 x i32> addrspace(1)* %in
%b = load <2 x i32>, <2 x i32> addrspace(1)* %b_ptr
%result = and <2 x i32> %a, %b
store <2 x i32> %result, <2 x i32> addrspace(1)* %out
will produce mostly scalar flow:
s_load_dwordx2 s[2:3], s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
s_nop 0
s_waitcnt lgkmcnt(0)
s_load_dwordx2 s[4:5], s[2:3], 0x0
v_mov_b32_e32 v3, s1
s_load_dwordx2 s[2:3], s[2:3], 0x8
v_mov_b32_e32 v2, s0
s_waitcnt lgkmcnt(0)
s_and_b32 s3, s5, s3
s_and_b32 s2, s4, s2
v_mov_b32_e32 v0, s2
v_mov_b32_e32 v1, s3
flat_store_dwordx2 v[2:3], v[0:1]
> I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.
I just meant that the latter adds information to IR but the former loses information from IR.
The problem with an address space cast from global to constant is that it is against the memory model. We have adopted the HSA memory model, and constant does not alias with global. In fact, it does not even alias with flat.
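For illustration (my own sketch, assuming the amdgcn address-space numbering of the time, where constant was addrspace(2)): after such a cast, alias analysis would be entitled to assume the two accesses below cannot alias and reorder them freely, even though they touch the same memory.

```llvm
; Casting a global pointer to constant violates the adopted HSA model:
; constant is assumed not to alias global (or even flat), so AA may
; freely reorder the store relative to the load through %c.
%c = addrspacecast i32 addrspace(1)* %p to i32 addrspace(2)*
store i32 1, i32 addrspace(1)* %p
%v = load i32, i32 addrspace(2)* %c
```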
This seems right to me, but it should only run on kernel functions.
lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): Since it already exists, let's keep it for now. I was thinking about a PseudoSourceValue.
Ok, then I don't have any objections to this approach.
lib/Target/AMDGPU/SIISelLowering.cpp
536 ↗ (On Diff #78729): Do we actually need all this code here? Isn't it enough just to check for the metadata?
There is one serious drawback in my approach: metadata cannot be set on an Argument. So even a trivial example like "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need an Instruction (i.e. a GEP). So I'd have to transform
load i32, i32 addrspace(1)* %arg
to
%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep
to set "noclobber" on GEP.
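Concretely, the rewritten IR would look something like the following (the metadata spelling !amdgpu.noclobber is my assumption of what the pass attaches; the zero-index GEP exists purely to provide an Instruction to carry it):

```llvm
; Metadata cannot hang off the Argument %arg itself, so the pass inserts
; a no-op GEP (offset 0) and attaches the no-clobber marker to it.
%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0, !amdgpu.noclobber !0
%val = load i32, i32 addrspace(1)* %gep

!0 = !{}
```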
I do not think this is an issue. We used this approach in HSAIL for a long time and saw no problems. Moreover, with call support the same will be needed for the uniformity metadata as well.
This is an improved implementation of global memory scalarization. It checks whether the memory location is clobbered along the CFG up to the function boundary. This approach is restricted by the FunctionPass capabilities and is not allowed to go beyond the current function, so we cannot check accesses to module-level variables outside the function. That is why the analysis is restricted to kernels only, given that any function calls (when implemented) will be considered clobbers for any memory location.
This implementation relies on the existing divergence analysis and does not attempt to improve its results.
Global load scalarization is SWITCHED OFF by default.
To enable it, use the "-amdgpu-scalarize-global-loads=true" LLC option.
Further work is planned to improve the current implementation, namely:
- Support for constant expressions as pointer operands.
- Caching the results of the DFS along the CFG for clobbering memory accesses, to shorten the search path and improve compile time on large CFGs.
There is currently no DFS depth limit, since I have relevant experience from the HSAIL backend and did not observe serious compile-time impact even on large source files. If somebody feels it is necessary, I can add a depth limitation.
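To illustrate what the DFS over the CFG decides (a hypothetical example of my own, not from the patch's tests): a load is a no-clobber candidate only if no store to its location reaches it along any path from the kernel entry.

```llvm
define amdgpu_kernel void @example(i32 addrspace(1)* %in, i32 addrspace(1)* %out, i1 %cond) {
entry:
  ; No store can reach this load from the entry: a no-clobber candidate,
  ; eligible for scalar (s_load) selection.
  %a = load i32, i32 addrspace(1)* %in
  br i1 %cond, label %then, label %merge

then:
  store i32 0, i32 addrspace(1)* %in   ; clobbers %in on this path
  br label %merge

merge:
  ; A store to %in reaches this load along one path, so the DFS reports a
  ; clobber and the load must stay a vector (flat/global) load.
  %b = load i32, i32 addrspace(1)* %in
  %sum = add i32 %a, %b
  store i32 %sum, i32 addrspace(1)* %out
  ret void
}
```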
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
25 ↗ (On Diff #80098): Includes should be alphabetically sorted.
81 ↗ (On Diff #80098): Could you please move the brace in line with the for loop, or remove the braces?
95 ↗ (On Diff #80098): Could you please capitalize the names "checklist" and "load"? Same for other variables.
99 ↗ (On Diff #80098): It would be nice to have the spacing around "*" consistent.
112 ↗ (On Diff #80098): Can you avoid const_cast and use const_iterator?
115 ↗ (On Diff #80098): Inconsistent indent.
143 ↗ (On Diff #80098): Inconsistent indent.
149 ↗ (On Diff #80098): I would suggest checking for kernel only once, in runOnFunction.
154 ↗ (On Diff #80098): I suppose this condition can never happen, as you have replaced all uses of this Ptr below in the else block.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
154 ↗ (On Diff #80098): Please disregard this comment, the code is indeed reachable.
Style fixed
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
25 ↗ (On Diff #80098): Which symbol of the full file path should be used as the sorting key?
112 ↗ (On Diff #80098): As you probably know, there is no consistent strategy regarding "const" in LLVM. MemoryDependenceAnalysis::getSimplePointerDependencyFrom, as well as other MDA interface methods, accepts a non-const iterator even though it does not change anything. That's why the only way is to make all the parameters and methods in the whole call stack non-const. I removed the "const" modifier everywhere along the code; no const_casts any longer.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183 ↗ (On Diff #80260): Really? What about loads from read-only memory? Aren't they still valid even in a non-kernel function?
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183 ↗ (On Diff #80260): OK, I see your point. But then check for isKernelFunc first in the condition, before doing the expensive DFS.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
20–21 ↗ (On Diff #80302): Alphabetize
37 ↗ (On Diff #80302): *LI
88–93 ↗ (On Diff #80302): C++ style comments

lib/Target/AMDGPU/SIISelLowering.cpp
2620 ↗ (On Diff #80302): Previous line

test/CodeGen/AMDGPU/global_smrd_cfg.ll
1 ↗ (On Diff #80302): You don't need the -O2 since that's the default. Can you also change the check prefixes to GCN, and also run instnamer (same for the other tests).