[AMDGPU] Promote constant offset to the immediate by finding a new base with a 13-bit constant offset from nearby instructions.
ClosedPublic

Authored by FarhanaAleen on Dec 10 2018, 6:20 PM.

Details

Summary

Promote constant offsets to immediates by recomputing the relative 13-bit offset from nearby instructions.

E.g.
s_movk_i32 s0, 0x1800
v_add_co_u32_e32 v0, vcc, s0, v2
v_addc_co_u32_e32 v1, vcc, 0, v6, vcc

s_movk_i32 s0, 0x1000
v_add_co_u32_e32 v5, vcc, s0, v2
v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
global_load_dwordx2 v[5:6], v[5:6], off
global_load_dwordx2 v[0:1], v[0:1], off
=>
s_movk_i32 s0, 0x1000
v_add_co_u32_e32 v5, vcc, s0, v2
v_addc_co_u32_e32 v6, vcc, 0, v6, vcc
global_load_dwordx2 v[5:6], v[5:6], off
global_load_dwordx2 v[0:1], v[5:6], off offset:2048
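To make the arithmetic in the example concrete: the two bases differ by 0x1800 - 0x1000 = 2048, which fits in the signed 13-bit immediate field of GFX9 global instructions, so the second load can reuse the first base with offset:2048. A minimal illustrative sketch in Python (not the actual implementation, which lives in SILoadStoreOptimizer.cpp):

```python
# Illustrative sketch of the rebasing idea from the summary,
# not the actual LLVM pass.

def fits_imm13(offset):
    # GFX9 global/flat instructions take a signed 13-bit immediate offset.
    return -4096 <= offset <= 4095

def rebase(base_a, base_b):
    # If the distance between two base addresses fits in the immediate
    # field, the load at base_a can reuse base_b plus a constant offset.
    delta = base_a - base_b
    return delta if fits_imm13(delta) else None

# The example from the summary: bases 0x1800 and 0x1000.
print(rebase(0x1800, 0x1000))  # 2048 -> becomes "offset:2048"
print(rebase(0x5000, 0x1000))  # None -> distance too large to fold
```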

Diff Detail

Repository
rL LLVM

Event Timeline

FarhanaAleen created this revision.Dec 10 2018, 6:20 PM

Why aren't these matched in the first place? These shouldn't have gotten this far

The offsets are promoted to immediates during instruction selection if they fit in 13 bits, which is the allowed size for globals. This patch tries to find a 13-bit offset from the base address of nearby instructions by recomputing the relative offset from a nearby base address. So it is more of a global optimization, where it needs to traverse the whole program (currently it is limited to the basic block). The optimization also caches all the information to save compile time. Are you suggesting doing this optimization during the instruction selection phase? Wouldn't it be too complex to do during instruction selection?

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

I think even if the same is implemented at selection, there will always be cases only visible that late. I.e., both may be needed.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
986 ↗(On Diff #177643)

OffsetOp != ...

1007 ↗(On Diff #177643)

Do you expect it to be negative? If so this should not be safe.

1073 ↗(On Diff #177643)

Can you avoid explicit dynamic allocation?

1088 ↗(On Diff #177643)

As far as I can see they can be signed int<12> as well. Also, this can be target dependent. I suggest using SITargetLowering::isLegalGlobalAddressingMode().

1109 ↗(On Diff #177643)

Same here.

promote-constOffset-to-imm.ll
1 ↗(On Diff #177643)

I think we will need to regenerate these checks too often. Can you factor out just specific checks?

Can we have a mir test with more than two loads? I want to see a situation where 3 loads are foldable with the same offset, but lowest address is in the middle. I.e.:

load a[1000];
load a[800];
load a[1800];

Here we could compute &a[800] and use it as a base for all three. If we start with &a[1000] that will not be possible. It is interesting to see whether the number of address calculations actually goes down in this case, or at least does not increase.

The other interesting situation is when a negative offset can be used:

load a[1000];
load a[3000];
load a[5000];

In general we can set the base to &a[3000] and use a negative offset for the first load. I think your pass does not handle it, but it is worth a test to see the ISA.
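For the negative-offset scenario, the arithmetic can be checked with a small sketch. Assuming 2-byte elements so that all the distances fit (a hypothetical choice made purely for illustration), anchoring at &a[3000] gives immediates of -4000, 0, and 4000:

```python
# Illustrative check of the negative-offset case above; the element
# size is an assumption chosen so the distances fit the field.

def imm_ok(d):
    # Signed 13-bit immediate range for GFX9 global instructions.
    return -4096 <= d <= 4095

ELT = 2                               # assumed element size in bytes
base = 3000 * ELT                     # &a[3000] as the common base
offsets = {i: i * ELT - base for i in (1000, 3000, 5000)}
print(offsets)                        # {1000: -4000, 3000: 0, 5000: 4000}
print(all(imm_ok(d) for d in offsets.values()))  # True
```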

FarhanaAleen retitled this revision from [AMDGPU] Promote offset to immediate. to [AMDGPU] Promote constant offset to the immediate by finding a new base with 13bit constant offset from the nearby instructions. .Dec 12 2018, 2:36 PM
FarhanaAleen marked an inline comment as done.

Updated with the reviewer's comments.

Can we have a mir test with more than two loads? I want to see a situation where 3 loads are foldable with the same offset, but lowest address is in the middle. I.e.:

Yes, I added two more mir tests called LowestInMiddle and NegativeDistance.

load a[1000];
load a[800];
load a[1800]; << I changed this to 1400 in order to fit the 13-bit immediate.

Here we could compute &a[800] and use as a base for all three. If we start with &a[1000] that will not be possible. It is interesting to see if a number of address calculations will actually go down in this case, or at least will not increase.

In the best case, we should consider &a[1000] as Anchor and compute the offset. So, it should generate:
%32:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, -1600, 0, 0, implicit $exec :: (load 8 from %ir.addr1, addrspace 1)
%37:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, 0, 0, 0, implicit $exec :: (load 8 from %ir.addr2, addrspace 1)
%42:vreg_64 = GLOBAL_LOAD_DWORDX2 %138:vreg_64, 3200, 0, 0, implicit $exec :: (load 8 from %ir.addr3, addrspace 1)

But the current heuristic traverses on a first-come, first-served basis. Starting from a[1000], it looks for the most distant base. It will consider &a[800] as the anchor and compute &a[1000] from the anchor, but will miss &a[1400] because the distance between a[800] and a[1400] does not fit in a 13-bit immediate. The heuristic will never increase the instruction count, though.
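The anchor search described above can be sketched roughly as follows. The helper name and the byte offsets (8-byte elements, so a[1000] is 8000 and a[800] is 6400, matching the MIR above) are illustrative, not the pass's actual code:

```python
# Rough sketch of the anchor-selection heuristic; hypothetical helper,
# not the SILoadStoreOptimizer implementation.

IMM13_MIN, IMM13_MAX = -4096, 4095

def find_anchor(cur, seen):
    # From the current load's byte offset, pick the already-seen base
    # that is farthest away while still reachable with the signed
    # 13-bit immediate.
    best = None
    for b in seen:
        d = cur - b
        if IMM13_MIN <= d <= IMM13_MAX and (
                best is None or abs(d) > abs(cur - best)):
            best = b
    return best

print(find_anchor(8000, [6400]))  # 6400 -> &a[800] becomes the anchor
print(find_anchor(8000, [100]))   # None -> nothing within reach
```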

The other interesting situation is when a negative offset can be used:

load a[1000];
load a[3000];
load a[5000];

In general we can set base to &a[3000] and use negative offset for the first load. I think your pass does not handle it, but it is worth a test to see an ISA.

The algorithm does handle negative distances. I added a mir test called NegativeDistance; I had to modify the index count in order to keep the distances within the 13-bit range.

OK, so this is the case where the offset > 13-bits and you are folding the extra low bits. We already do something similar during selection and put the low bits into the immediate offsets and materialize the high bits as a separate constant for some other loads (e.g. see SelectMUBUFScratchOffen). Usually the common high bits get CSEd in the DAG directly or during MachineCSE. Do you get the same result if you extend that to global/flat operations for this?

This is a good approach and has very low overhead. But without information about the available bases, this approach will generate some high bits that won't get CSEd.

Also can you reword the commit message to make it clearer what the difference is from the current offset matching?

Yes, changed it.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
986 ↗(On Diff #177643)

We need to store the return value of extractConstOffset, which is used later, and also check whether extractConstOffset returns null. But if the check is done the suggested way, the return value will not be stored.

rampitec added inline comments.Dec 12 2018, 4:30 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
1102 ↗(On Diff #177965)

80 chars per line.

1112 ↗(On Diff #177965)

80 chars per line.

test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
1 ↗(On Diff #177965)

We usually use -check-prefixes=GCN,GFX8 and then use a common GCN-LABEL.

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
219–222 ↗(On Diff #177965)

This test can be reduced a lot. For example, you can use -run-pass=none to compact the register numbers.

230 ↗(On Diff #177965)

It would be better to avoid a call in a test that doesn't need to specifically test calls.

Maintained 80 chars per line, added GCN-LABEL, reduced mir tests.

FarhanaAleen marked an inline comment as done.Dec 13 2018, 3:29 PM

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

Sure, we can implement the other approach. But it will take some time.

And I am not seeing any reason why this one needs to wait on the other one. If the benefit of this approach is not clear to you, I can give you an example extracted from a real-world application. This approach takes available base computations into account and can therefore optimize away more computations.

load1 = load(&a + 4096)
load2 = load(&a + 6144)
load3 = load(&a + 8192)
load4 = load(&a + 10240)
load5 = load(&a + 12288)
load6 = load(&a + 14336)
load7 = load(&a + 16384)
load8 = load(&a + 18432)

The other approach will generate:
load1 = load(t1: &a + 4096, 0)
load2 = load(t1, 2048)

load3 = load(t2: &a + 8192, 0)
load4 = load(t2, 2048)

load5 = load(t3: &a + 12288, 0)
load6 = load(t3, 2048)

load7 = load(t4: &a + 16384, 0)
load8 = load(t4, 2048)

This patch will generate:
load1 = load(t1: &a + 8192, -4096)
load2 = load(t1, -2048)
load3 = load(t1, 0)
load4 = load(t1, 2048)

load5 = load(t2: &a + 16384, -4096)
load6 = load(t2, -2048)
load7 = load(t2, 0)
load8 = load(t2, 2048)
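The grouping shown above can be reproduced with a small sketch: a new base is only needed when a load's distance from the group's first load exceeds the reach of the signed 13-bit window. This illustrates the example's arithmetic, not the pass's actual algorithm:

```python
# Illustrative grouping sketch: anchoring a base mid-group lets every
# load within an 8191-byte span (the full signed 13-bit window) share it.

IMM13_MIN, IMM13_MAX = -4096, 4095

def group_by_anchor(offsets):
    # Greedily grow a group while each member stays within the reach of
    # a common anchor; offsets must be sorted ascending.
    groups, cur = [], [offsets[0]]
    for off in offsets[1:]:
        if off - cur[0] <= IMM13_MAX - IMM13_MIN:
            cur.append(off)
        else:
            groups.append(cur)
            cur = [off]
    groups.append(cur)
    return groups

offs = [4096, 6144, 8192, 10240, 12288, 14336, 16384, 18432]
print(group_by_anchor(offs))
# [[4096, 6144, 8192, 10240], [12288, 14336, 16384, 18432]]
```

With the offsets from the example this yields two groups of four loads each, matching the two base computations t1 and t2 above, versus four bases for the pairwise approach.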

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

Removing the call does not generate the expected IR.

rampitec added inline comments.Dec 13 2018, 3:53 PM
test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

Call to get_global_id shall never survive until MI. You probably need to redesign the test.

Removed calls from mir tests.

rampitec added inline comments.Dec 13 2018, 10:47 PM
test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
79 ↗(On Diff #178182)

Do you still need the stack adjustment? This is just a mir test running a single pass; it does not have to be complete. Even the store at the end is not needed, only the relevant instructions.

Removed adjust stack and stores.

This revision is now accepted and ready to land.Dec 13 2018, 11:13 PM

Can you try implementing the other approach first, and then applying this on top of it to show the difference more clearly?

I mean if you implement the simpler version first, you'll see the improvements in the test diff from the more complex version.

test/CodeGen/AMDGPU/promote-constOffset-to-imm.mir
230 ↗(On Diff #177965)

You only care about loads and the addressing code, so you can construct a case that doesn't have a call.

This revision was automatically updated to reflect the committed changes.