This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

llvm/lib/Target/AMDGPU/
AMDGPUISelDAGToDAG.cpp	2/2
AMDGPUSubtarget.h
AMDGPUSubtarget.cpp
FLATInstructions.td	1/1
SIFoldOperands.cpp
SIFrameLowering.cpp	4/4
SIRegisterInfo.h
SIRegisterInfo.cpp	4/5

llvm/test/CodeGen/AMDGPU/
call-preserved-registers.ll
callee-frame-setup.ll
chain-hi-to-lo.ll
fast-unaligned-load-store.private.ll
flat-scratch.ll
frame-index-elimination.ll
load-hi16.ll
load-lo16.ll
local-stack-alloc-block-sp-reference.ll
memcpy-fixed-align.ll
non-entry-alloca.ll
scratch-simple.ll
stack-pointer-offset-relative-frameindex.ll
store-hi16.ll

Differential D89170

[AMDGPU] Use flat scratch instructions where available
ClosedPublic

Authored by rampitec on Oct 9 2020, 4:20 PM.

Details

Reviewers

arsenm
sebastian-ne
Flakebi

Commits

rG038d884a50a4: [AMDGPU] Use flat scratch instructions where available

Summary

Flat scratch support is disabled by default. So far it covers instruction
selection, spilling, and frame elimination. It also changes the SP
from unswizzled to swizzled, as used by flat scratch instructions,
so it cannot be mixed with MUBUF stack access.

At the very least, the following is missing:

GlobalISel;
Some optimizations in frame elimination between vector and scalar ALU;
It should eventually allow always materializing a frame index as an SGPR, but that is not implemented and frame elimination cannot handle it yet;
Unaligned and/or multi-dword flat scratch accesses should work, but they are currently legalized as for MUBUF;
Operand folding cannot yet optimize FI as it does with MUBUF;
The value of the SP/FP in the DWARF expression will need to be scaled to recover the unswizzled scratch address.

Event Timeline

rampitec created this revision. Oct 9 2020, 4:20 PM

Herald added a project: Restricted Project. Oct 9 2020, 4:20 PM

Herald added subscribers: kerbowa, arphaman, hiraditya and 7 others.

rampitec requested review of this revision. Oct 9 2020, 4:20 PM

Herald added a subscriber: wdng. Oct 9 2020, 4:20 PM

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.
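
The scaling mentioned in this comment can be sketched as follows. This is only an illustration under a stated assumption that the review itself does not spell out: that a swizzled (per-lane) stack offset and the corresponding unswizzled (per-wave) offset differ by a factor of the wavefront size. The helper names `unswizzle` and `swizzle` are hypothetical and are not LLVM API.

```cpp
#include <cstdint>

// Hypothetical helpers, not part of LLVM.
// Assumption: unswizzled byte offset = swizzled (per-lane) byte offset
// multiplied by the wavefront size.
constexpr uint64_t unswizzle(uint64_t SwizzledOffset, unsigned WaveSize) {
  return SwizzledOffset * WaveSize;
}

// Inverse direction: recover the per-lane offset from a per-wave offset.
constexpr uint64_t swizzle(uint64_t UnswizzledOffset, unsigned WaveSize) {
  return UnswizzledOffset / WaveSize;
}

// A 16-byte per-lane offset in a 64-lane wave maps to 1024 bytes per wave.
static_assert(unswizzle(16, 64) == 1024, "scaling by the wave size");
static_assert(swizzle(1024, 64) == 16, "inverse scaling");
```

Under this assumption, a debugger-facing expression would multiply a swizzled SP/FP value by the wave size before treating it as an unswizzled scratch address.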

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
368	s/Scracth/Scratch/

In D89170#2325299, @scott.linder wrote:

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Thanks! I have fixed the typo and added DWARF update to the commit message.

In D89170#2325299, @scott.linder wrote:

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Is this change changing the call convention ABI? For example, making the SP be a swizzled address as opposed to an unswizzled address? If so then AMDGPUUsage will also need updating.

In D89170#2325862, @t-tye wrote:

In D89170#2325299, @scott.linder wrote:

I haven't done a meaningful review, but I wanted to note that this will require changes to the debug information (which isn't committed yet). I think this could be as simple as scaling the value of the SP/FP in the DWARF expression to recover the unswizzled scratch address.

Is this change changing the call convention ABI? For example, making the SP be a swizzled address as opposed to an unswizzled address? If so then AMDGPUUsage will also need updating.

Yes it does. However, it is a little premature to update the documentation. This is WIP, disabled by default and more or less not working at least until spilling is implemented. When it is at least working we can consider documenting it. Documenting it earlier just gives an impression there is an option to use it.

Fixed typo in test check.

rampitec added a child revision: D89424: [AMDGPU] Spilling using flat scratch. Oct 14 2020, 1:35 PM

Testing showed a couple of problems:

Debug tablegen asserts with this.
Using the null register in flat scratch does not work; it needs the new ST addressing mode of GFX10. I will create a separate patch to support ST mode.

Fixed operand order in store pattern.

Still needs ST mode.

This will change the ABI, so I don't think it belongs as a subtarget property

In D89170#2332955, @arsenm wrote:

This will change the ABI, so I don't think it belongs as a subtarget property

The ABI will in fact depend on the subtarget. We can only use it starting from gfx9, and even on gfx9 it might not be desirable. GFX10 is better in this respect.
Anyway, I need a subtarget to decide if we even have flat scratch instructions. So far this switch is experimental, but if you have an idea for a better placement, please tell me.
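
The gating described in this reply can be sketched as a small model. Everything here is a simplified stand-in: `SubtargetModel`, `Generation`, and the field names are hypothetical, and the "gfx9 or later" rule is taken from the comment above rather than from the LLVM sources.

```cpp
// Hypothetical, simplified model of the subtarget query discussed above.
enum class Generation { GFX8, GFX9, GFX10 };

struct SubtargetModel {
  Generation Gen;
  bool EnableFlatScratchOption; // stands in for the experimental switch

  // Assumption from the comment: flat scratch instructions exist from gfx9 on.
  constexpr bool hasFlatScratchInsts() const { return Gen >= Generation::GFX9; }

  // Flat scratch is used only if it is requested and the hardware has it.
  constexpr bool enableFlatScratch() const {
    return EnableFlatScratchOption && hasFlatScratchInsts();
  }
};

static_assert(SubtargetModel{Generation::GFX10, true}.enableFlatScratch(), "");
static_assert(!SubtargetModel{Generation::GFX8, true}.enableFlatScratch(), "");
static_assert(!SubtargetModel{Generation::GFX9, false}.enableFlatScratch(), "");
```

The point of the sketch is that the switch alone never enables flat scratch on hardware without the instructions, which is why the query needs subtarget information.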

Use ST mode on GFX10 instead of NULL register.

rampitec added a parent revision: D89501: [AMDGPU] flat scratch ST addressing mode on gfx10. Oct 15 2020, 4:26 PM

Flakebi added a subscriber: Flakebi. Oct 16 2020, 3:57 AM

Flakebi added inline comments.

llvm/lib/Target/AMDGPU/FLATInstructions.td

849

Should this be called ScratchLoadSignedPat_D16?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

1431–1436

I get a failing assert here with NewOpc = 4294967295:

llvm/include/llvm/MC/MCInstrInfo.h:63: const llvm::MCInstrDesc &llvm::MCInstrInfo::get(unsigned int) const: Assertion `Opcode < NumOpcodes && "Invalid opcode!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: compiler/llpc/amdllpc -gfxip=10.1 -amdgpu-enable-flat-scratch /pipelines/PipelineVsFs_0x1BEFB7D1A235B4F6.pipe -verify-machineinstrs
1.      Running pass 'CallGraph Pass Manager' on module 'lgcPipeline'.
2.      Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@_amdgpu_ps_main'
 #0 0x00000000023f0db1 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /llvm/lib/Support/Unix/Signals.inc:563:13
 #1 0x00000000023ef060 llvm::sys::RunSignalHandlers() /llvm/lib/Support/Signals.cpp:72:18
 #2 0x00000000023f1152 SignalHandler(int) /llvm/lib/Support/Unix/Signals.inc:0:3
 #3 0x00007fadd6ebfee0 __restore_rt (/glibc-2.31/lib/libpthread.so.0+0x12ee0)
 #4 0x00007fadd6d0c08a raise (/glibc-2.31/lib/libc.so.6+0x3808a)
 #5 0x00007fadd6cf6528 abort (/glibc-2.31/lib/libc.so.6+0x22528)
 #6 0x00007fadd6cf640f _nl_load_domain.cold.0 (/glibc-2.31/lib/libc.so.6+0x2240f)
 #7 0x00007fadd6d04a02 (/glibc-2.31/lib/libc.so.6+0x30a02)
 #8 0x0000000001a03170 llvm::SIRegisterInfo::eliminateFrameIndex(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, int, unsigned int, llvm::RegScavenger*) const /llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1465:11
 #9 0x000000000214e0f3 (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineBasicBlock*, llvm::MachineFunction&, int&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:0:11
#10 0x000000000214caef llvm::MachineBasicBlock::getNumber() const /llvm/include/llvm/CodeGen/MachineBasicBlock.h:904:34
#11 0x000000000214caef (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:1161:17
#12 0x000000000214caef (anonymous namespace)::PEI::runOnMachineFunction(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:269:3
#13 0x0000000002031e7e llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /llvm/lib/CodeGen/MachineFunctionPass.cpp:0:13
#14 0x0000000003136a85 llvm::FPPassManager::runOnFunction(llvm::Function&) /llvm/lib/IR/LegacyPassManager.cpp:1519:27
#15 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass*, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:178:25
#16 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:476:9
#17 0x0000000001c76b38 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:541:18
#18 0x0000000003137149 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:0:27
#19 0x0000000003137149 llvm::legacy::PassManagerImpl::run(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:615:44
…

Renamed pattern.

rampitec added inline comments. Oct 16 2020, 10:10 AM

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1431–1436	I cannot reproduce this. Keep in mind that D89424 is not updated to use ST mode yet, so they do not work together yet.

Rebased to parent.

Correct rebase patch.

arsenm added inline comments. Oct 19 2020, 3:30 PM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1691	Swap these?
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1464	What happens if this needs an SGPR spill?

rampitec updated this revision to Diff 299210. Oct 19 2020, 3:55 PM

rampitec marked an inline comment as done.

rampitec added inline comments.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
1464	If it can scavenge a register, it should be fine, as the offset will not change. If not, I guess I would need to adjust SP and revert it. I have added a FIXME here.

Flakebi added inline comments. Oct 20 2020, 7:59 AM

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
482	Should this be `MFI->hasFlatScratchInit() || (ST.enableFlatScratch() && requiresStackPointerReference(MF))`? Otherwise, the scratch does not get initialized (I guess it’s fine to do that in a later patch).

rampitec added inline comments. Oct 20 2020, 10:51 AM

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
482	Do you see it not initialized? In the SIMachineFunctionInfo() there is this code:

  if (ST.hasFlatAddressSpace() && isEntryFunction() && isAmdHsaOrMesa) {
    // TODO: This could be refined a lot. The attribute is a poor way of
    // detecting calls or stack objects that may require it before argument
    // lowering.
    if (HasCalls || HasStackObjects)
      FlatScratchInit = true;
  }

So I assume it has to be initialized. Probably the culprit is this isAmdHsaOrMesa condition? It may be needed to say (isAmdHsaOrMesa || ST.enableFlatScratch()) instead. For some reason this code is not executed for amdpal, I do not see an obvious reason why.

rampitec added inline comments. Oct 20 2020, 11:22 AM

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp
482	Actually I see it uninitialized in my own test. But the code as suggested does not always work because requiresStackPointerReference() is not necessarily true if we have just private loads, we need to make sure hasFlatScratchInit() is set.

Ensure flat scratch initialization;
Added asserts around scavenger calls until there is a better handling of failed scavenging;

rampitec removed a child revision: D89424: [AMDGPU] Spilling using flat scratch. Oct 21 2020, 3:28 PM

Integrated spilling from child revision, child is dropped;
Fixed situation when an SGPR has to be spilled while scavenging in frame elimination;

Herald added a subscriber: qcolombet. Oct 21 2020, 3:31 PM

I also came to the conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with a full 32-bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

In D89170#2345943, @rampitec wrote:

I also came to the conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with a full 32-bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

I was considering requiring an FP if the stack size was starting to hit the offset limit, but was unable to come up with a testcase where it would really break

In D89170#2347139, @arsenm wrote:

In D89170#2345943, @rampitec wrote:

I also came to the conclusion that the only robust way to have no failed scavenging during frame lowering is to always have an sp or fp. Otherwise it can fail regardless of the spilling method. The only other way is to have an instruction with a full 32-bit immediate offset. I.e. it can fail in a kernel with MUBUF as well.

I was considering requiring an FP if the stack size was starting to hit the offset limit, but was unable to come up with a testcase where it would really break

This sounds like a good idea. We can run into a situation when we can scavenge nothing at all, even if it is not easy to forge a testcase. That is more so with flat scratch until ST mode is available as you always need a register as a base. In fact in this scenario it may be needed even if potential offsets are small. Then we do not need buffer descriptor with flat scratch, so we are saving 4 SGPRs. It sounds fair to use one for the base pointer instead.

Fixed a need of SGPR spill during VGPR spilling on targets w/o flat scratch ST mode, reused existing code adjusting offsets.

Fixed issue with flat scratch not always being initialized. It was not initialized if we had no stack objects or calls, but later did spilling.
It is too late to insert system SGPRs at frame lowering, so initialize it always if flat scratch is used.
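
The initialization problem described here can be illustrated with a small model. The "before" predicate follows the SIMachineFunctionInfo condition quoted earlier in this review; the "after" predicate is only an assumed shape of the fix, combining the earlier suggestion of `(isAmdHsaOrMesa || ST.enableFlatScratch())` with the statement that flat scratch is always initialized when it is used. `FunctionModel` and both function names are hypothetical.

```cpp
// Hypothetical, simplified model; not the LLVM implementation.
struct FunctionModel {
  bool HasFlatAddressSpace;
  bool IsEntryFunction;
  bool IsAmdHsaOrMesa;
  bool HasCalls;
  bool HasStackObjects;
  bool EnableFlatScratch;
};

// Condition quoted in the review: initialize only when calls or stack
// objects are already known before argument lowering.
constexpr bool needsFlatScratchInitBefore(const FunctionModel &F) {
  return F.HasFlatAddressSpace && F.IsEntryFunction && F.IsAmdHsaOrMesa &&
         (F.HasCalls || F.HasStackObjects);
}

// Assumed shape of the fix: using flat scratch forces initialization.
constexpr bool needsFlatScratchInitAfter(const FunctionModel &F) {
  return F.HasFlatAddressSpace && F.IsEntryFunction &&
         (F.IsAmdHsaOrMesa || F.EnableFlatScratch) &&
         (F.HasCalls || F.HasStackObjects || F.EnableFlatScratch);
}

// An entry function with no calls or stack objects that spills later.
constexpr FunctionModel SpillOnlyKernel{true, true, false, false, false, true};
static_assert(!needsFlatScratchInitBefore(SpillOnlyKernel), "");
static_assert(needsFlatScratchInitAfter(SpillOnlyKernel), "");
```

The spill-only case is the one the comment describes: nothing requests initialization up front, yet spilling later needs the flat scratch base to be set up.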

arsenm added inline comments. Oct 23 2020, 8:35 AM

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
1569–1570	This should really be a pattern predicate. I was recently working on fixing these explicit subtarget checks in the complex patterns but didn't finish

arsenm added inline comments. Oct 23 2020, 8:39 AM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4308–4311 ↗	(On Diff #300113)	Unrelated change?
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
807–808	This looks backwards with the negated conditions

Corrected IsOffsetLegal to remove negation.

Moved predicates from complex patterns into td files.

rampitec added inline comments. Oct 23 2020, 10:56 AM

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4308–4311 ↗	(On Diff #300113)	It is related, we just never hit it before. I am probing a physical SGPR to see if it is legal. RC is SReg_32, but DRC for scratch instructions is SReg_32_XEXEC_HI and test fails.
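
The failure described in this reply can be illustrated with a toy model in which a register class is a bitmask of register ids. This is a sketch under assumptions: the ids and class contents are made up, the class names are borrowed from the comment only as labels, and the exclusion of EXEC_HI is inferred from the class name. It does not reproduce the actual isLegalRegOperand() change.

```cpp
#include <cstdint>

// Toy model: a register class is a bitmask of hypothetical register ids.
using RegMask = uint32_t;
constexpr unsigned SGPR4 = 0;   // illustrative id for an ordinary SGPR
constexpr unsigned EXEC_HI = 1; // illustrative id for EXEC_HI

constexpr RegMask bit(unsigned Reg) { return RegMask(1) << Reg; }

// Assumed contents: SReg_32 includes EXEC_HI, SReg_32_XEXEC_HI does not.
constexpr RegMask SReg_32 = bit(SGPR4) | bit(EXEC_HI);
constexpr RegMask SReg_32_XEXEC_HI = bit(SGPR4);

// Class-level test: every register of RC must also be in DRC.
constexpr bool isSubClassOrEqual(RegMask RC, RegMask DRC) {
  return (RC & ~DRC) == 0;
}

// Register-level test: is this particular physical register in DRC?
constexpr bool containsReg(RegMask DRC, unsigned Reg) {
  return (DRC & bit(Reg)) != 0;
}

// The class-level test rejects SReg_32, although SGPR4 itself is acceptable.
static_assert(!isSubClassOrEqual(SReg_32, SReg_32_XEXEC_HI), "");
static_assert(containsReg(SReg_32_XEXEC_HI, SGPR4), "");
static_assert(!containsReg(SReg_32_XEXEC_HI, EXEC_HI), "");
```

In this model, probing a specific physical register is best answered by membership in the operand's class, since comparing whole classes can reject registers that are individually legal.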

rampitec mentioned this in D90064: [AMDGPU] Fixed isLegalRegOperand() with physregs. Oct 23 2020, 11:13 AM

rampitec mentioned this in rG2e64ad949487: [AMDGPU] Fixed isLegalRegOperand() with physregs. Oct 23 2020, 11:33 AM

Rebased.

rampitec marked an inline comment as done. Oct 23 2020, 11:37 AM

Removed unrelated subtarget change.

Looks good to me.
I tested it with the amdvlk vulkan driver (needs a pal-specific patch) and a short Vulkan CTS test ran through fine (except for pal-related failures).

This revision is now accepted and ready to land. Oct 26 2020, 8:30 AM

rampitec requested review of this revision. Oct 26 2020, 2:31 PM

This revision was not accepted when it landed; it landed in state Needs Review. Oct 26 2020, 2:41 PM

Closed by commit rG038d884a50a4: [AMDGPU] Use flat scratch instructions where available (authored by rampitec).

This revision was automatically updated to reflect the committed changes.

rampitec added a commit: rG038d884a50a4: [AMDGPU] Use flat scratch instructions where available.

It looks like this broke the windows lldb bot:

http://lab.llvm.org:8011/#/builders/83/builds/336

In D89170#2354963, @stella.stamenova wrote:

It looks like this broke the windows lldb bot:

http://lab.llvm.org:8011/#/builders/83/builds/336

Oops! Release build in fact. Will fix soon.

rampitec mentioned this in rGd176e13ca553: Fixed release build after D89170. Oct 26 2020, 4:01 PM

In D89170#2354963, @stella.stamenova wrote:

It looks like this broke the windows lldb bot:

http://lab.llvm.org:8011/#/builders/83/builds/336

Fixed in https://reviews.llvm.org/rGd176e13ca55353c7ee8d4da23be6eae9f82a64e1

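Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp	143 lines
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h	6 lines
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp	9 lines
llvm/lib/Target/AMDGPU/FLATInstructions.td	120 lines
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp	2 lines
llvm/lib/Target/AMDGPU/SIFrameLowering.cpp	16 lines
llvm/lib/Target/AMDGPU/SIRegisterInfo.h	2 lines
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp	156 lines
llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll	10 lines
llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll	131 lines
llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll	210 lines
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll	88 lines
llvm/test/CodeGen/AMDGPU/flat-scratch.ll	1241 lines
llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll	99 lines
llvm/test/CodeGen/AMDGPU/load-hi16.ll	65 lines
llvm/test/CodeGen/AMDGPU/load-lo16.ll	445 lines
llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll	225 lines
llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll	73 lines
llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll	432 lines
llvm/test/CodeGen/AMDGPU/scratch-simple.ll	113 lines
llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll	121 lines
llvm/test/CodeGen/AMDGPU/store-hi16.ll	41 lines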

Diff 297963

llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp

@@ 234 lines not shown @@ private:
   bool SelectMUBUFOffset(SDValue Addr, SDValue &SRsrc, SDValue &Soffset,
                          SDValue &Offset) const;

   template <bool IsSigned>
   bool SelectFlatOffset(SDNode *N, SDValue Addr, SDValue &VAddr,
                         SDValue &Offset) const;
   bool SelectGlobalSAddr(SDNode *N, SDValue Addr, SDValue &SAddr,
                          SDValue &VOffset, SDValue &Offset) const;
+  bool SelectScratchSAddr(SDNode *N, SDValue Addr, SDValue &SAddr,
+                          SDValue &Offset) const;

   bool SelectSMRDOffset(SDValue ByteOffsetNode, SDValue &Offset,
                         bool &Imm) const;
   SDValue Expand32BitAddress(SDValue Addr) const;
   bool SelectSMRD(SDValue Addr, SDValue &SBase, SDValue &Offset,
                   bool &Imm) const;
   bool SelectSMRDImm(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
   bool SelectSMRDImm32(SDValue Addr, SDValue &SBase, SDValue &Offset) const;
@@ 1,234 lines not shown @@ std::pair<SDValue, SDValue> AMDGPUDAGToDAGISel::foldFrameIndex(SDValue N) const {
   // be relative to the entry point's scratch wave offset.
   return std::make_pair(N, CurDAG->getTargetConstant(0, DL, MVT::i32));
 }

 bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffen(SDNode *Parent,
                                                  SDValue Addr, SDValue &Rsrc,
                                                  SDValue &VAddr, SDValue &SOffset,
                                                  SDValue &ImmOffset) const {
+  if (Subtarget->enableFlatScratch())
+    return false;

   SDLoc DL(Addr);
   MachineFunction &MF = CurDAG->getMachineFunction();
   const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();

   Rsrc = CurDAG->getRegister(Info->getScratchRSrcReg(), MVT::v4i32);

   if (ConstantSDNode *CAddr = dyn_cast<ConstantSDNode>(Addr)) {
@@ 56 lines not shown @@ bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffen(SDNode *Parent,
   return true;
 }

 bool AMDGPUDAGToDAGISel::SelectMUBUFScratchOffset(SDNode *Parent,
                                                   SDValue Addr,
                                                   SDValue &SRsrc,
                                                   SDValue &SOffset,
                                                   SDValue &Offset) const {
+  if (Subtarget->enableFlatScratch())
+    return false;

   ConstantSDNode *CAddr = dyn_cast<ConstantSDNode>(Addr);
   if (!CAddr || !SIInstrInfo::isLegalMUBUFImmOffset(CAddr->getZExtValue()))
     return false;

   SDLoc DL(Addr);
   MachineFunction &MF = CurDAG->getMachineFunction();
   const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
@@ 102 lines not shown @@

 template <bool IsSigned>
 bool AMDGPUDAGToDAGISel::SelectFlatOffset(SDNode *N,
                                           SDValue Addr,
                                           SDValue &VAddr,
                                           SDValue &Offset) const {
   int64_t OffsetVal = 0;

+  unsigned AS = findMemSDNode(N)->getAddressSpace();
+  if (!Subtarget->enableFlatScratch() && AS == AMDGPUAS::PRIVATE_ADDRESS)
+    return false;
+
   if (Subtarget->hasFlatInstOffsets() &&
       (!Subtarget->hasFlatSegmentOffsetBug() ||
-       findMemSDNode(N)->getAddressSpace() != AMDGPUAS::FLAT_ADDRESS)) {
+       AS != AMDGPUAS::FLAT_ADDRESS)) {
     SDValue N0, N1;
     if (CurDAG->isBaseWithConstantOffset(Addr)) {
       N0 = Addr.getOperand(0);
       N1 = Addr.getOperand(1);
     } else if (getBaseWithOffsetUsingSplitOR(*CurDAG, Addr, N0, N1)) {
       assert(N0 && N1 && isa<ConstantSDNode>(N1));
     }
     if (N0 && N1) {
       uint64_t COffsetVal = cast<ConstantSDNode>(N1)->getSExtValue();

       const SIInstrInfo *TII = Subtarget->getInstrInfo();
-      unsigned AS = findMemSDNode(N)->getAddressSpace();
       if (TII->isLegalFLATOffset(COffsetVal, AS, IsSigned)) {
         Addr = N0;
         OffsetVal = COffsetVal;
       } else {
         // If the offset doesn't fit, put the low bits into the offset field and
         // add the rest.
         //
         // For a FLAT instruction the hardware decides whether to access
@@ 16 lines not shown @@ if (N0 && N1) {
           ImmField = COffsetVal & maskTrailingOnes<uint64_t>(NumBits);
           RemainderOffset = COffsetVal - ImmField;
         }
         assert(TII->isLegalFLATOffset(ImmField, AS, IsSigned));
         assert(RemainderOffset + ImmField == COffsetVal);

         OffsetVal = ImmField;

+        SDValue AddOffsetLo =
+            getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
+        SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);
+
+        if (Addr.getValueType().getSizeInBits() == 32) {
+          SmallVector<SDValue, 3> Opnds;
+          Opnds.push_back(N0);
+          Opnds.push_back(AddOffsetLo);
+          unsigned AddOp = AMDGPU::V_ADD_CO_U32_e32;
+          if (Subtarget->hasAddNoCarry()) {
+            AddOp = AMDGPU::V_ADD_U32_e64;
+            Opnds.push_back(Clamp);
+          }
+          Addr = SDValue(CurDAG->getMachineNode(AddOp, DL, MVT::i32, Opnds), 0);
+        } else {
           // TODO: Should this try to use a scalar add pseudo if the base address
           // is uniform and saddr is usable?
           SDValue Sub0 = CurDAG->getTargetConstant(AMDGPU::sub0, DL, MVT::i32);
           SDValue Sub1 = CurDAG->getTargetConstant(AMDGPU::sub1, DL, MVT::i32);

-        SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG, DL,
-                                              MVT::i32, N0, Sub0);
-        SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG, DL,
-                                              MVT::i32, N0, Sub1);
+          SDNode *N0Lo = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+                                                DL, MVT::i32, N0, Sub0);
+          SDNode *N0Hi = CurDAG->getMachineNode(TargetOpcode::EXTRACT_SUBREG,
+                                                DL, MVT::i32, N0, Sub1);

-        SDValue AddOffsetLo =
-            getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
           SDValue AddOffsetHi =
               getMaterializedScalarImm32(Hi_32(RemainderOffset), DL);

           SDVTList VTs = CurDAG->getVTList(MVT::i32, MVT::i1);
-        SDValue Clamp = CurDAG->getTargetConstant(0, DL, MVT::i1);

           SDNode *Add =
               CurDAG->getMachineNode(AMDGPU::V_ADD_CO_U32_e64, DL, VTs,
                                      {AddOffsetLo, SDValue(N0Lo, 0), Clamp});

           SDNode *Addc = CurDAG->getMachineNode(
               AMDGPU::V_ADDC_U32_e64, DL, VTs,
               {AddOffsetHi, SDValue(N0Hi, 0), SDValue(Add, 1), Clamp});

           SDValue RegSequenceArgs[] = {
               CurDAG->getTargetConstant(AMDGPU::VReg_64RegClassID, DL, MVT::i32),
               SDValue(Add, 0), Sub0, SDValue(Addc, 0), Sub1};

           Addr = SDValue(CurDAG->getMachineNode(AMDGPU::REG_SEQUENCE, DL,
                                                 MVT::i64, RegSequenceArgs),
                          0);
+        }
       }
     }
   }

   VAddr = Addr;
   Offset = CurDAG->getTargetConstant(OffsetVal, SDLoc(), MVT::i16);
   return true;
 }

 // If this matches zero_extend i32:x, return x
 static SDValue matchZExtFromI32(SDValue Op) {
@@ 53 lines not shown @@ bool AMDGPUDAGToDAGISel::SelectGlobalSAddr(SDNode *N,

   if (!SAddr)
     return false;

   Offset = CurDAG->getTargetConstant(ImmOffset, SDLoc(), MVT::i16);
   return true;
 }

+// Match (32-bit SGPR base) + sext(imm offset)
+bool AMDGPUDAGToDAGISel::SelectScratchSAddr(SDNode *N,
+                                            SDValue Addr,
+                                            SDValue &SAddr,
+                                            SDValue &Offset) const {
+  if (!Subtarget->enableFlatScratch() || Addr->isDivergent())
+    return false;
+
+  SAddr = Addr;
+  int64_t COffsetVal = 0;
+
+  if (CurDAG->isBaseWithConstantOffset(Addr)) {
+    COffsetVal = cast<ConstantSDNode>(Addr.getOperand(1))->getSExtValue();
+    SAddr = Addr.getOperand(0);
+  }
+
+  if (auto FI = dyn_cast<FrameIndexSDNode>(SAddr)) {
+    SAddr = CurDAG->getTargetFrameIndex(FI->getIndex(), FI->getValueType(0));
+  } else if (SAddr.getOpcode() == ISD::ADD &&
+             isa<FrameIndexSDNode>(SAddr.getOperand(0))) {
+    // Materialize this into a scalar move for scalar address to avoid
+    // readfirstlane.
+    auto FI = cast<FrameIndexSDNode>(SAddr.getOperand(0));
+    SDValue TFI = CurDAG->getTargetFrameIndex(FI->getIndex(),
+                                              FI->getValueType(0));
+    SAddr = SDValue(CurDAG->getMachineNode(AMDGPU::S_ADD_U32, SDLoc(SAddr),
+                                           MVT::i32, TFI, SAddr.getOperand(1)),
+                    0);
+  }
+
+  const SIInstrInfo *TII = Subtarget->getInstrInfo();
+
+  if (!TII->isLegalFLATOffset(COffsetVal, AMDGPUAS::PRIVATE_ADDRESS, true)) {
+    int64_t RemainderOffset = COffsetVal;
+    int64_t ImmField = 0;
+    const unsigned NumBits = TII->getNumFlatOffsetBits(true);
+    // Use signed division by a power of two to truncate towards 0.
+    int64_t D = 1LL << (NumBits - 1);
+    RemainderOffset = (COffsetVal / D) * D;
+    ImmField = COffsetVal - RemainderOffset;
+
+    assert(TII->isLegalFLATOffset(ImmField, AMDGPUAS::PRIVATE_ADDRESS, true));
+    assert(RemainderOffset + ImmField == COffsetVal);
+
+    COffsetVal = ImmField;
+
+    SDLoc DL(N);
+    SDValue AddOffset =
+        getMaterializedScalarImm32(Lo_32(RemainderOffset), DL);
+    SAddr = SDValue(CurDAG->getMachineNode(AMDGPU::S_ADD_U32, DL, MVT::i32,
+                                           SAddr, AddOffset), 0);
+  }
+
+  Offset = CurDAG->getTargetConstant(COffsetVal, SDLoc(), MVT::i16);
+
+  return true;
+}
+
 bool AMDGPUDAGToDAGISel::SelectSMRDOffset(SDValue ByteOffsetNode,
                                           SDValue &Offset, bool &Imm) const {
   ConstantSDNode *C = dyn_cast<ConstantSDNode>(ByteOffsetNode);
   if (!C) {
     if (ByteOffsetNode.getValueType().isScalarInteger() &&
         ByteOffsetNode.getValueType().getSizeInBits() == 32) {
       Offset = ByteOffsetNode;
       Imm = false;
@@ 1,151 lines not shown @@
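
The offset split used in SelectScratchSAddr can be tried in isolation. The following is a minimal sketch of that arithmetic only (it needs C++14 or later for the constexpr function body). `SplitOffset` and `splitSignedOffset` are hypothetical names, and the 13-bit field width in the examples is an assumption chosen for illustration; the real width comes from `getNumFlatOffsetBits`.

```cpp
#include <cstdint>

// Hypothetical result type: COffsetVal == Remainder + ImmField.
struct SplitOffset {
  int64_t Remainder; // part added to the base address with a scalar add
  int64_t ImmField;  // part that stays in the instruction's offset field
};

// Mirrors the arithmetic in the diff: signed division by a power of two
// truncates towards zero, so ImmField keeps the sign of the input and its
// magnitude stays below D = 2^(NumBits - 1).
constexpr SplitOffset splitSignedOffset(int64_t COffsetVal, unsigned NumBits) {
  const int64_t D = int64_t(1) << (NumBits - 1);
  const int64_t Remainder = (COffsetVal / D) * D;
  return SplitOffset{Remainder, COffsetVal - Remainder};
}

// With an assumed 13-bit signed field, D is 4096.
static_assert(splitSignedOffset(5000, 13).Remainder == 4096, "");
static_assert(splitSignedOffset(5000, 13).ImmField == 904, "");
static_assert(splitSignedOffset(-5000, 13).Remainder == -4096, "");
static_assert(splitSignedOffset(-5000, 13).ImmField == -904, "");
```

Unlike the code in the diff, this helper splits unconditionally; the selector only takes this path when the original offset is not already legal.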

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h

@@ 952 lines not shown @@ public:
   // static wrappers
   static bool hasHalfRate64Ops(const TargetSubtargetInfo &STI);

   // XXX - Why is this here if it isn't in the default pass set?
   bool enableEarlyIfConversion() const override {
     return true;
   }

+  bool enableFlatScratch() const;
+
   void overrideSchedPolicy(MachineSchedPolicy &Policy,
                            unsigned NumRegionInstrs) const override;

   unsigned getMaxNumUserSGPRs() const {
     return 16;
   }

   bool hasSMemRealTime() const {
@@ 37 lines not shown @@ public:
   bool hasDPPWavefrontShifts() const {
     return HasDPP && getGeneration() < GFX10;
   }

   bool hasDPP8() const {
     return HasDPP8;
   }

+  bool hasSGPRNull() const {
+    return getGeneration() >= GFX10;
+  }
+
   bool hasR128A16() const {
     return HasR128A16;
   }

   bool hasGFX10A16() const {
     return HasGFX10A16;
   }

@@ 394 lines not shown @@

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp

Show First 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	static cl::opt<bool> DisablePowerSched(
cl::desc("Disable scheduling to minimize mAI power bursts"),		cl::desc("Disable scheduling to minimize mAI power bursts"),
cl::init(false));		cl::init(false));

static cl::opt<bool> EnableVGPRIndexMode(		static cl::opt<bool> EnableVGPRIndexMode(
"amdgpu-vgpr-index-mode",		"amdgpu-vgpr-index-mode",
cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),		cl::desc("Use GPR indexing mode instead of movrel for vector indexing"),
cl::init(false));		cl::init(false));

		static cl::opt<bool> EnableFlatScratch(
		"amdgpu-enable-flat-scratch",
		cl::desc("Use flat scratch instructions"),
		cl::init(false));

GCNSubtarget::~GCNSubtarget() = default;		GCNSubtarget::~GCNSubtarget() = default;

R600Subtarget &		R600Subtarget &
R600Subtarget::initializeSubtargetDependencies(const Triple &TT,		R600Subtarget::initializeSubtargetDependencies(const Triple &TT,
StringRef GPU, StringRef FS) {		StringRef GPU, StringRef FS) {
SmallString<256> FullFS("+promote-alloca,");		SmallString<256> FullFS("+promote-alloca,");
FullFS += FS;		FullFS += FS;
ParseSubtargetFeatures(GPU, /*TuneCPU*/ GPU, FullFS);		ParseSubtargetFeatures(GPU, /*TuneCPU*/ GPU, FullFS);
[... 221 lines not shown ...]	GCNSubtarget::GCNSubtarget(const Triple &TT, StringRef GPU, StringRef FS,
CallLoweringInfo.reset(new AMDGPUCallLowering(*getTargetLowering()));		CallLoweringInfo.reset(new AMDGPUCallLowering(*getTargetLowering()));
InlineAsmLoweringInfo.reset(new InlineAsmLowering(getTargetLowering()));		InlineAsmLoweringInfo.reset(new InlineAsmLowering(getTargetLowering()));
Legalizer.reset(new AMDGPULegalizerInfo(*this, TM));		Legalizer.reset(new AMDGPULegalizerInfo(*this, TM));
RegBankInfo.reset(new AMDGPURegisterBankInfo(*this));		RegBankInfo.reset(new AMDGPURegisterBankInfo(*this));
InstSelector.reset(new AMDGPUInstructionSelector(		InstSelector.reset(new AMDGPUInstructionSelector(
this, static_cast<AMDGPURegisterBankInfo *>(RegBankInfo.get()), TM));		this, static_cast<AMDGPURegisterBankInfo *>(RegBankInfo.get()), TM));
}		}

		bool GCNSubtarget::enableFlatScratch() const {
		return EnableFlatScratch && hasFlatScratchInsts();
		}

unsigned GCNSubtarget::getConstantBusLimit(unsigned Opcode) const {		unsigned GCNSubtarget::getConstantBusLimit(unsigned Opcode) const {
if (getGeneration() < GFX10)		if (getGeneration() < GFX10)
return 1;		return 1;

switch (Opcode) {		switch (Opcode) {
case AMDGPU::V_LSHLREV_B64:		case AMDGPU::V_LSHLREV_B64:
case AMDGPU::V_LSHLREV_B64_gfx10:		case AMDGPU::V_LSHLREV_B64_gfx10:
case AMDGPU::V_LSHL_B64:		case AMDGPU::V_LSHL_B64:
[... 630 lines not shown ...]
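
The gating added in `GCNSubtarget::enableFlatScratch()` in this file can be sketched in isolation. This is a minimal standalone model, not the LLVM code: `SubtargetModel` and its two fields are hypothetical stand-ins for the `amdgpu-enable-flat-scratch` option and the subtarget's `hasFlatScratchInsts()` query.

```cpp
#include <cassert>

// Hypothetical stand-in for GCNSubtarget; only the two inputs consulted by
// the patch's enableFlatScratch() are modeled here.
struct SubtargetModel {
  bool FlagEnableFlatScratch; // models -amdgpu-enable-flat-scratch (default: false)
  bool HasFlatScratchInsts;   // models hasFlatScratchInsts()

  // Flat scratch is used only when the option is set AND the target has the
  // scratch instructions; the option alone is not sufficient.
  bool enableFlatScratch() const {
    return FlagEnableFlatScratch && HasFlatScratchInsts;
  }
};
```

Because the option defaults to false, existing targets keep using MUBUF stack access unless the flag is passed explicitly.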

llvm/lib/Target/AMDGPU/FLATInstructions.td

//===-- FLATInstructions.td - FLAT Instruction Definitions ----------------===//		//===-- FLATInstructions.td - FLAT Instruction Definitions ----------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

def FLATOffset : ComplexPattern<i64, 2, "SelectFlatOffset<false>", [], [SDNPWantRoot], -10>;		def FLATOffset : ComplexPattern<i64, 2, "SelectFlatOffset<false>", [], [SDNPWantRoot], -10>;
def FLATOffsetSigned : ComplexPattern<i64, 2, "SelectFlatOffset<true>", [], [SDNPWantRoot], -10>;		def FLATOffsetSigned : ComplexPattern<i64, 2, "SelectFlatOffset<true>", [], [SDNPWantRoot], -10>;
		def ScratchOffset : ComplexPattern<i32, 2, "SelectFlatOffset<true>", [], [SDNPWantRoot], -10>;

def GlobalSAddr : ComplexPattern<i64, 3, "SelectGlobalSAddr", [], [SDNPWantRoot], -10>;		def GlobalSAddr : ComplexPattern<i64, 3, "SelectGlobalSAddr", [], [SDNPWantRoot], -10>;
		def ScratchSAddr : ComplexPattern<i32, 2, "SelectScratchSAddr", [], [SDNPWantRoot], -10>;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// FLAT classes		// FLAT classes
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

class FLAT_Pseudo<string opName, dag outs, dag ins,		class FLAT_Pseudo<string opName, dag outs, dag ins,
string asmOps, list<dag> pattern=[]> :		string asmOps, list<dag> pattern=[]> :
InstSI<outs, ins, "", pattern>,		InstSI<outs, ins, "", pattern>,
[... 813 lines not shown ...]
>;		>;

class FlatSignedAtomicPat <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt,		class FlatSignedAtomicPat <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt,
ValueType data_vt = vt> : GCNPat <		ValueType data_vt = vt> : GCNPat <
(vt (node (FLATOffsetSigned i64:$vaddr, i16:$offset), data_vt:$data)),		(vt (node (FLATOffsetSigned i64:$vaddr, i16:$offset), data_vt:$data)),
(inst $vaddr, $data, $offset)		(inst $vaddr, $data, $offset)
>;		>;

		class ScratchLoadSignedPat <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> : GCNPat <
		(vt (node (ScratchOffset (i32 VGPR_32:$vaddr), i16:$offset))),
		(inst $vaddr, $offset)
		>;

		class ScratchSignedLoadPat_D16 <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> : GCNPat <
		Flakebi: Should this be called `ScratchLoadSignedPat_D16`?
		(node (ScratchOffset (i32 VGPR_32:$vaddr), i16:$offset), vt:$in),
		(inst $vaddr, $offset, 0, 0, 0, $in)
		>;

		class ScratchStoreSignedPat <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> : GCNPat <
		(node vt:$data, (ScratchOffset i32:$vaddr, i16:$offset)),
		(inst $vaddr, getVregSrcForVT<vt>.ret:$data, $offset)
		>;

		class ScratchLoadSaddrPat <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> : GCNPat <
		(vt (node (ScratchSAddr (i32 SGPR_32:$saddr), i16:$offset))),
		(inst $saddr, $offset)
		>;

		class ScratchLoadSaddrPat_D16 <FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> : GCNPat <
		(vt (node (ScratchSAddr (i32 SGPR_32:$saddr), i16:$offset), vt:$in)),
		(inst $saddr, $offset, 0, 0, 0, $in)
		>;

		class ScratchStoreSaddrPat <FLAT_Pseudo inst, SDPatternOperator node,
		ValueType vt> : GCNPat <
		(node vt:$data, (ScratchSAddr (i32 SGPR_32:$saddr), i16:$offset)),
		(inst getVregSrcForVT<vt>.ret:$data, $saddr, $offset)
		>;

let OtherPredicates = [HasFlatAddressSpace] in {		let OtherPredicates = [HasFlatAddressSpace] in {

def : FlatLoadPat <FLAT_LOAD_UBYTE, extloadi8_flat, i32>;		def : FlatLoadPat <FLAT_LOAD_UBYTE, extloadi8_flat, i32>;
def : FlatLoadPat <FLAT_LOAD_UBYTE, zextloadi8_flat, i32>;		def : FlatLoadPat <FLAT_LOAD_UBYTE, zextloadi8_flat, i32>;
def : FlatLoadPat <FLAT_LOAD_SBYTE, sextloadi8_flat, i32>;		def : FlatLoadPat <FLAT_LOAD_SBYTE, sextloadi8_flat, i32>;
def : FlatLoadPat <FLAT_LOAD_UBYTE, extloadi8_flat, i16>;		def : FlatLoadPat <FLAT_LOAD_UBYTE, extloadi8_flat, i16>;
def : FlatLoadPat <FLAT_LOAD_UBYTE, zextloadi8_flat, i16>;		def : FlatLoadPat <FLAT_LOAD_UBYTE, zextloadi8_flat, i16>;
def : FlatLoadPat <FLAT_LOAD_SBYTE, sextloadi8_flat, i16>;		def : FlatLoadPat <FLAT_LOAD_SBYTE, sextloadi8_flat, i16>;
[... 141 lines not shown ...]	def : FlatSignedAtomicPatNoRtn <inst, node, vt> {
let AddedComplexity = 10;		let AddedComplexity = 10;
}		}

def : GlobalAtomicNoRtnSaddrPat<!cast<FLAT_Pseudo>(!cast<string>(inst)#"_SADDR"), node, vt> {		def : GlobalAtomicNoRtnSaddrPat<!cast<FLAT_Pseudo>(!cast<string>(inst)#"_SADDR"), node, vt> {
let AddedComplexity = 11;		let AddedComplexity = 11;
}		}
}		}

		multiclass ScratchFLATLoadPats<FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> {
		def : ScratchLoadSignedPat <inst, node, vt> {
		let AddedComplexity = 25;
		}

		def : ScratchLoadSaddrPat<!cast<FLAT_Pseudo>(!cast<string>(inst)#"_SADDR"), node, vt> {
		let AddedComplexity = 26;
		}
		}

		multiclass ScratchFLATStorePats<FLAT_Pseudo inst, SDPatternOperator node,
		ValueType vt> {
		def : ScratchStoreSignedPat <inst, node, vt> {
		let AddedComplexity = 25;
		}

		def : ScratchStoreSaddrPat<!cast<FLAT_Pseudo>(!cast<string>(inst)#"_SADDR"), node, vt> {
		let AddedComplexity = 26;
		}
		}

		multiclass ScratchFLATLoadPats_D16<FLAT_Pseudo inst, SDPatternOperator node, ValueType vt> {
		def : ScratchSignedLoadPat_D16 <inst, node, vt> {
		let AddedComplexity = 25;
		}

		def : ScratchLoadSaddrPat_D16<!cast<FLAT_Pseudo>(!cast<string>(inst)#"_SADDR"), node, vt> {
		let AddedComplexity = 26;
		}
		}

let OtherPredicates = [HasFlatGlobalInsts] in {		let OtherPredicates = [HasFlatGlobalInsts] in {

defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, extloadi8_global, i32>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, extloadi8_global, i32>;
defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, zextloadi8_global, i32>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, zextloadi8_global, i32>;
defm : GlobalFLATLoadPats <GLOBAL_LOAD_SBYTE, sextloadi8_global, i32>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_SBYTE, sextloadi8_global, i32>;
defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, extloadi8_global, i16>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, extloadi8_global, i16>;
defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, zextloadi8_global, i16>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_UBYTE, zextloadi8_global, i16>;
defm : GlobalFLATLoadPats <GLOBAL_LOAD_SBYTE, sextloadi8_global, i16>;		defm : GlobalFLATLoadPats <GLOBAL_LOAD_SBYTE, sextloadi8_global, i16>;
[... 84 lines not shown ...]

let OtherPredicates = [HasAtomicFaddInsts] in {		let OtherPredicates = [HasAtomicFaddInsts] in {
defm : GlobalFLATNoRtnAtomicPats <GLOBAL_ATOMIC_ADD_F32, atomic_load_fadd_global_noret_32, f32>;		defm : GlobalFLATNoRtnAtomicPats <GLOBAL_ATOMIC_ADD_F32, atomic_load_fadd_global_noret_32, f32>;
defm : GlobalFLATNoRtnAtomicPats <GLOBAL_ATOMIC_PK_ADD_F16, atomic_load_fadd_v2f16_global_noret_32, v2f16>;		defm : GlobalFLATNoRtnAtomicPats <GLOBAL_ATOMIC_PK_ADD_F16, atomic_load_fadd_v2f16_global_noret_32, v2f16>;
}		}

} // End OtherPredicates = [HasFlatGlobalInsts], AddedComplexity = 10		} // End OtherPredicates = [HasFlatGlobalInsts], AddedComplexity = 10

		let OtherPredicates = [HasFlatScratchInsts] in {

		defm : ScratchFLATLoadPats <SCRATCH_LOAD_UBYTE, extloadi8_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_UBYTE, zextloadi8_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_SBYTE, sextloadi8_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_UBYTE, extloadi8_private, i16>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_UBYTE, zextloadi8_private, i16>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_SBYTE, sextloadi8_private, i16>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_USHORT, extloadi16_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_USHORT, zextloadi16_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_SSHORT, sextloadi16_private, i32>;
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_USHORT, load_private, i16>;

		foreach vt = Reg32Types.types in {
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_DWORD, load_private, vt>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_DWORD, store_private, vt>;
		}

		foreach vt = VReg_64.RegTypes in {
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_DWORDX2, load_private, vt>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_DWORDX2, store_private, vt>;
		}

		defm : ScratchFLATLoadPats <SCRATCH_LOAD_DWORDX3, load_private, v3i32>;

		foreach vt = VReg_128.RegTypes in {
		defm : ScratchFLATLoadPats <SCRATCH_LOAD_DWORDX4, load_private, vt>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_DWORDX4, store_private, vt>;
		}

		defm : ScratchFLATStorePats <SCRATCH_STORE_BYTE, truncstorei8_private, i32>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_BYTE, truncstorei8_private, i16>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_SHORT, truncstorei16_private, i32>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_SHORT, store_private, i16>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_DWORDX3, store_private, v3i32>;

		let OtherPredicates = [D16PreservesUnusedBits, HasFlatScratchInsts] in {
		defm : ScratchFLATStorePats <SCRATCH_STORE_SHORT_D16_HI, truncstorei16_hi16_private, i32>;
		defm : ScratchFLATStorePats <SCRATCH_STORE_BYTE_D16_HI, truncstorei8_hi16_private, i32>;

		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_UBYTE_D16_HI, az_extloadi8_d16_hi_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_UBYTE_D16_HI, az_extloadi8_d16_hi_private, v2f16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SBYTE_D16_HI, sextloadi8_d16_hi_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SBYTE_D16_HI, sextloadi8_d16_hi_private, v2f16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SHORT_D16_HI, load_d16_hi_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SHORT_D16_HI, load_d16_hi_private, v2f16>;

		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_UBYTE_D16, az_extloadi8_d16_lo_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_UBYTE_D16, az_extloadi8_d16_lo_private, v2f16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SBYTE_D16, sextloadi8_d16_lo_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SBYTE_D16, sextloadi8_d16_lo_private, v2f16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SHORT_D16, load_d16_lo_private, v2i16>;
		defm : ScratchFLATLoadPats_D16 <SCRATCH_LOAD_SHORT_D16, load_d16_lo_private, v2f16>;
		}

		} // End OtherPredicates = [HasFlatScratchInsts]

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Target		// Target
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// CI		// CI
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
[... 422 lines not shown ...]

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

	[... 167 lines not shown ...]

	// TODO: Add heuristic that the frame index might not fit in the addressing mode			// TODO: Add heuristic that the frame index might not fit in the addressing mode
	// immediate offset to avoid materializing in loops.			// immediate offset to avoid materializing in loops.
	static bool frameIndexMayFold(const SIInstrInfo *TII,			static bool frameIndexMayFold(const SIInstrInfo *TII,
	const MachineInstr &UseMI,			const MachineInstr &UseMI,
	int OpNo,			int OpNo,
	const MachineOperand &OpToFold) {			const MachineOperand &OpToFold) {
	return OpToFold.isFI() &&			return OpToFold.isFI() &&
	(TII->isMUBUF(UseMI) || TII->isFLATScratch(UseMI)) &&			TII->isMUBUF(UseMI) &&
	OpNo == AMDGPU::getNamedOperandIdx(UseMI.getOpcode(), AMDGPU::OpName::vaddr);			OpNo == AMDGPU::getNamedOperandIdx(UseMI.getOpcode(), AMDGPU::OpName::vaddr);
	}			}

	FunctionPass *llvm::createSIFoldOperandsPass() {			FunctionPass *llvm::createSIFoldOperandsPass() {
	return new SIFoldOperands();			return new SIFoldOperands();
	}			}

	static bool updateOperand(FoldCandidate &Fold,			static bool updateOperand(FoldCandidate &Fold,
	[... 1,390 lines not shown ...]

llvm/lib/Target/AMDGPU/SIFrameLowering.cpp

[... 359 lines not shown ...]	if (!MRI.isPhysRegUsed(Reg) && MRI.isAllocatable(Reg) &&
MFI->setScratchRSrcReg(Reg);		MFI->setScratchRSrcReg(Reg);
return Reg;		return Reg;
}		}
}		}

return ScratchRsrcReg;		return ScratchRsrcReg;
}		}

		static unsigned getScratchScaleFactor(const GCNSubtarget &ST) {
		scott.linder: s/Scracth/Scratch/
		return ST.enableFlatScratch() ? 1 : ST.getWavefrontSize();
		}

void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,		void SIFrameLowering::emitEntryFunctionPrologue(MachineFunction &MF,
MachineBasicBlock &MBB) const {		MachineBasicBlock &MBB) const {
assert(&MF.front() == &MBB && "Shrink-wrapping not yet supported");		assert(&MF.front() == &MBB && "Shrink-wrapping not yet supported");

// FIXME: If we only have SGPR spills, we won't actually be using scratch		// FIXME: If we only have SGPR spills, we won't actually be using scratch
// memory since these spill to VGPRs. We should be cleaning up these unused		// memory since these spill to VGPRs. We should be cleaning up these unused
// SGPR spill frame indices somewhere.		// SGPR spill frame indices somewhere.

[... 80 lines not shown ...]	if (TRI->isSubRegisterEq(ScratchRsrcReg, PreloadedScratchWaveOffsetReg)) {
ScratchWaveOffsetReg = PreloadedScratchWaveOffsetReg;		ScratchWaveOffsetReg = PreloadedScratchWaveOffsetReg;
}		}
assert(ScratchWaveOffsetReg);		assert(ScratchWaveOffsetReg);

if (requiresStackPointerReference(MF)) {		if (requiresStackPointerReference(MF)) {
Register SPReg = MFI->getStackPtrOffsetReg();		Register SPReg = MFI->getStackPtrOffsetReg();
assert(SPReg != AMDGPU::SP_REG);		assert(SPReg != AMDGPU::SP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg)		BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), SPReg)
.addImm(MF.getFrameInfo().getStackSize() * ST.getWavefrontSize());		.addImm(MF.getFrameInfo().getStackSize() * getScratchScaleFactor(ST));
}		}

if (hasFP(MF)) {		if (hasFP(MF)) {
Register FPReg = MFI->getFrameOffsetReg();		Register FPReg = MFI->getFrameOffsetReg();
assert(FPReg != AMDGPU::FP_REG);		assert(FPReg != AMDGPU::FP_REG);
BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);		BuildMI(MBB, I, DL, TII->get(AMDGPU::S_MOV_B32), FPReg).addImm(0);
}		}

if (MFI->hasFlatScratchInit() || ScratchRsrcReg) {		if (MFI->hasFlatScratchInit() || ScratchRsrcReg) {
MRI.addLiveIn(PreloadedScratchWaveOffsetReg);		MRI.addLiveIn(PreloadedScratchWaveOffsetReg);
MBB.addLiveIn(PreloadedScratchWaveOffsetReg);		MBB.addLiveIn(PreloadedScratchWaveOffsetReg);
}		}

if (MFI->hasFlatScratchInit()) {		if (MFI->hasFlatScratchInit()) {
		Flakebi: Should this be `MFI->hasFlatScratchInit() || (ST.enableFlatScratch() && requiresStackPointerReference(MF))`? Otherwise, the scratch does not get initialized (I guess it’s fine to do that in a later patch).
		rampitec (author): Do you see it not initialized? In SIMachineFunctionInfo() there is this code:
		  if (ST.hasFlatAddressSpace() && isEntryFunction() && isAmdHsaOrMesa) {
		    // TODO: This could be refined a lot. The attribute is a poor way of
		    // detecting calls or stack objects that may require it before argument
		    // lowering.
		    if (HasCalls || HasStackObjects)
		      FlatScratchInit = true;
		  }
		So I assume it has to be initialized. The culprit is probably the isAmdHsaOrMesa condition; it may need to be (isAmdHsaOrMesa || ST.enableFlatScratch()) instead. For some reason this code is not executed for amdpal, and I do not see an obvious reason why.
		rampitec (author): Actually, I see it uninitialized in my own test. But the code as suggested does not always work, because requiresStackPointerReference() is not necessarily true if we have just private loads; we need to make sure hasFlatScratchInit() is set.
emitEntryFunctionFlatScratchInit(MF, MBB, I, DL, ScratchWaveOffsetReg);		emitEntryFunctionFlatScratchInit(MF, MBB, I, DL, ScratchWaveOffsetReg);
}		}
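
The objection raised in the review thread above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `EntryFunctionFacts` and `suggestedInitCondition` are hypothetical names, and the three booleans merely stand in for `MFI->hasFlatScratchInit()`, `ST.enableFlatScratch()` and `requiresStackPointerReference(MF)`.

```cpp
#include <cassert>

// Hypothetical summary of the per-function facts discussed in the thread.
struct EntryFunctionFacts {
  bool HasFlatScratchInit;      // models MFI->hasFlatScratchInit()
  bool EnableFlatScratch;       // models ST.enableFlatScratch()
  bool RequiresStackPointerRef; // models requiresStackPointerReference(MF)
};

// The condition proposed in the review comment for emitting the flat
// scratch initialization in the entry prologue.
inline bool suggestedInitCondition(const EntryFunctionFacts &F) {
  return F.HasFlatScratchInit ||
         (F.EnableFlatScratch && F.RequiresStackPointerRef);
}
```

For a function that only performs private loads, with no stack pointer reference and `hasFlatScratchInit()` unset, the proposed condition evaluates to false, which is the case the author says must instead be handled by making sure `hasFlatScratchInit()` is set.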

if (ScratchRsrcReg) {		if (ScratchRsrcReg) {
emitEntryFunctionScratchRsrcRegSetup(MF, MBB, I, DL,		emitEntryFunctionScratchRsrcRegSetup(MF, MBB, I, DL,
PreloadedScratchRsrcReg,		PreloadedScratchRsrcReg,
ScratchRsrcReg, ScratchWaveOffsetReg);		ScratchRsrcReg, ScratchWaveOffsetReg);
}		}
[... 396 lines not shown ...]	Register ScratchSPReg = findScratchNonCalleeSaveRegister(
MRI, LiveRegs, AMDGPU::SReg_32_XM0RegClass);		MRI, LiveRegs, AMDGPU::SReg_32_XM0RegClass);
assert(ScratchSPReg && ScratchSPReg != FuncInfo->SGPRForFPSaveRestoreCopy &&		assert(ScratchSPReg && ScratchSPReg != FuncInfo->SGPRForFPSaveRestoreCopy &&
ScratchSPReg != FuncInfo->SGPRForBPSaveRestoreCopy);		ScratchSPReg != FuncInfo->SGPRForBPSaveRestoreCopy);

// s_add_u32 tmp_reg, s32, NumBytes		// s_add_u32 tmp_reg, s32, NumBytes
// s_and_b32 s32, tmp_reg, 0b111...0000		// s_and_b32 s32, tmp_reg, 0b111...0000
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_U32), ScratchSPReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_U32), ScratchSPReg)
.addReg(StackPtrReg)		.addReg(StackPtrReg)
.addImm((Alignment - 1) * ST.getWavefrontSize())		.addImm((Alignment - 1) * getScratchScaleFactor(ST))
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_AND_B32), FramePtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_AND_B32), FramePtrReg)
.addReg(ScratchSPReg, RegState::Kill)		.addReg(ScratchSPReg, RegState::Kill)
.addImm(-Alignment * ST.getWavefrontSize())		.addImm(-Alignment * getScratchScaleFactor(ST))
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
FuncInfo->setIsStackRealigned(true);		FuncInfo->setIsStackRealigned(true);
} else if ((HasFP = hasFP(MF))) {		} else if ((HasFP = hasFP(MF))) {
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), FramePtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), FramePtrReg)
.addReg(StackPtrReg)		.addReg(StackPtrReg)
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
}		}

// If we need a base pointer, set it up here. It's whatever the value of		// If we need a base pointer, set it up here. It's whatever the value of
// the stack pointer is at this point. Any variable size objects will be		// the stack pointer is at this point. Any variable size objects will be
// allocated after this, so we can still use the base pointer to reference		// allocated after this, so we can still use the base pointer to reference
// the incoming arguments.		// the incoming arguments.
if ((HasBP = TRI.hasBasePointer(MF))) {		if ((HasBP = TRI.hasBasePointer(MF))) {
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), BasePtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), BasePtrReg)
.addReg(StackPtrReg)		.addReg(StackPtrReg)
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
}		}

if (HasFP && RoundedSize != 0) {		if (HasFP && RoundedSize != 0) {
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_U32), StackPtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_ADD_U32), StackPtrReg)
.addReg(StackPtrReg)		.addReg(StackPtrReg)
.addImm(RoundedSize * ST.getWavefrontSize())		.addImm(RoundedSize * getScratchScaleFactor(ST))
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
}		}

assert((!HasFP || (FuncInfo->SGPRForFPSaveRestoreCopy ||		assert((!HasFP || (FuncInfo->SGPRForFPSaveRestoreCopy ||
FuncInfo->FramePointerSaveIndex)) &&		FuncInfo->FramePointerSaveIndex)) &&
"Needed to save FP but didn't save it anywhere");		"Needed to save FP but didn't save it anywhere");

assert((HasFP || (!FuncInfo->SGPRForFPSaveRestoreCopy &&		assert((HasFP || (!FuncInfo->SGPRForFPSaveRestoreCopy &&
[... 45 lines not shown ...]	void SIFrameLowering::emitEpilogue(MachineFunction &MF,
if (HasBPSaveIndex) {		if (HasBPSaveIndex) {
SpillBPToMemory = MFI.getStackID(*FuncInfo->BasePointerSaveIndex) !=		SpillBPToMemory = MFI.getStackID(*FuncInfo->BasePointerSaveIndex) !=
TargetStackID::SGPRSpill;		TargetStackID::SGPRSpill;
}		}

if (RoundedSize != 0 && hasFP(MF)) {		if (RoundedSize != 0 && hasFP(MF)) {
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_SUB_U32), StackPtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::S_SUB_U32), StackPtrReg)
.addReg(StackPtrReg)		.addReg(StackPtrReg)
.addImm(RoundedSize * ST.getWavefrontSize())		.addImm(RoundedSize * getScratchScaleFactor(ST))
.setMIFlag(MachineInstr::FrameDestroy);		.setMIFlag(MachineInstr::FrameDestroy);
}		}

if (FuncInfo->SGPRForFPSaveRestoreCopy) {		if (FuncInfo->SGPRForFPSaveRestoreCopy) {
BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), FramePtrReg)		BuildMI(MBB, MBBI, DL, TII->get(AMDGPU::COPY), FramePtrReg)
.addReg(FuncInfo->SGPRForFPSaveRestoreCopy)		.addReg(FuncInfo->SGPRForFPSaveRestoreCopy)
.setMIFlag(MachineInstr::FrameSetup);		.setMIFlag(MachineInstr::FrameSetup);
}		}
[... 271 lines not shown ...]	if (!hasReservedCallFrame(MF)) {
Amount = alignTo(Amount, getStackAlign());		Amount = alignTo(Amount, getStackAlign());
assert(isUInt<32>(Amount) && "exceeded stack address space size");		assert(isUInt<32>(Amount) && "exceeded stack address space size");
const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();		const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
Register SPReg = MFI->getStackPtrOffsetReg();		Register SPReg = MFI->getStackPtrOffsetReg();

unsigned Op = IsDestroy ? AMDGPU::S_SUB_U32 : AMDGPU::S_ADD_U32;		unsigned Op = IsDestroy ? AMDGPU::S_SUB_U32 : AMDGPU::S_ADD_U32;
BuildMI(MBB, I, DL, TII->get(Op), SPReg)		BuildMI(MBB, I, DL, TII->get(Op), SPReg)
.addReg(SPReg)		.addReg(SPReg)
.addImm(Amount * ST.getWavefrontSize());		.addImm(Amount * getScratchScaleFactor(ST));
} else if (CalleePopAmount != 0) {		} else if (CalleePopAmount != 0) {
llvm_unreachable("is this used?");		llvm_unreachable("is this used?");
}		}

return MBB.erase(I);		return MBB.erase(I);
}		}

/// Returns true if the frame will require a reference to the stack pointer.		/// Returns true if the frame will require a reference to the stack pointer.
[... 56 lines not shown ...]
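
The effect of replacing `ST.getWavefrontSize()` with `getScratchScaleFactor(ST)` throughout this file can be sketched with plain integers. This is a standalone illustration, not the LLVM code: the function names below are hypothetical, and the realignment helper assumes that both the alignment and the scale are powers of two.

```cpp
#include <cassert>
#include <cstdint>

// Models getScratchScaleFactor(): with flat scratch the SP holds a swizzled
// byte offset and needs no scaling; with MUBUF it holds an unswizzled
// offset scaled by the wavefront size.
inline uint32_t scratchScaleFactor(bool EnableFlatScratch,
                                   uint32_t WavefrontSize) {
  return EnableFlatScratch ? 1u : WavefrontSize;
}

// Models the s_add_u32 / s_and_b32 pair emitted for stack realignment.
// Assumes Alignment and Scale are powers of two.
inline uint32_t realignFramePointer(uint32_t SP, uint32_t Alignment,
                                    uint32_t Scale) {
  uint32_t Tmp = SP + (Alignment - 1) * Scale; // s_add_u32 tmp, sp, (Align-1)*Scale
  return Tmp & (0u - Alignment * Scale);       // s_and_b32 fp, tmp, -Align*Scale
}
```

With a scale of 1 a stack pointer of 20 realigned to 16 bytes becomes 32; with a wave64 scale of 64 the same per-lane values appear multiplied by 64, so the two stack pointer conventions cannot be mixed in one function.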

llvm/lib/Target/AMDGPU/SIRegisterInfo.h

[... 77 lines not shown ...]	public:
bool canRealignStack(const MachineFunction &MF) const override;		bool canRealignStack(const MachineFunction &MF) const override;
bool requiresRegisterScavenging(const MachineFunction &Fn) const override;		bool requiresRegisterScavenging(const MachineFunction &Fn) const override;

bool requiresFrameIndexScavenging(const MachineFunction &MF) const override;		bool requiresFrameIndexScavenging(const MachineFunction &MF) const override;
bool requiresFrameIndexReplacementScavenging(		bool requiresFrameIndexReplacementScavenging(
const MachineFunction &MF) const override;		const MachineFunction &MF) const override;
bool requiresVirtualBaseRegisters(const MachineFunction &Fn) const override;		bool requiresVirtualBaseRegisters(const MachineFunction &Fn) const override;

int64_t getMUBUFInstrOffset(const MachineInstr *MI) const;		int64_t getScratchInstrOffset(const MachineInstr *MI) const;

int64_t getFrameIndexInstrOffset(const MachineInstr *MI,		int64_t getFrameIndexInstrOffset(const MachineInstr *MI,
int Idx) const override;		int Idx) const override;

bool needsFrameBaseReg(MachineInstr *MI, int64_t Offset) const override;		bool needsFrameBaseReg(MachineInstr *MI, int64_t Offset) const override;

void materializeFrameBaseRegister(MachineBasicBlock *MBB, Register BaseReg,		void materializeFrameBaseRegister(MachineBasicBlock *MBB, Register BaseReg,
int FrameIdx,		int FrameIdx,
[... 249 lines not shown ...]

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp

[... 409 lines not shown ...]
}		}

bool SIRegisterInfo::requiresVirtualBaseRegisters(		bool SIRegisterInfo::requiresVirtualBaseRegisters(
const MachineFunction &) const {		const MachineFunction &) const {
// There are no special dedicated stack or frame pointers.		// There are no special dedicated stack or frame pointers.
return true;		return true;
}		}

int64_t SIRegisterInfo::getMUBUFInstrOffset(const MachineInstr *MI) const {		int64_t SIRegisterInfo::getScratchInstrOffset(const MachineInstr *MI) const {
assert(SIInstrInfo::isMUBUF(*MI));		assert(SIInstrInfo::isMUBUF(*MI) || SIInstrInfo::isFLATScratch(*MI));

int OffIdx = AMDGPU::getNamedOperandIdx(MI->getOpcode(),		int OffIdx = AMDGPU::getNamedOperandIdx(MI->getOpcode(),
AMDGPU::OpName::offset);		AMDGPU::OpName::offset);
return MI->getOperand(OffIdx).getImm();		return MI->getOperand(OffIdx).getImm();
}		}

int64_t SIRegisterInfo::getFrameIndexInstrOffset(const MachineInstr *MI,		int64_t SIRegisterInfo::getFrameIndexInstrOffset(const MachineInstr *MI,
int Idx) const {		int Idx) const {
if (!SIInstrInfo::isMUBUF(*MI))		if (!SIInstrInfo::isMUBUF(*MI) && !SIInstrInfo::isFLATScratch(*MI))
return 0;		return 0;

assert(Idx == AMDGPU::getNamedOperandIdx(MI->getOpcode(),		assert((Idx == AMDGPU::getNamedOperandIdx(MI->getOpcode(),
AMDGPU::OpName::vaddr) &&		AMDGPU::OpName::vaddr) ||
		(Idx == AMDGPU::getNamedOperandIdx(MI->getOpcode(),
		AMDGPU::OpName::saddr))) &&
"Should never see frame index on non-address operand");		"Should never see frame index on non-address operand");

return getMUBUFInstrOffset(MI);		return getScratchInstrOffset(MI);
}		}

bool SIRegisterInfo::needsFrameBaseReg(MachineInstr *MI, int64_t Offset) const {		bool SIRegisterInfo::needsFrameBaseReg(MachineInstr *MI, int64_t Offset) const {
if (!MI->mayLoadOrStore())		if (!MI->mayLoadOrStore())
return false;		return false;

int64_t FullOffset = Offset + getMUBUFInstrOffset(MI);		int64_t FullOffset = Offset + getScratchInstrOffset(MI);

		if (SIInstrInfo::isMUBUF(*MI))
return !SIInstrInfo::isLegalMUBUFImmOffset(FullOffset);		return !SIInstrInfo::isLegalMUBUFImmOffset(FullOffset);

		const SIInstrInfo *TII = ST.getInstrInfo();
		return TII->isLegalFLATOffset(FullOffset, AMDGPUAS::PRIVATE_ADDRESS, true);
}		}

void SIRegisterInfo::materializeFrameBaseRegister(MachineBasicBlock *MBB,		void SIRegisterInfo::materializeFrameBaseRegister(MachineBasicBlock *MBB,
Register BaseReg,		Register BaseReg,
int FrameIdx,		int FrameIdx,
int64_t Offset) const {		int64_t Offset) const {
MachineBasicBlock::iterator Ins = MBB->begin();		MachineBasicBlock::iterator Ins = MBB->begin();
DebugLoc DL; // Defaults to "unknown"		DebugLoc DL; // Defaults to "unknown"

if (Ins != MBB->end())		if (Ins != MBB->end())
DL = Ins->getDebugLoc();		DL = Ins->getDebugLoc();

MachineFunction *MF = MBB->getParent();		MachineFunction *MF = MBB->getParent();
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
		unsigned MovOpc = ST.enableFlatScratch() ? AMDGPU::S_MOV_B32
		: AMDGPU::V_MOV_B32_e32;

if (Offset == 0) {		if (Offset == 0) {
BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_MOV_B32_e32), BaseReg)		BuildMI(*MBB, Ins, DL, TII->get(MovOpc), BaseReg)
.addFrameIndex(FrameIdx);		.addFrameIndex(FrameIdx);
return;		return;
}		}

MachineRegisterInfo &MRI = MF->getRegInfo();		MachineRegisterInfo &MRI = MF->getRegInfo();
Register OffsetReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);		Register OffsetReg = MRI.createVirtualRegister(&AMDGPU::SReg_32_XM0RegClass);

Register FIReg = MRI.createVirtualRegister(&AMDGPU::VGPR_32RegClass);		Register FIReg = MRI.createVirtualRegister(
		ST.enableFlatScratch() ? &AMDGPU::SReg_32_XM0RegClass
		: &AMDGPU::VGPR_32RegClass);

BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg)		BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::S_MOV_B32), OffsetReg)
.addImm(Offset);		.addImm(Offset);
BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::V_MOV_B32_e32), FIReg)		BuildMI(*MBB, Ins, DL, TII->get(MovOpc), FIReg)
.addFrameIndex(FrameIdx);		.addFrameIndex(FrameIdx);

		if (ST.enableFlatScratch() ) {
		BuildMI(*MBB, Ins, DL, TII->get(AMDGPU::S_ADD_U32), BaseReg)
		.addReg(OffsetReg, RegState::Kill)
		.addReg(FIReg);
		return;
		}

TII->getAddNoCarry(*MBB, Ins, DL, BaseReg)		TII->getAddNoCarry(*MBB, Ins, DL, BaseReg)
.addReg(OffsetReg, RegState::Kill)		.addReg(OffsetReg, RegState::Kill)
.addReg(FIReg)		.addReg(FIReg)
.addImm(0); // clamp bit		.addImm(0); // clamp bit
}		}

void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, Register BaseReg,		void SIRegisterInfo::resolveFrameIndex(MachineInstr &MI, Register BaseReg,
int64_t Offset) const {		int64_t Offset) const {
const SIInstrInfo *TII = ST.getInstrInfo();		const SIInstrInfo *TII = ST.getInstrInfo();
		bool IsFlat = TII->isFLATScratch(MI);

#ifndef NDEBUG		#ifndef NDEBUG
// FIXME: Is it possible to be storing a frame index to itself?		// FIXME: Is it possible to be storing a frame index to itself?
bool SeenFI = false;		bool SeenFI = false;
for (const MachineOperand &MO: MI.operands()) {		for (const MachineOperand &MO: MI.operands()) {
if (MO.isFI()) {		if (MO.isFI()) {
if (SeenFI)		if (SeenFI)
llvm_unreachable("should not see multiple frame indices");		llvm_unreachable("should not see multiple frame indices");

SeenFI = true;		SeenFI = true;
}		}
}		}
#endif		#endif

MachineOperand *FIOp = TII->getNamedOperand(MI, AMDGPU::OpName::vaddr);		MachineOperand *FIOp =
		TII->getNamedOperand(MI, IsFlat ? AMDGPU::OpName::saddr
		: AMDGPU::OpName::vaddr);
#ifndef NDEBUG		#ifndef NDEBUG
MachineBasicBlock *MBB = MI.getParent();		MachineBasicBlock *MBB = MI.getParent();
MachineFunction *MF = MBB->getParent();		MachineFunction *MF = MBB->getParent();
#endif		#endif
assert(FIOp && FIOp->isFI() && "frame index must be address operand");		assert(FIOp && FIOp->isFI() && "frame index must be address operand");
assert(TII->isMUBUF(MI));		assert(TII->isMUBUF(MI) || TII->isFLATScratch(MI));

		MachineOperand *OffsetOp = TII->getNamedOperand(MI, AMDGPU::OpName::offset);
		int64_t NewOffset = OffsetOp->getImm() + Offset;

		if (IsFlat) {
		assert(TII->isLegalFLATOffset(NewOffset, AMDGPUAS::PRIVATE_ADDRESS, true) &&
		"offset should be legal");
		FIOp->ChangeToRegister(BaseReg, false);
		OffsetOp->setImm(NewOffset);
		return;
		}

MachineOperand *SOffset = TII->getNamedOperand(MI, AMDGPU::OpName::soffset);		MachineOperand *SOffset = TII->getNamedOperand(MI, AMDGPU::OpName::soffset);
assert(SOffset->getReg() ==		assert(SOffset->getReg() ==
MF->getInfo<SIMachineFunctionInfo>()->getStackPtrOffsetReg() &&		MF->getInfo<SIMachineFunctionInfo>()->getStackPtrOffsetReg() &&
"should only be seeing stack pointer offset relative FrameIndex");		"should only be seeing stack pointer offset relative FrameIndex");

MachineOperand *OffsetOp = TII->getNamedOperand(MI, AMDGPU::OpName::offset);
int64_t NewOffset = OffsetOp->getImm() + Offset;
assert(SIInstrInfo::isLegalMUBUFImmOffset(NewOffset) &&		assert(SIInstrInfo::isLegalMUBUFImmOffset(NewOffset) &&
"offset should be legal");		"offset should be legal");

FIOp->ChangeToRegister(BaseReg, false);		FIOp->ChangeToRegister(BaseReg, false);
OffsetOp->setImm(NewOffset);		OffsetOp->setImm(NewOffset);

// The move materializing the base address will be an absolute stack address,		// The move materializing the base address will be an absolute stack address,
// so clear the base offset.		// so clear the base offset.
SOffset->ChangeToImmediate(0);		SOffset->ChangeToImmediate(0);
}		}

bool SIRegisterInfo::isFrameOffsetLegal(const MachineInstr *MI,		bool SIRegisterInfo::isFrameOffsetLegal(const MachineInstr *MI,
Register BaseReg,		Register BaseReg,
int64_t Offset) const {		int64_t Offset) const {
if (!SIInstrInfo::isMUBUF(*MI))		if (!SIInstrInfo::isMUBUF(MI) && !!SIInstrInfo::isFLATScratch(MI))
return false;		return false;

int64_t NewOffset = Offset + getMUBUFInstrOffset(MI);		int64_t NewOffset = Offset + getScratchInstrOffset(MI);

		if (SIInstrInfo::isMUBUF(*MI))
return SIInstrInfo::isLegalMUBUFImmOffset(NewOffset);		return SIInstrInfo::isLegalMUBUFImmOffset(NewOffset);

		const SIInstrInfo *TII = ST.getInstrInfo();
		return TII->isLegalFLATOffset(NewOffset, AMDGPUAS::PRIVATE_ADDRESS, true);
}		}

const TargetRegisterClass *SIRegisterInfo::getPointerRegClass(		const TargetRegisterClass *SIRegisterInfo::getPointerRegClass(
const MachineFunction &MF, unsigned Kind) const {		const MachineFunction &MF, unsigned Kind) const {
// This is inaccurate. It depends on the instruction and address space. The		// This is inaccurate. It depends on the instruction and address space. The
// only place where we should hit this is for dealing with frame indexes /		// only place where we should hit this is for dealing with frame indexes /
// private accesses, so this is correct in that case.		// private accesses, so this is correct in that case.
return &AMDGPU::VGPR_32RegClass;		return &AMDGPU::VGPR_32RegClass;
[224 lines not shown]	void SIRegisterInfo::buildSpillLoadStore(MachineBasicBlock::iterator MI,
int64_t Offset = InstOffset + MFI.getObjectOffset(Index);		int64_t Offset = InstOffset + MFI.getObjectOffset(Index);
int64_t ScratchOffsetRegDelta = 0;		int64_t ScratchOffsetRegDelta = 0;

Align Alignment = MFI.getObjectAlign(Index);		Align Alignment = MFI.getObjectAlign(Index);
const MachinePointerInfo &BasePtrInfo = MMO->getPointerInfo();		const MachinePointerInfo &BasePtrInfo = MMO->getPointerInfo();

assert((Offset % EltSize) == 0 && "unexpected VGPR spill offset");		assert((Offset % EltSize) == 0 && "unexpected VGPR spill offset");

if (!SIInstrInfo::isLegalMUBUFImmOffset(Offset + Size - EltSize)) {		if (!SIInstrInfo::isLegalMUBUFImmOffset(Offset + Size - EltSize)) {
SOffset = MCRegister();		SOffset = MCRegister();
		arsenm (inline comment, Done): This looks backwards with the negated conditions

// We currently only support spilling VGPRs to EltSize boundaries, meaning		// We currently only support spilling VGPRs to EltSize boundaries, meaning
// we can simplify the adjustment of Offset here to just scale with		// we can simplify the adjustment of Offset here to just scale with
// WavefrontSize.		// WavefrontSize.
Offset *= ST.getWavefrontSize();		Offset *= ST.getWavefrontSize();

// We don't have access to the register scavenger if this function is called		// We don't have access to the register scavenger if this function is called
// during PEI::scavengeFrameVirtualRegs().		// during PEI::scavengeFrameVirtualRegs().
[567 lines not shown]	case AMDGPU::SI_SPILL_A1024_RESTORE: {
*MI->memoperands_begin(),		*MI->memoperands_begin(),
RS);		RS);
MI->eraseFromParent();		MI->eraseFromParent();
break;		break;
}		}

default: {		default: {
const DebugLoc &DL = MI->getDebugLoc();		const DebugLoc &DL = MI->getDebugLoc();

		if (TII->isFLATScratch(*MI)) {
		// The offset is always swizzled, just replace it
		if (FrameReg)
		FIOp.ChangeToRegister(FrameReg, false);

		int64_t Offset = FrameInfo.getObjectOffset(Index);
		if (!Offset)
		return;

		MachineOperand *OffsetOp =
		TII->getNamedOperand(*MI, AMDGPU::OpName::offset);
		int64_t NewOffset = Offset + OffsetOp->getImm();
		if (TII->isLegalFLATOffset(NewOffset, AMDGPUAS::PRIVATE_ADDRESS,
		true)) {
		OffsetOp->setImm(NewOffset);
		if (FrameReg)
		return;
		Offset = 0;
		}

		Register SReg = AMDGPU::SGPR_NULL;
		// On GFX10 we have NULL register to use here.
		// Otherwise we need to materialize 0 into an SGPR.
		if (Offset || !ST.hasSGPRNull()) {
		SReg = RS->scavengeRegister(&AMDGPU::SReg_32_XM0RegClass, MI, 0);
		if (FrameReg)
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), SReg)
		.addReg(FrameReg)
		.addImm(Offset);
		else
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), SReg)
		.addImm(Offset);
		}
		FIOp.ChangeToRegister(SReg, false, false, true);
		return;
		}

		if (ST.enableFlatScratch()) {
		int64_t Offset = FrameInfo.getObjectOffset(Index);
		if (!FrameReg) {
		FIOp.ChangeToImmediate(Offset);
		if (TII->isImmOperandLegal(*MI, FIOperandNum, FIOp))
		return;
		}
		Flakebi (inline comment, Not Done): I get a failing assert here with `NewOpc = 4294967295`:
		llvm/include/llvm/MC/MCInstrInfo.h:63: const llvm::MCInstrDesc &llvm::MCInstrInfo::get(unsigned int) const: Assertion `Opcode < NumOpcodes && "Invalid opcode!"' failed.
		PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
		Stack dump:
		0. Program arguments: compiler/llpc/amdllpc -gfxip=10.1 -amdgpu-enable-flat-scratch /pipelines/PipelineVsFs_0x1BEFB7D1A235B4F6.pipe -verify-machineinstrs
		1. Running pass 'CallGraph Pass Manager' on module 'lgcPipeline'.
		2. Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@_amdgpu_ps_main'
		#0 0x00000000023f0db1 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /llvm/lib/Support/Unix/Signals.inc:563:13
		#1 0x00000000023ef060 llvm::sys::RunSignalHandlers() /llvm/lib/Support/Signals.cpp:72:18
		#2 0x00000000023f1152 SignalHandler(int) /llvm/lib/Support/Unix/Signals.inc:0:3
		#3 0x00007fadd6ebfee0 __restore_rt (/glibc-2.31/lib/libpthread.so.0+0x12ee0)
		#4 0x00007fadd6d0c08a raise (/glibc-2.31/lib/libc.so.6+0x3808a)
		#5 0x00007fadd6cf6528 abort (/glibc-2.31/lib/libc.so.6+0x22528)
		#6 0x00007fadd6cf640f _nl_load_domain.cold.0 (/glibc-2.31/lib/libc.so.6+0x2240f)
		#7 0x00007fadd6d04a02 (/glibc-2.31/lib/libc.so.6+0x30a02)
		#8 0x0000000001a03170 llvm::SIRegisterInfo::eliminateFrameIndex(llvm::MachineInstrBundleIterator<llvm::MachineInstr, false>, int, unsigned int, llvm::RegScavenger) const /llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1465:11
		#9 0x000000000214e0f3 (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineBasicBlock, llvm::MachineFunction&, int&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:0:11
		#10 0x000000000214caef llvm::MachineBasicBlock::getNumber() const /llvm/include/llvm/CodeGen/MachineBasicBlock.h:904:34
		#11 0x000000000214caef (anonymous namespace)::PEI::replaceFrameIndices(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:1161:17
		#12 0x000000000214caef (anonymous namespace)::PEI::runOnMachineFunction(llvm::MachineFunction&) /llvm/lib/CodeGen/PrologEpilogInserter.cpp:269:3
		#13 0x0000000002031e7e llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /llvm/lib/CodeGen/MachineFunctionPass.cpp:0:13
		#14 0x0000000003136a85 llvm::FPPassManager::runOnFunction(llvm::Function&) /llvm/lib/IR/LegacyPassManager.cpp:1519:27
		#15 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunPassOnSCC(llvm::Pass, llvm::CallGraphSCC&, llvm::CallGraph&, bool&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:178:25
		#16 0x0000000001c76b38 (anonymous namespace)::CGPassManager::RunAllPassesOnSCC(llvm::CallGraphSCC&, llvm::CallGraph&, bool&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:476:9
		#17 0x0000000001c76b38 (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) /llvm/lib/Analysis/CallGraphSCCPass.cpp:541:18
		#18 0x0000000003137149 (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:0:27
		#19 0x0000000003137149 llvm::legacy::PassManagerImpl::run(llvm::Module&) /llvm/lib/IR/LegacyPassManager.cpp:615:44
		…
		rampitec (author, inline comment, Done): I cannot reproduce this. Keep in mind that D89424 is not updated to use ST mode yet, so they do not work together yet.

		// We need to use register here. Check if we can use an SGPR or need
		// a VGPR.
		FIOp.ChangeToRegister(AMDGPU::M0, false);
		bool UseSGPR = TII->isOperandLegal(*MI, FIOperandNum, &FIOp);
		const TargetRegisterClass *RC = UseSGPR ? &AMDGPU::SReg_32_XM0RegClass
		: &AMDGPU::VGPR_32RegClass;

		if (!Offset && UseSGPR) {
		FIOp.setReg(FrameReg);
		return;
		}

		Register TmpReg = RS->scavengeRegister(RC, MI, 0);
		FIOp.setReg(TmpReg);
		FIOp.setIsKill(true);

		if (!Offset || !FrameReg) {
		unsigned Opc = UseSGPR ? AMDGPU::S_MOV_B32 : AMDGPU::V_MOV_B32_e32;
		auto MIB = BuildMI(*MBB, MI, DL, TII->get(Opc), TmpReg);
		if (FrameReg)
		MIB.addReg(FrameReg);
		if (Offset)
		MIB.addImm(Offset);
		return;
		}

		Register TmpSReg = UseSGPR ? TmpReg
		: RS->scavengeRegister(&AMDGPU::SReg_32_XM0RegClass, MI, 0);
		arsenm (inline comment, Done): What happens if this needs an SGPR spill?
		rampitec (author, inline comment, Done): If it can scavenge it, it shall be fine, as the offset shall not change. If not, I guess I would need to adjust SP and revert it. I have added a FIXME here.
		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), TmpSReg)
		.addReg(FrameReg)
		.addImm(Offset);

		if (UseSGPR)
		return;

		BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_MOV_B32_e32), TmpReg)
		.addReg(TmpSReg, RegState::Kill);

		return;
		}

bool IsMUBUF = TII->isMUBUF(*MI);		bool IsMUBUF = TII->isMUBUF(*MI);

if (!IsMUBUF && !MFI->isEntryFunction()) {		if (!IsMUBUF && !MFI->isEntryFunction()) {
// Convert to a swizzled stack address by scaling by the wave size.		// Convert to a swizzled stack address by scaling by the wave size.
//		//
// In an entry function/kernel the offset is already swizzled.		// In an entry function/kernel the offset is already swizzled.

bool IsCopy = MI->getOpcode() == AMDGPU::V_MOV_B32_e32;		bool IsCopy = MI->getOpcode() == AMDGPU::V_MOV_B32_e32;
[645 lines not shown]
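
The FLAT scratch branch of the frame index elimination above first tries to fold the object offset into the instruction's immediate field, and only falls back to an SGPR when that is not possible. The following is a minimal standalone sketch of that decision structure, not the code from the patch. `isLegalOffsetSketch`, `FoldResult` and `foldFrameOffset` are hypothetical names, and the signed 13-bit range is an assumed stand-in for `TII->isLegalFLATOffset()`, whose real range depends on the subtarget.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for TII->isLegalFLATOffset(). The real legal range
// depends on the subtarget; the signed 13-bit range here is an assumption.
bool isLegalOffsetSketch(int64_t Offset) {
  return Offset >= -4096 && Offset <= 4095;
}

// Outcome of rewriting one FLAT scratch frame-index operand (hypothetical type).
struct FoldResult {
  int64_t ImmOffset;     // value left in the instruction's offset field
  int64_t RegOffset;     // value that still has to be placed in an SGPR
  bool NeedsScratchSGPR; // true if an SGPR has to be scavenged
};

FoldResult foldFrameOffset(int64_t ObjectOffset, int64_t InstOffset,
                           bool HasFrameReg, bool HasSGPRNull) {
  // A zero object offset leaves the instruction's immediate untouched.
  if (ObjectOffset == 0)
    return {InstOffset, 0, false};

  int64_t Imm = InstOffset;
  int64_t Remaining = ObjectOffset;
  int64_t NewOffset = ObjectOffset + InstOffset;
  if (isLegalOffsetSketch(NewOffset)) {
    // The whole offset fits in the immediate field.
    Imm = NewOffset;
    if (HasFrameReg)
      return {Imm, 0, false};
    Remaining = 0;
  }

  // Either a nonzero remainder has to be materialized, or there is no frame
  // register and a zero has to be, unless a NULL SGPR is available.
  bool NeedsSGPR = Remaining != 0 || !HasSGPRNull;
  assert((Imm == InstOffset || isLegalOffsetSketch(Imm)) &&
         "immediate is either untouched or legal");
  return {Imm, Remaining, NeedsSGPR};
}
```

With these assumptions, an object at offset 8 accessed with an instruction offset of 4 and a frame register folds entirely into the immediate (12) and needs no extra register, while an offset of 100000 is left to the SGPR path.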

llvm/test/CodeGen/AMDGPU/call-preserved-registers.ll

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MUBUF %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MUBUF %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-ipra=0 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MUBUF %s
		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-ipra=0 -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,FLATSCR %s

declare hidden void @external_void_func_void() #0		declare hidden void @external_void_func_void() #0

; GCN-LABEL: {{^}}test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void:		; GCN-LABEL: {{^}}test_kernel_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void:
; GCN: s_getpc_b64 s[34:35]		; GCN: s_getpc_b64 s[34:35]
; GCN-NEXT: s_add_u32 s34, s34,		; GCN-NEXT: s_add_u32 s34, s34,
; GCN-NEXT: s_addc_u32 s35, s35,		; GCN-NEXT: s_addc_u32 s35, s35,
; GCN-NEXT: s_mov_b32 s32, 0		; GCN-NEXT: s_mov_b32 s32, 0
[36 lines not shown]	define void @test_func_call_external_void_func_void_clobber_s30_s31_call_external_void_func_void() #0 {
ret void		ret void
}		}

; GCN-LABEL: {{^}}test_func_call_external_void_funcx2:		; GCN-LABEL: {{^}}test_func_call_external_void_funcx2:
; GCN: buffer_store_dword v40		; GCN: buffer_store_dword v40
; GCN: v_writelane_b32 v40, s33, 4		; GCN: v_writelane_b32 v40, s33, 4

; GCN: s_mov_b32 s33, s32		; GCN: s_mov_b32 s33, s32
; GCN: s_add_u32 s32, s32, 0x400		; MUBUF: s_add_u32 s32, s32, 0x400
		; FLATSCR: s_add_u32 s32, s32, 16
; GCN: s_swappc_b64		; GCN: s_swappc_b64
; GCN-NEXT: s_swappc_b64		; GCN-NEXT: s_swappc_b64

; GCN: v_readlane_b32 s33, v40, 4		; GCN: v_readlane_b32 s33, v40, 4
; GCN: buffer_load_dword v40,		; GCN: buffer_load_dword v40,
define void @test_func_call_external_void_funcx2() #0 {		define void @test_func_call_external_void_funcx2() #0 {
call void @external_void_func_void()		call void @external_void_func_void()
call void @external_void_func_void()		call void @external_void_func_void()
[261 lines not shown]
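
The MUBUF and FLATSCR check lines above differ only in the stack pointer increment: `0x400` with MUBUF against `16` with flat scratch. This follows from the summary: the MUBUF stack pointer is unswizzled, so a per-lane size is scaled by the wavefront size, while the flat scratch stack pointer is swizzled and carries the per-lane value directly. A minimal sketch of that relationship, assuming a wavefront size of 64 as on gfx900; `unswizzledFromPerLane` is a hypothetical helper and not part of LLVM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not LLVM API): converts a per-lane (swizzled) byte
// offset into the wave-wide (unswizzled) offset used with MUBUF stack access.
uint32_t unswizzledFromPerLane(uint32_t PerLaneBytes, uint32_t WaveSize) {
  assert((WaveSize == 32 || WaveSize == 64) && "unexpected wave size");
  return PerLaneBytes * WaveSize;
}
```

For the test above, `unswizzledFromPerLane(16, 64)` gives `0x400`, matching the MUBUF check line for the same 16-byte frame.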

llvm/test/CodeGen/AMDGPU/callee-frame-setup.ll

; RUN: llc -march=amdgcn -mcpu=hawaii -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN -check-prefix=CI %s		; RUN: llc -march=amdgcn -mcpu=hawaii -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CI,MUBUF %s
; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN -check-prefix=GFX9 %s		; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9,MUBUF %s
		; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -amdgpu-enable-flat-scratch < %s | FileCheck -enable-var-scope -check-prefixes=GCN,FLATSCR %s

; GCN-LABEL: {{^}}callee_no_stack:		; GCN-LABEL: {{^}}callee_no_stack:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_no_stack() #0 {		define void @callee_no_stack() #0 {
ret void		ret void
}		}
[16 lines not shown]
define void @callee_no_stack_no_fp_elim_nonleaf() #2 {		define void @callee_no_stack_no_fp_elim_nonleaf() #2 {
ret void		ret void
}		}

; GCN-LABEL: {{^}}callee_with_stack:		; GCN-LABEL: {{^}}callee_with_stack:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}		; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}
; GCN-NEXT: buffer_store_dword v0, off, s[0:3], s32{{$}}		; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32{{$}}
		; FLATSCR-NEXT: scratch_store_dword off, v0, s32
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_with_stack() #0 {		define void @callee_with_stack() #0 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
ret void		ret void
}		}

; Can use free call clobbered register to preserve original FP value.		; Can use free call clobbered register to preserve original FP value.

; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_all:		; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_all:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_mov_b32 s4, s33		; GCN-NEXT: s_mov_b32 s4, s33
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32
; GCN-NEXT: s_add_u32 s32, s32, 0x200		; MUBUF-NEXT: s_add_u32 s32, s32, 0x200
		; FLATSCR-NEXT: s_add_u32 s32, s32, 8
; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}		; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}
; GCN-NEXT: buffer_store_dword v0, off, s[0:3], s33 offset:4{{$}}		; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s33 offset:4{{$}}
; GCN-NEXT: s_sub_u32 s32, s32, 0x200		; FLATSCR-NEXT: scratch_store_dword off, v0, s33 offset:4{{$}}
		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x200
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 8
; GCN-NEXT: s_mov_b32 s33, s4		; GCN-NEXT: s_mov_b32 s33, s4
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_with_stack_no_fp_elim_all() #1 {		define void @callee_with_stack_no_fp_elim_all() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
ret void		ret void
}		}

; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_non_leaf:		; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_non_leaf:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}		; GCN-NEXT: v_mov_b32_e32 v0, 0{{$}}
; GCN-NEXT: buffer_store_dword v0, off, s[0:3], s32{{$}}		; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32{{$}}
		; FLATSCR-NEXT: scratch_store_dword off, v0, s32{{$}}
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_with_stack_no_fp_elim_non_leaf() #2 {		define void @callee_with_stack_no_fp_elim_non_leaf() #2 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
ret void		ret void
}		}

; GCN-LABEL: {{^}}callee_with_stack_and_call:		; GCN-LABEL: {{^}}callee_with_stack_and_call:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 offset:4 ; 4-byte Folded Spill		; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 offset:4 ; 4-byte Folded Spill
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]
; GCN: v_writelane_b32 [[CSR_VGPR]], s33, 2		; GCN: v_writelane_b32 [[CSR_VGPR]], s33, 2
; GCN-DAG: s_mov_b32 s33, s32		; GCN-DAG: s_mov_b32 s33, s32
; GCN-DAG: s_add_u32 s32, s32, 0x400{{$}}		; MUBUF-DAG: s_add_u32 s32, s32, 0x400{{$}}
		; FLATSCR-DAG: s_add_u32 s32, s32, 16{{$}}
; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}		; GCN-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s30,		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s30,
; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31,		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31,

; GCN-DAG: buffer_store_dword [[ZERO]], off, s[0:3], s33{{$}}		; MUBUF-DAG: buffer_store_dword [[ZERO]], off, s[0:3], s33{{$}}
		; FLATSCR-DAG: scratch_store_dword off, [[ZERO]], s33{{$}}

; GCN: s_swappc_b64		; GCN: s_swappc_b64

; GCN-DAG: v_readlane_b32 s5, [[CSR_VGPR]]		; GCN-DAG: v_readlane_b32 s5, [[CSR_VGPR]]
; GCN-DAG: v_readlane_b32 s4, [[CSR_VGPR]]		; GCN-DAG: v_readlane_b32 s4, [[CSR_VGPR]]

; GCN: s_sub_u32 s32, s32, 0x400{{$}}		; MUBUF: s_sub_u32 s32, s32, 0x400{{$}}
		; FLATSCR: s_sub_u32 s32, s32, 16{{$}}
; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2		; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 offset:4 ; 4-byte Folded Reload		; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 offset:4 ; 4-byte Folded Reload
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)

; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_with_stack_and_call() #0 {		define void @callee_with_stack_and_call() #0 {
[9 lines not shown]
; There is stack usage only because of the need to evict a VGPR for		; There is stack usage only because of the need to evict a VGPR for
; spilling CSR SGPRs.		; spilling CSR SGPRs.

; GCN-LABEL: {{^}}callee_no_stack_with_call:		; GCN-LABEL: {{^}}callee_no_stack_with_call:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 ; 4-byte Folded Spill		; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 ; 4-byte Folded Spill
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]
; GCN-DAG: s_add_u32 s32, s32, 0x400		; MUBUF-DAG: s_add_u32 s32, s32, 0x400
		; FLATSCR-DAG: s_add_u32 s32, s32, 16
; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s33, [[FP_SPILL_LANE:[0-9]+]]		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s33, [[FP_SPILL_LANE:[0-9]+]]

; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s30, 0		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s30, 0
; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1
; GCN: s_swappc_b64		; GCN: s_swappc_b64

; GCN-DAG: v_readlane_b32 s4, v40, 0		; GCN-DAG: v_readlane_b32 s4, v40, 0
; GCN-DAG: v_readlane_b32 s5, v40, 1		; GCN-DAG: v_readlane_b32 s5, v40, 1

; GCN: s_sub_u32 s32, s32, 0x400		; MUBUF: s_sub_u32 s32, s32, 0x400
		; FLATSCR: s_sub_u32 s32, s32, 16
; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], [[FP_SPILL_LANE]]		; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], [[FP_SPILL_LANE]]
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 ; 4-byte Folded Reload		; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 ; 4-byte Folded Reload
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_no_stack_with_call() #0 {		define void @callee_no_stack_with_call() #0 {
call void @external_void_func_void()		call void @external_void_func_void()
[62 lines not shown]

; TODO: Can the SP inc/deec be remvoed?		; TODO: Can the SP inc/deec be remvoed?
; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_csr_vgpr:		; GCN-LABEL: {{^}}callee_with_stack_no_fp_elim_csr_vgpr:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT:s_mov_b32 [[FP_COPY:s[0-9]+]], s33		; GCN-NEXT:s_mov_b32 [[FP_COPY:s[0-9]+]], s33
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32
; GCN: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0		; GCN: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
; GCN-DAG: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill		; GCN-DAG: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
; GCN-DAG: buffer_store_dword [[ZERO]], off, s[0:3], s33 offset:8		; MUBUF-DAG: buffer_store_dword [[ZERO]], off, s[0:3], s33 offset:8
		; FLATSCR-DAG: scratch_store_dword off, [[ZERO]], s33 offset:8

; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART
; GCN-NEXT: ; clobber v41		; GCN-NEXT: ; clobber v41
; GCN-NEXT: ;;#ASMEND		; GCN-NEXT: ;;#ASMEND

; GCN: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload		; GCN: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
; GCN: s_add_u32 s32, s32, 0x300		; MUBUF: s_add_u32 s32, s32, 0x300
; GCN-NEXT: s_sub_u32 s32, s32, 0x300		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x300
		; FLATSCR: s_add_u32 s32, s32, 12
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 12
; GCN-NEXT: s_mov_b32 s33, s4		; GCN-NEXT: s_mov_b32 s33, s4
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @callee_with_stack_no_fp_elim_csr_vgpr() #1 {		define void @callee_with_stack_no_fp_elim_csr_vgpr() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
call void asm sideeffect "; clobber v41", "~{v41}"()		call void asm sideeffect "; clobber v41", "~{v41}"()
ret void		ret void
}		}

; Use a copy to a free SGPR instead of introducing a second CSR VGPR.		; Use a copy to a free SGPR instead of introducing a second CSR VGPR.
; GCN-LABEL: {{^}}last_lane_vgpr_for_fp_csr:		; GCN-LABEL: {{^}}last_lane_vgpr_for_fp_csr:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: v_writelane_b32 v1, s33, 63		; GCN-NEXT: v_writelane_b32 v1, s33, 63
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32
; GCN: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill		; GCN: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
; GCN-COUNT-63: v_writelane_b32 v1		; GCN-COUNT-63: v_writelane_b32 v1
; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33 offset:8		; MUBUF: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33 offset:8
		; FLATSCR: scratch_store_dword off, v{{[0-9]+}}, s33 offset:8
; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART
; GCN-COUNT-63: v_readlane_b32 s{{[0-9]+}}, v1		; GCN-COUNT-63: v_readlane_b32 s{{[0-9]+}}, v1

; GCN: s_add_u32 s32, s32, 0x300		; MUBUF: s_add_u32 s32, s32, 0x300
; GCN-NEXT: s_sub_u32 s32, s32, 0x300		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x300
		; FLATSCR: s_add_u32 s32, s32, 12
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 12
; GCN-NEXT: v_readlane_b32 s33, v1, 63		; GCN-NEXT: v_readlane_b32 s33, v1, 63
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @last_lane_vgpr_for_fp_csr() #1 {		define void @last_lane_vgpr_for_fp_csr() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
call void asm sideeffect "; clobber v41", "~{v41}"()		call void asm sideeffect "; clobber v41", "~{v41}"()
call void asm sideeffect "",		call void asm sideeffect "",
[11 lines not shown]
; Use a copy to a free SGPR instead of introducing a second CSR VGPR.		; Use a copy to a free SGPR instead of introducing a second CSR VGPR.
; GCN-LABEL: {{^}}no_new_vgpr_for_fp_csr:		; GCN-LABEL: {{^}}no_new_vgpr_for_fp_csr:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: s_mov_b32 [[FP_COPY:s[0-9]+]], s33		; GCN-NEXT: s_mov_b32 [[FP_COPY:s[0-9]+]], s33
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32
; GCN-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill		; GCN-NEXT: buffer_store_dword v41, off, s[0:3], s33 ; 4-byte Folded Spill
; GCN-COUNT-64: v_writelane_b32 v1,		; GCN-COUNT-64: v_writelane_b32 v1,

; GCN: buffer_store_dword		; MUBUF: buffer_store_dword
		; FLATSCR: scratch_store_dword
; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART
; GCN-COUNT-64: v_readlane_b32 s{{[0-9]+}}, v1		; GCN-COUNT-64: v_readlane_b32 s{{[0-9]+}}, v1

; GCN: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload		; GCN: buffer_load_dword v41, off, s[0:3], s33 ; 4-byte Folded Reload
; GCN: s_add_u32 s32, s32, 0x300		; MUBUF: s_add_u32 s32, s32, 0x300
; GCN-NEXT: s_sub_u32 s32, s32, 0x300		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x300
		; FLATSCR: s_add_u32 s32, s32, 12
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 12
; GCN-NEXT: s_mov_b32 s33, [[FP_COPY]]		; GCN-NEXT: s_mov_b32 s33, [[FP_COPY]]
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @no_new_vgpr_for_fp_csr() #1 {		define void @no_new_vgpr_for_fp_csr() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
call void asm sideeffect "; clobber v41", "~{v41}"()		call void asm sideeffect "; clobber v41", "~{v41}"()
call void asm sideeffect "",		call void asm sideeffect "",
"~{s39},~{s40},~{s41},~{s42},~{s43},~{s44},~{s45},~{s46},~{s47},~{s48},~{s49}		"~{s39},~{s40},~{s41},~{s42},~{s43},~{s44},~{s45},~{s46},~{s47},~{s48},~{s49}
,~{s50},~{s51},~{s52},~{s53},~{s54},~{s55},~{s56},~{s57},~{s58},~{s59}		,~{s50},~{s51},~{s52},~{s53},~{s54},~{s55},~{s56},~{s57},~{s58},~{s59}
,~{s60},~{s61},~{s62},~{s63},~{s64},~{s65},~{s66},~{s67},~{s68},~{s69}		,~{s60},~{s61},~{s62},~{s63},~{s64},~{s65},~{s66},~{s67},~{s68},~{s69}
,~{s70},~{s71},~{s72},~{s73},~{s74},~{s75},~{s76},~{s77},~{s78},~{s79}		,~{s70},~{s71},~{s72},~{s73},~{s74},~{s75},~{s76},~{s77},~{s78},~{s79}
,~{s80},~{s81},~{s82},~{s83},~{s84},~{s85},~{s86},~{s87},~{s88},~{s89}		,~{s80},~{s81},~{s82},~{s83},~{s84},~{s85},~{s86},~{s87},~{s88},~{s89}
,~{s90},~{s91},~{s92},~{s93},~{s94},~{s95},~{s96},~{s97},~{s98},~{s99}		,~{s90},~{s91},~{s92},~{s93},~{s94},~{s95},~{s96},~{s97},~{s98},~{s99}
,~{s100},~{s101},~{s102}"() #1		,~{s100},~{s101},~{s102}"() #1

ret void		ret void
}		}

; GCN-LABEL: {{^}}realign_stack_no_fp_elim:		; GCN-LABEL: {{^}}realign_stack_no_fp_elim:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: s_add_u32 [[SCRATCH:s[0-9]+]], s32, 0x7ffc0		; MUBUF-NEXT: s_add_u32 [[SCRATCH:s[0-9]+]], s32, 0x7ffc0
		; FLATSCR-NEXT: s_add_u32 [[SCRATCH:s[0-9]+]], s32, 0x1fff
; GCN-NEXT: s_mov_b32 s4, s33		; GCN-NEXT: s_mov_b32 s4, s33
; GCN-NEXT: s_and_b32 s33, [[SCRATCH]], 0xfff80000		; MUBUF-NEXT: s_and_b32 s33, [[SCRATCH]], 0xfff80000
; GCN-NEXT: s_add_u32 s32, s32, 0x100000		; FLATSCR-NEXT: s_and_b32 s33, [[SCRATCH]], 0xffffe000
		; MUBUF-NEXT: s_add_u32 s32, s32, 0x100000
		; FLATSCR-NEXT: s_add_u32 s32, s32, 0x4000
; GCN-NEXT: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0		; GCN-NEXT: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
; GCN-NEXT: buffer_store_dword [[ZERO]], off, s[0:3], s33		; MUBUF-NEXT: buffer_store_dword [[ZERO]], off, s[0:3], s33
; GCN-NEXT: s_sub_u32 s32, s32, 0x100000		; FLATSCR-NEXT: scratch_store_dword off, [[ZERO]], s33
		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x100000
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 0x4000
; GCN-NEXT: s_mov_b32 s33, s4		; GCN-NEXT: s_mov_b32 s33, s4
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @realign_stack_no_fp_elim() #1 {		define void @realign_stack_no_fp_elim() #1 {
%alloca = alloca i32, align 8192, addrspace(5)		%alloca = alloca i32, align 8192, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
ret void		ret void
}		}

; GCN-LABEL: {{^}}no_unused_non_csr_sgpr_for_fp:		; GCN-LABEL: {{^}}no_unused_non_csr_sgpr_for_fp:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: v_writelane_b32 v1, s33, 2		; GCN-NEXT: v_writelane_b32 v1, s33, 2
; GCN-NEXT: v_writelane_b32 v1, s30, 0		; GCN-NEXT: v_writelane_b32 v1, s30, 0
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32
; GCN: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0		; GCN: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
; GCN: v_writelane_b32 v1, s31, 1		; GCN: v_writelane_b32 v1, s31, 1
; GCN: buffer_store_dword [[ZERO]], off, s[0:3], s33 offset:4		; MUBUF: buffer_store_dword [[ZERO]], off, s[0:3], s33 offset:4
		; FLATSCR: scratch_store_dword off, [[ZERO]], s33 offset:4
; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART
; GCN: v_readlane_b32 s4, v1, 0		; GCN: v_readlane_b32 s4, v1, 0
; GCN-NEXT: s_add_u32 s32, s32, 0x200		; MUBUF-NEXT: s_add_u32 s32, s32, 0x200
		; FLATSCR-NEXT: s_add_u32 s32, s32, 8
; GCN-NEXT: v_readlane_b32 s5, v1, 1		; GCN-NEXT: v_readlane_b32 s5, v1, 1
; GCN-NEXT: s_sub_u32 s32, s32, 0x200		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x200
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 8
; GCN-NEXT: v_readlane_b32 s33, v1, 2		; GCN-NEXT: v_readlane_b32 s33, v1, 2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64 s[4:5]		; GCN-NEXT: s_setpc_b64 s[4:5]
define void @no_unused_non_csr_sgpr_for_fp() #1 {		define void @no_unused_non_csr_sgpr_for_fp() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca

; Use all clobberable registers, so FP has to spill to a VGPR.		; Use all clobberable registers, so FP has to spill to a VGPR.
[12 lines not shown]
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 offset:8 ; 4-byte Folded Spill		; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], off, s[0:3], s32 offset:8 ; 4-byte Folded Spill
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]
; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s33, 2		; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s33, 2
; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s30, 0		; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s30, 0
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-NEXT: s_mov_b32 s33, s32

; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1
; GCN-DAG: buffer_store_dword		; MUBUF-DAG: buffer_store_dword
; GCN: s_add_u32 s32, s32, 0x300{{$}}		; FLATSCR-DAG: scratch_store_dword
		; MUBUF: s_add_u32 s32, s32, 0x300{{$}}
		; FLATSCR: s_add_u32 s32, s32, 12{{$}}

; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART

; GCN: v_readlane_b32 s4, [[CSR_VGPR]], 0		; GCN: v_readlane_b32 s4, [[CSR_VGPR]], 0
; GCN-NEXT: v_readlane_b32 s5, [[CSR_VGPR]], 1		; GCN-NEXT: v_readlane_b32 s5, [[CSR_VGPR]], 1
; GCN-NEXT: s_sub_u32 s32, s32, 0x300{{$}}		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x300{{$}}
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 12{{$}}
; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2		; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 offset:8 ; 4-byte Folded Reload		; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], off, s[0:3], s32 offset:8 ; 4-byte Folded Reload
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @no_unused_non_csr_sgpr_for_fp_no_scratch_vgpr() #1 {		define void @no_unused_non_csr_sgpr_for_fp_no_scratch_vgpr() #1 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
[19 lines not shown]
; register is needed to access the CSR VGPR slot.		; register is needed to access the CSR VGPR slot.
; GCN-LABEL: {{^}}scratch_reg_needed_mubuf_offset:		; GCN-LABEL: {{^}}scratch_reg_needed_mubuf_offset:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC0:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: v_mov_b32_e32 [[SCRATCH_VGPR:v[0-9]+]], 0x1008		; GCN-NEXT: v_mov_b32_e32 [[SCRATCH_VGPR:v[0-9]+]], 0x1008
; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], [[SCRATCH_VGPR]], s[0:3], s32 offen ; 4-byte Folded Spill		; GCN-NEXT: buffer_store_dword [[CSR_VGPR:v[0-9]+]], [[SCRATCH_VGPR]], s[0:3], s32 offen ; 4-byte Folded Spill
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC0]]
; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s33, 2		; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s33, 2
; GCN-NEXT: v_writelane_b32 [[CSR_VGPR]], s30, 0		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s30, 0
; GCN-NEXT: s_mov_b32 s33, s32		; GCN-DAG: s_mov_b32 s33, s32
; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1		; GCN-DAG: v_writelane_b32 [[CSR_VGPR]], s31, 1
; GCN-DAG: s_add_u32 s32, s32, 0x40300{{$}}		; MUBUF-DAG: s_add_u32 s32, s32, 0x40300{{$}}
; GCN-DAG: buffer_store_dword		; FLATSCR-DAG: s_add_u32 s32, s32, 0x100c{{$}}
		; MUBUF-DAG: buffer_store_dword
		; FLATSCR-DAG: scratch_store_dword

; GCN: ;;#ASMSTART		; GCN: ;;#ASMSTART

; GCN: v_readlane_b32 s4, [[CSR_VGPR]], 0		; GCN: v_readlane_b32 s4, [[CSR_VGPR]], 0
; GCN-NEXT: v_readlane_b32 s5, [[CSR_VGPR]], 1		; GCN-NEXT: v_readlane_b32 s5, [[CSR_VGPR]], 1
; GCN-NEXT: s_sub_u32 s32, s32, 0x40300{{$}}		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x40300{{$}}
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 0x100c{{$}}
; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2		; GCN-NEXT: v_readlane_b32 s33, [[CSR_VGPR]], 2
; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}		; GCN-NEXT: s_or_saveexec_b64 [[COPY_EXEC1:s\[[0-9]+:[0-9]+\]]], -1{{$}}
; GCN-NEXT: v_mov_b32_e32 [[SCRATCH_VGPR:v[0-9]+]], 0x1008		; GCN-NEXT: v_mov_b32_e32 [[SCRATCH_VGPR:v[0-9]+]], 0x1008
; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], [[SCRATCH_VGPR]], s[0:3], s32 offen ; 4-byte Folded Reload		; GCN-NEXT: buffer_load_dword [[CSR_VGPR]], [[SCRATCH_VGPR]], s[0:3], s32 offen ; 4-byte Folded Reload
; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]		; GCN-NEXT: s_mov_b64 exec, [[COPY_EXEC1]]
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @scratch_reg_needed_mubuf_offset([4096 x i8] addrspace(5)* byval align 4 %arg) #1 {		define void @scratch_reg_needed_mubuf_offset([4096 x i8] addrspace(5)* byval align 4 %arg) #1 {
[24 lines not shown]	define internal void @local_empty_func() #0 {
ret void		ret void
}		}

; An FP is needed, despite not needing any spills		; An FP is needed, despite not needing any spills
; TODO: Could see callee does not use stack and omit FP.		; TODO: Could see callee does not use stack and omit FP.
; GCN-LABEL: {{^}}ipra_call_with_stack:		; GCN-LABEL: {{^}}ipra_call_with_stack:
; GCN: s_mov_b32 [[FP_COPY:s[0-9]+]], s33		; GCN: s_mov_b32 [[FP_COPY:s[0-9]+]], s33
; GCN: s_mov_b32 s33, s32		; GCN: s_mov_b32 s33, s32
; GCN: s_add_u32 s32, s32, 0x400		; MUBUF: s_add_u32 s32, s32, 0x400
; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33{{$}}		; FLATSCR: s_add_u32 s32, s32, 16
		; MUBUF: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33{{$}}
		; FLATSCR: scratch_store_dword off, v{{[0-9]+}}, s33{{$}}
; GCN: s_swappc_b64		; GCN: s_swappc_b64
; GCN: s_sub_u32 s32, s32, 0x400		; MUBUF: s_sub_u32 s32, s32, 0x400
		; FLATSCR: s_sub_u32 s32, s32, 16
; GCN: s_mov_b32 s33, [[FP_COPY:s[0-9]+]]		; GCN: s_mov_b32 s33, [[FP_COPY:s[0-9]+]]
define void @ipra_call_with_stack() #0 {		define void @ipra_call_with_stack() #0 {
%alloca = alloca i32, addrspace(5)		%alloca = alloca i32, addrspace(5)
store volatile i32 0, i32 addrspace(5)* %alloca		store volatile i32 0, i32 addrspace(5)* %alloca
call void @local_empty_func()		call void @local_empty_func()
ret void		ret void
}		}

[102 lines not shown]
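
The paired MUBUF and FLATSCR immediates in the frame-setup checks above differ by a constant factor of 64. That is what one would expect if the unswizzled SP counts bytes for a whole 64-lane wave while the swizzled SP counts bytes per lane. The Python sketch below only re-checks that arithmetic against constants copied from the CHECK lines. The wave size of 64 and the helper names `unswizzled` and `realign_constants` are assumptions made for illustration; none of this is code from the patch.

```python
# Assumption: 64 lanes per wave. This is inferred from the constants in the
# CHECK lines, not stated in the test itself.
WAVE_SIZE = 64


def unswizzled(per_lane_bytes: int, wave_size: int = WAVE_SIZE) -> int:
    """Scale a per-lane (swizzled) byte count to a whole-wave byte count."""
    return per_lane_bytes * wave_size


def realign_constants(align: int, scale: int = 1) -> tuple[int, int]:
    """Return the (add, mask) immediates that round a 32-bit SP up to
    `align` bytes per lane, with every quantity multiplied by `scale`."""
    add = (align - 1) * scale
    mask = (-(align * scale)) & 0xFFFFFFFF
    return add, mask


# (FLATSCR immediate, MUBUF immediate) pairs copied from the s_add_u32 /
# s_sub_u32 check lines.
PAIRS = [
    (8, 0x200),
    (12, 0x300),
    (16, 0x400),
    (0x4000, 0x100000),
    (0x100C, 0x40300),
]

for flat, mubuf in PAIRS:
    assert unswizzled(flat) == mubuf

# realign_stack_no_fp_elim uses an alloca with align 8192.
assert realign_constants(8192) == (0x1FFF, 0xFFFFE000)             # FLATSCR
assert realign_constants(8192, WAVE_SIZE) == (0x7FFC0, 0xFFF80000)  # MUBUF
```

For the `align 8192` alloca in realign_stack_no_fp_elim, the same scaling accounts for both the s_add_u32 and the s_and_b32 immediates on each side of the diff.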

llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX900 %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX900 %s
		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-enable-flat-scratch < %s | FileCheck -check-prefixes=GCN,FLATSCR %s

define <2 x half> @chain_hi_to_lo_private() {		define <2 x half> @chain_hi_to_lo_private() {
; GCN-LABEL: chain_hi_to_lo_private:		; GFX900-LABEL: chain_hi_to_lo_private:
; GCN: ; %bb.0: ; %bb		; GFX900: ; %bb.0: ; %bb
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:2		; GFX900-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_short_d16_hi v0, off, s[0:3], 0		; GFX900-NEXT: buffer_load_short_d16_hi v0, off, s[0:3], 0
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64 s[30:31]		; GFX900-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: chain_hi_to_lo_private:
		; FLATSCR: ; %bb.0: ; %bb
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: s_mov_b32 s4, 2
		; FLATSCR-NEXT: scratch_load_ushort v0, off, s4
		; FLATSCR-NEXT: s_mov_b32 s4, 0
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: scratch_load_short_d16_hi v0, off, s4
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]
bb:		bb:
%gep_lo = getelementptr inbounds half, half addrspace(5)* null, i64 1		%gep_lo = getelementptr inbounds half, half addrspace(5)* null, i64 1
%load_lo = load half, half addrspace(5)* %gep_lo		%load_lo = load half, half addrspace(5)* %gep_lo
%gep_hi = getelementptr inbounds half, half addrspace(5)* null, i64 0		%gep_hi = getelementptr inbounds half, half addrspace(5)* null, i64 0
%load_hi = load half, half addrspace(5)* %gep_hi		%load_hi = load half, half addrspace(5)* %gep_hi

%temp = insertelement <2 x half> undef, half %load_lo, i32 0		%temp = insertelement <2 x half> undef, half %load_lo, i32 0
%result = insertelement <2 x half> %temp, half %load_hi, i32 1		%result = insertelement <2 x half> %temp, half %load_hi, i32 1

ret <2 x half> %result		ret <2 x half> %result
}		}

define <2 x half> @chain_hi_to_lo_private_different_bases(half addrspace(5)* %base_lo, half addrspace(5)* %base_hi) {		define <2 x half> @chain_hi_to_lo_private_different_bases(half addrspace(5)* %base_lo, half addrspace(5)* %base_hi) {
; GCN-LABEL: chain_hi_to_lo_private_different_bases:		; GFX900-LABEL: chain_hi_to_lo_private_different_bases:
; GCN: ; %bb.0: ; %bb		; GFX900: ; %bb.0: ; %bb
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen		; GFX900-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_load_short_d16_hi v0, v1, s[0:3], 0 offen		; GFX900-NEXT: buffer_load_short_d16_hi v0, v1, s[0:3], 0 offen
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64 s[30:31]		; GFX900-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: chain_hi_to_lo_private_different_bases:
		; FLATSCR: ; %bb.0: ; %bb
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: scratch_load_ushort v0, v0, off
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: scratch_load_short_d16_hi v0, v1, off
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]
bb:		bb:
%load_lo = load half, half addrspace(5)* %base_lo		%load_lo = load half, half addrspace(5)* %base_lo
%load_hi = load half, half addrspace(5)* %base_hi		%load_hi = load half, half addrspace(5)* %base_hi

%temp = insertelement <2 x half> undef, half %load_lo, i32 0		%temp = insertelement <2 x half> undef, half %load_lo, i32 0
%result = insertelement <2 x half> %temp, half %load_hi, i32 1		%result = insertelement <2 x half> %temp, half %load_hi, i32 1

ret <2 x half> %result		ret <2 x half> %result
}		}

define <2 x half> @chain_hi_to_lo_arithmatic(half addrspace(5)* %base, half %in) {		define <2 x half> @chain_hi_to_lo_arithmatic(half addrspace(5)* %base, half %in) {
; GCN-LABEL: chain_hi_to_lo_arithmatic:		; GFX900-LABEL: chain_hi_to_lo_arithmatic:
; GCN: ; %bb.0: ; %bb		; GFX900: ; %bb.0: ; %bb
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: v_add_f16_e32 v1, 1.0, v1		; GFX900-NEXT: v_add_f16_e32 v1, 1.0, v1
; GCN-NEXT: buffer_load_short_d16_hi v1, v0, s[0:3], 0 offen		; GFX900-NEXT: buffer_load_short_d16_hi v1, v0, s[0:3], 0 offen
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, v1		; GFX900-NEXT: v_mov_b32_e32 v0, v1
; GCN-NEXT: s_setpc_b64 s[30:31]		; GFX900-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: chain_hi_to_lo_arithmatic:
		; FLATSCR: ; %bb.0: ; %bb
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: v_add_f16_e32 v1, 1.0, v1
		; FLATSCR-NEXT: scratch_load_short_d16_hi v1, v0, off
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v0, v1
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]
bb:		bb:
%arith_lo = fadd half %in, 1.0		%arith_lo = fadd half %in, 1.0
%load_hi = load half, half addrspace(5)* %base		%load_hi = load half, half addrspace(5)* %base

%temp = insertelement <2 x half> undef, half %arith_lo, i32 0		%temp = insertelement <2 x half> undef, half %arith_lo, i32 0
%result = insertelement <2 x half> %temp, half %load_hi, i32 1		%result = insertelement <2 x half> %temp, half %load_hi, i32 1

ret <2 x half> %result		ret <2 x half> %result
[125 lines not shown]	bb:
%temp = insertelement <2 x half> undef, half %load_lo, i32 0		%temp = insertelement <2 x half> undef, half %load_lo, i32 0
%result = insertelement <2 x half> %temp, half %load_hi, i32 1		%result = insertelement <2 x half> %temp, half %load_hi, i32 1

ret <2 x half> %result		ret <2 x half> %result
}		}

; Make sure we don't lose any of the private stores.		; Make sure we don't lose any of the private stores.
define amdgpu_kernel void @vload2_private(i16 addrspace(1)* nocapture readonly %in, <2 x i16> addrspace(1)* nocapture %out) #0 {		define amdgpu_kernel void @vload2_private(i16 addrspace(1)* nocapture readonly %in, <2 x i16> addrspace(1)* nocapture %out) #0 {
; GCN-LABEL: vload2_private:		; GFX900-LABEL: vload2_private:
; GCN: ; %bb.0: ; %entry		; GFX900: ; %bb.0: ; %entry
; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9		; GFX900-NEXT: s_add_u32 flat_scratch_lo, s6, s9
; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0		; GFX900-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
; GCN-NEXT: s_load_dwordx4 s[4:7], s[4:5], 0x0		; GFX900-NEXT: s_load_dwordx4 s[4:7], s[4:5], 0x0
; GCN-NEXT: s_add_u32 s0, s0, s9		; GFX900-NEXT: s_add_u32 s0, s0, s9
; GCN-NEXT: s_addc_u32 s1, s1, 0		; GFX900-NEXT: s_addc_u32 s1, s1, 0
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; GFX900-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, s4		; GFX900-NEXT: v_mov_b32_e32 v0, s4
; GCN-NEXT: v_mov_b32_e32 v1, s5		; GFX900-NEXT: v_mov_b32_e32 v1, s5
; GCN-NEXT: global_load_ushort v2, v[0:1], off		; GFX900-NEXT: global_load_ushort v2, v[0:1], off
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:4		; GFX900-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:4
; GCN-NEXT: global_load_ushort v2, v[0:1], off offset:2		; GFX900-NEXT: global_load_ushort v2, v[0:1], off offset:2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:6		; GFX900-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:6
; GCN-NEXT: global_load_ushort v2, v[0:1], off offset:4		; GFX900-NEXT: global_load_ushort v2, v[0:1], off offset:4
; GCN-NEXT: v_mov_b32_e32 v0, s6		; GFX900-NEXT: v_mov_b32_e32 v0, s6
; GCN-NEXT: v_mov_b32_e32 v1, s7		; GFX900-NEXT: v_mov_b32_e32 v1, s7
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:8		; GFX900-NEXT: buffer_store_short v2, off, s[0:3], 0 offset:8
; GCN-NEXT: buffer_load_ushort v2, off, s[0:3], 0 offset:4		; GFX900-NEXT: buffer_load_ushort v2, off, s[0:3], 0 offset:4
; GCN-NEXT: buffer_load_ushort v4, off, s[0:3], 0 offset:6		; GFX900-NEXT: buffer_load_ushort v4, off, s[0:3], 0 offset:6
; GCN-NEXT: s_waitcnt vmcnt(1)		; GFX900-NEXT: s_waitcnt vmcnt(1)
; GCN-NEXT: v_and_b32_e32 v2, 0xffff, v2		; GFX900-NEXT: v_and_b32_e32 v2, 0xffff, v2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v3, v4		; GFX900-NEXT: v_mov_b32_e32 v3, v4
; GCN-NEXT: buffer_load_short_d16_hi v3, off, s[0:3], 0 offset:8		; GFX900-NEXT: buffer_load_short_d16_hi v3, off, s[0:3], 0 offset:8
; GCN-NEXT: v_lshl_or_b32 v2, v4, 16, v2		; GFX900-NEXT: v_lshl_or_b32 v2, v4, 16, v2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: global_store_dwordx2 v[0:1], v[2:3], off		; GFX900-NEXT: global_store_dwordx2 v[0:1], v[2:3], off
; GCN-NEXT: s_endpgm		; GFX900-NEXT: s_endpgm
		;
		; FLATSCR-LABEL: vload2_private:
		; FLATSCR: ; %bb.0: ; %entry
		; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s6, s9
		; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
		; FLATSCR-NEXT: s_load_dwordx4 s[4:7], s[4:5], 0x0
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v0, s4
		; FLATSCR-NEXT: v_mov_b32_e32 v1, s5
		; FLATSCR-NEXT: global_load_ushort v2, v[0:1], off
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: scratch_store_short off, v2, vcc_hi offset:4
		; FLATSCR-NEXT: global_load_ushort v2, v[0:1], off offset:2
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: scratch_store_short off, v2, vcc_hi offset:6
		; FLATSCR-NEXT: global_load_ushort v2, v[0:1], off offset:4
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: v_mov_b32_e32 v0, s6
		; FLATSCR-NEXT: v_mov_b32_e32 v1, s7
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: scratch_store_short off, v2, vcc_hi offset:8
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: scratch_load_ushort v2, off, vcc_hi offset:4
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: scratch_load_ushort v4, off, vcc_hi offset:6
		; FLATSCR-NEXT: s_mov_b32 vcc_hi, 0
		; FLATSCR-NEXT: s_waitcnt vmcnt(1)
		; FLATSCR-NEXT: v_and_b32_e32 v2, 0xffff, v2
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v3, v4
		; FLATSCR-NEXT: scratch_load_short_d16_hi v3, off, vcc_hi offset:8
		; FLATSCR-NEXT: v_lshl_or_b32 v2, v4, 16, v2
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: global_store_dwordx2 v[0:1], v[2:3], off
		; FLATSCR-NEXT: s_endpgm
entry:		entry:
%loc = alloca [3 x i16], align 2, addrspace(5)		%loc = alloca [3 x i16], align 2, addrspace(5)
%loc.0.sroa_cast1 = bitcast [3 x i16] addrspace(5)* %loc to i8 addrspace(5)*		%loc.0.sroa_cast1 = bitcast [3 x i16] addrspace(5)* %loc to i8 addrspace(5)*
%tmp = load i16, i16 addrspace(1)* %in, align 2		%tmp = load i16, i16 addrspace(1)* %in, align 2
%loc.0.sroa_idx = getelementptr inbounds [3 x i16], [3 x i16] addrspace(5)* %loc, i32 0, i32 0		%loc.0.sroa_idx = getelementptr inbounds [3 x i16], [3 x i16] addrspace(5)* %loc, i32 0, i32 0
store volatile i16 %tmp, i16 addrspace(5)* %loc.0.sroa_idx		store volatile i16 %tmp, i16 addrspace(5)* %loc.0.sroa_idx
%arrayidx.1 = getelementptr inbounds i16, i16 addrspace(1)* %in, i64 1		%arrayidx.1 = getelementptr inbounds i16, i16 addrspace(1)* %in, i64 1
%tmp1 = load i16, i16 addrspace(1)* %arrayidx.1, align 2		%tmp1 = load i16, i16 addrspace(1)* %arrayidx.1, align 2
[58 lines not shown]	bb:
%load_hi = load volatile i16, i16 addrspace(3)* %gep_hi		%load_hi = load volatile i16, i16 addrspace(3)* %gep_hi
%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1		%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1
%op.hi = add <2 x i16> %to.hi, <i16 12, i16 12>		%op.hi = add <2 x i16> %to.hi, <i16 12, i16 12>
%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0		%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0
ret <2 x i16> %result		ret <2 x i16> %result
}		}

define <2 x i16> @chain_hi_to_lo_private_other_dep(i16 addrspace(5)* %ptr) {		define <2 x i16> @chain_hi_to_lo_private_other_dep(i16 addrspace(5)* %ptr) {
; GCN-LABEL: chain_hi_to_lo_private_other_dep:		; GFX900-LABEL: chain_hi_to_lo_private_other_dep:
; GCN: ; %bb.0: ; %bb		; GFX900: ; %bb.0: ; %bb
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: buffer_load_short_d16_hi v1, v0, s[0:3], 0 offen		; GFX900-NEXT: buffer_load_short_d16_hi v1, v0, s[0:3], 0 offen
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_pk_sub_u16 v1, v1, -12 op_sel_hi:[1,0]		; GFX900-NEXT: v_pk_sub_u16 v1, v1, -12 op_sel_hi:[1,0]
; GCN-NEXT: buffer_load_short_d16 v1, v0, s[0:3], 0 offen offset:2		; GFX900-NEXT: buffer_load_short_d16 v1, v0, s[0:3], 0 offen offset:2
; GCN-NEXT: s_waitcnt vmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, v1		; GFX900-NEXT: v_mov_b32_e32 v0, v1
; GCN-NEXT: s_setpc_b64 s[30:31]		; GFX900-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: chain_hi_to_lo_private_other_dep:
		; FLATSCR: ; %bb.0: ; %bb
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: scratch_load_short_d16_hi v1, v0, off
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_pk_sub_u16 v1, v1, -12 op_sel_hi:[1,0]
		; FLATSCR-NEXT: scratch_load_short_d16 v1, v0, off offset:2
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v0, v1
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]
bb:		bb:
%gep_lo = getelementptr inbounds i16, i16 addrspace(5)* %ptr, i64 1		%gep_lo = getelementptr inbounds i16, i16 addrspace(5)* %ptr, i64 1
%load_lo = load i16, i16 addrspace(5)* %gep_lo		%load_lo = load i16, i16 addrspace(5)* %gep_lo
%gep_hi = getelementptr inbounds i16, i16 addrspace(5)* %ptr, i64 0		%gep_hi = getelementptr inbounds i16, i16 addrspace(5)* %ptr, i64 0
%load_hi = load i16, i16 addrspace(5)* %gep_hi		%load_hi = load i16, i16 addrspace(5)* %gep_hi
%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1		%to.hi = insertelement <2 x i16> undef, i16 %load_hi, i32 1
%op.hi = add <2 x i16> %to.hi, <i16 12, i16 12>		%op.hi = add <2 x i16> %to.hi, <i16 12, i16 12>
%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0		%result = insertelement <2 x i16> %op.hi, i16 %load_lo, i32 0
[70 lines not shown]

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -mattr=-unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX7-ALIGNED %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -mattr=-unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX7-ALIGNED %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -mattr=+unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX7-UNALIGNED %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=hawaii -mattr=+unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX7-UNALIGNED %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=+unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX9 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=+unaligned-scratch-access < %s | FileCheck -check-prefixes=GCN,GFX9 %s
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=+unaligned-scratch-access -amdgpu-enable-flat-scratch < %s | FileCheck -check-prefixes=GCN,GFX9-FLASTSCR %s

	; Should not merge this to a dword load			; Should not merge this to a dword load
	define i32 @private_load_2xi16_align2(i16 addrspace(5)* %p) #0 {			define i32 @private_load_2xi16_align2(i16 addrspace(5)* %p) #0 {
	; GFX7-ALIGNED-LABEL: private_load_2xi16_align2:			; GFX7-ALIGNED-LABEL: private_load_2xi16_align2:
	; GFX7-ALIGNED: ; %bb.0:			; GFX7-ALIGNED: ; %bb.0:
	; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX7-ALIGNED-NEXT: v_add_i32_e32 v1, vcc, 2, v0			; GFX7-ALIGNED-NEXT: v_add_i32_e32 v1, vcc, 2, v0
	; GFX7-ALIGNED-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen			; GFX7-ALIGNED-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen
	[17 lines not shown]
	; GFX9-LABEL: private_load_2xi16_align2:			; GFX9-LABEL: private_load_2xi16_align2:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen			; GFX9-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen
	; GFX9-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen offset:2			; GFX9-NEXT: buffer_load_ushort v0, v0, s[0:3], 0 offen offset:2
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v0, v0, 16, v1			; GFX9-NEXT: v_lshl_or_b32 v0, v0, 16, v1
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_load_2xi16_align2:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: scratch_load_ushort v1, v0, off
				; GFX9-FLASTSCR-NEXT: scratch_load_ushort v0, v0, off offset:2
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_lshl_or_b32 v0, v0, 16, v1
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
	%p.0 = load i16, i16 addrspace(5)* %p, align 2			%p.0 = load i16, i16 addrspace(5)* %p, align 2
	%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2			%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	ret i32 %or			ret i32 %or
	[27 lines not shown]
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: v_mov_b32_e32 v0, 1			; GFX9-NEXT: v_mov_b32_e32 v0, 1
	; GFX9-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen			; GFX9-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen
	; GFX9-NEXT: v_mov_b32_e32 v0, 2			; GFX9-NEXT: v_mov_b32_e32 v0, 2
	; GFX9-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen offset:2			; GFX9-NEXT: buffer_store_short v0, v1, s[0:3], 0 offen offset:2
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_store_2xi16_align2:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v0, 1
				; GFX9-FLASTSCR-NEXT: scratch_store_short v0, v1, off
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v0, 2
				; GFX9-FLASTSCR-NEXT: scratch_store_short v0, v1, off offset:2
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1			%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1
	store i16 1, i16 addrspace(5)* %r, align 2			store i16 1, i16 addrspace(5)* %r, align 2
	store i16 2, i16 addrspace(5)* %gep.r, align 2			store i16 2, i16 addrspace(5)* %gep.r, align 2
	ret void			ret void
	}			}

	; Should produce align 1 dword when legal			; Should produce align 1 dword when legal
	define i32 @private_load_2xi16_align1(i16 addrspace(5)* %p) #0 {			define i32 @private_load_2xi16_align1(i16 addrspace(5)* %p) #0 {
	[30 lines not shown]
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen			; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
	; GFX9-NEXT: v_mov_b32_e32 v1, 0xffff			; GFX9-NEXT: v_mov_b32_e32 v1, 0xffff
	; GFX9-NEXT: s_mov_b32 s4, 0xffff			; GFX9-NEXT: s_mov_b32 s4, 0xffff
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_bfi_b32 v1, v1, 0, v0			; GFX9-NEXT: v_bfi_b32 v1, v1, 0, v0
	; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1			; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_load_2xi16_align1:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v1, 0xffff
				; GFX9-FLASTSCR-NEXT: s_mov_b32 s4, 0xffff
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_bfi_b32 v1, v1, 0, v0
				; GFX9-FLASTSCR-NEXT: v_and_or_b32 v0, v0, s4, v1
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
	%p.0 = load i16, i16 addrspace(5)* %p, align 1			%p.0 = load i16, i16 addrspace(5)* %p, align 1
	%p.1 = load i16, i16 addrspace(5)* %gep.p, align 1			%p.1 = load i16, i16 addrspace(5)* %gep.p, align 1
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	ret i32 %or			ret i32 %or
	[27 lines not shown]
	;			;
	; GFX9-LABEL: private_store_2xi16_align1:			; GFX9-LABEL: private_store_2xi16_align1:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: v_mov_b32_e32 v0, 0x20001			; GFX9-NEXT: v_mov_b32_e32 v0, 0x20001
	; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen			; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_store_2xi16_align1:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v0, 0x20001
				; GFX9-FLASTSCR-NEXT: scratch_store_dword v0, v1, off
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1			%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1
	store i16 1, i16 addrspace(5)* %r, align 1			store i16 1, i16 addrspace(5)* %r, align 1
	store i16 2, i16 addrspace(5)* %gep.r, align 1			store i16 2, i16 addrspace(5)* %gep.r, align 1
	ret void			ret void
	}			}

	; Should merge this to a dword load			; Should merge this to a dword load
	define i32 @private_load_2xi16_align4(i16 addrspace(5)* %p) #0 {			define i32 @private_load_2xi16_align4(i16 addrspace(5)* %p) #0 {
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen			; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
	; GFX9-NEXT: v_mov_b32_e32 v1, 0xffff			; GFX9-NEXT: v_mov_b32_e32 v1, 0xffff
	; GFX9-NEXT: s_mov_b32 s4, 0xffff			; GFX9-NEXT: s_mov_b32 s4, 0xffff
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_bfi_b32 v1, v1, 0, v0			; GFX9-NEXT: v_bfi_b32 v1, v1, 0, v0
	; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1			; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_load_2xi16_align4:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v1, 0xffff
				; GFX9-FLASTSCR-NEXT: s_mov_b32 s4, 0xffff
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_bfi_b32 v1, v1, 0, v0
				; GFX9-FLASTSCR-NEXT: v_and_or_b32 v0, v0, s4, v1
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
	%p.0 = load i16, i16 addrspace(5)* %p, align 4			%p.0 = load i16, i16 addrspace(5)* %p, align 4
	%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2			%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	ret i32 %or			ret i32 %or
	}			}

	; Should merge this to a dword store			; Should merge this to a dword store
	define void @private_store_2xi16_align4(i16 addrspace(5)* %p, i16 addrspace(5)* %r) #0 {			define void @private_store_2xi16_align4(i16 addrspace(5)* %p, i16 addrspace(5)* %r) #0 {
	; GFX7-LABEL: private_store_2xi16_align4:			; GFX7-LABEL: private_store_2xi16_align4:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2			; GFX7-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x2
	; GFX7-NEXT: v_mov_b32_e32 v2, 0x20001			; GFX7-NEXT: v_mov_b32_e32 v2, 0x20001
	; GFX7-NEXT: s_waitcnt lgkmcnt(0)			; GFX7-NEXT: s_waitcnt lgkmcnt(0)
	; GFX7-NEXT: v_mov_b32_e32 v0, s0			; GFX7-NEXT: v_mov_b32_e32 v0, s0
	; GFX7-NEXT: v_mov_b32_e32 v1, s1			; GFX7-NEXT: v_mov_b32_e32 v1, s1
	; GFX7-NEXT: flat_store_dword v[0:1], v2			; GFX7-NEXT: flat_store_dword v[0:1], v2
	; GFX7-NEXT: s_endpgm			; GFX7-NEXT: s_endpgm
	;			;
	; GCN-LABEL: private_store_2xi16_align4:			; GFX7-ALIGNED-LABEL: private_store_2xi16_align4:
	; GCN: ; %bb.0:			; GFX7-ALIGNED: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v0, 0x20001			; GFX7-ALIGNED-NEXT: v_mov_b32_e32 v0, 0x20001
	; GCN-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen			; GFX7-ALIGNED-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: s_setpc_b64 s[30:31]			; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX7-UNALIGNED-LABEL: private_store_2xi16_align4:
				; GFX7-UNALIGNED: ; %bb.0:
				; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX7-UNALIGNED-NEXT: v_mov_b32_e32 v0, 0x20001
				; GFX7-UNALIGNED-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
				; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
				; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: private_store_2xi16_align4:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0x20001
				; GFX9-NEXT: buffer_store_dword v0, v1, s[0:3], 0 offen
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-FLASTSCR-LABEL: private_store_2xi16_align4:
				; GFX9-FLASTSCR: ; %bb.0:
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v0, 0x20001
				; GFX9-FLASTSCR-NEXT: scratch_store_dword v0, v1, off
				; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
				; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1			%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1
	store i16 1, i16 addrspace(5)* %r, align 4			store i16 1, i16 addrspace(5)* %r, align 4
	store i16 2, i16 addrspace(5)* %gep.r, align 2			store i16 2, i16 addrspace(5)* %gep.r, align 2
	ret void			ret void
	}			}

llvm/test/CodeGen/AMDGPU/flat-scratch.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=-promote-alloca -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX9 %s
				; RUN: llc -march=amdgcn -mcpu=gfx1010 -mattr=-promote-alloca -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX10 %s

				define amdgpu_kernel void @zero_init_kernel() {
				; GFX9-LABEL: zero_init_kernel:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:76
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:72
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:68
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:64
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:60
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:56
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:52
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:48
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:44
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:40
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:36
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:32
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:28
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:24
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:20
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:16
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: zero_init_kernel:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:76
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:72
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:68
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:64
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:60
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:56
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:52
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:48
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:44
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:40
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:36
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:32
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:28
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:24
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:20
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:16
				; GFX10-NEXT: s_endpgm
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define void @zero_init_foo() {
				; GFX9-LABEL: zero_init_foo:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:60
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:56
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:52
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:48
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:44
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:40
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:36
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:32
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:28
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:24
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:20
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:16
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:12
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:8
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:4
				; GFX9-NEXT: scratch_store_dword off, v0, s32
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: zero_init_foo:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:60
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:56
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:52
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:48
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:44
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:40
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:36
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:32
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:28
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:24
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:20
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:16
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:12
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:8
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:4
				; GFX10-NEXT: scratch_store_dword off, v0, s32
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define amdgpu_kernel void @store_load_sindex_kernel(i32 %idx) {
				; GFX9-LABEL: store_load_sindex_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 4, s1
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 4, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 4, s0
				; GFX10-NEXT: s_add_u32 s1, 4, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%i = alloca [32 x float], align 4, addrspace(5)
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_ps void @store_load_sindex_foo(i32 inreg %idx) {
				; GFX9-LABEL: store_load_sindex_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 4, s1
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 4, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 4, s0
				; GFX10-NEXT: s_add_u32 s1, 4, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%i = alloca [32 x float], align 4, addrspace(5)
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_kernel void @store_load_vindex_kernel() {
				; GFX9-LABEL: store_load_vindex_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-NEXT: v_mov_b32_e32 v1, 4
				; GFX9-NEXT: v_add_u32_e32 v2, v1, v0
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_sub_u32_e32 v0, v1, v0
				; GFX9-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_vindex_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: v_mov_b32_e32 v1, 4
				; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX10-NEXT: v_mov_b32_e32 v3, 15
				; GFX10-NEXT: v_add_nc_u32_e32 v2, v1, v0
				; GFX10-NEXT: v_sub_nc_u32_e32 v0, v1, v0
				; GFX10-NEXT: scratch_store_dword v3, v2, off
				; GFX10-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX10-NEXT: s_endpgm
				bb:
				%i = alloca [32 x float], align 4, addrspace(5)
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i2 = tail call i32 @llvm.amdgcn.workitem.id.x()
				%i3 = zext i32 %i2 to i64
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i2
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = sub nsw i32 31, %i2
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define void @store_load_vindex_foo(i32 %idx) {
				; GFX9-LABEL: store_load_vindex_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, s32
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: v_lshl_add_u32 v2, v0, 2, v1
				; GFX9-NEXT: v_and_b32_e32 v0, v0, v3
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_lshl_add_u32 v0, v0, 2, v1
				; GFX9-NEXT: scratch_load_dword v0, v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_vindex_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: v_mov_b32_e32 v2, s32
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: v_and_b32_e32 v3, v0, v1
				; GFX10-NEXT: v_lshl_add_u32 v0, v0, 2, v2
				; GFX10-NEXT: v_lshl_add_u32 v2, v3, 2, v2
				; GFX10-NEXT: scratch_store_dword v1, v0, off
				; GFX10-NEXT: scratch_load_dword v0, v2, off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				%i = alloca [32 x float], align 4, addrspace(5)
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define void @private_ptr_foo(float addrspace(5)* nocapture %arg) {
				; GFX9-LABEL: private_ptr_foo:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x41200000
				; GFX9-NEXT: scratch_store_dword v1, v0, off offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: private_ptr_foo:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 0x41200000
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword v1, v0, off offset:4
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%gep = getelementptr inbounds float, float addrspace(5)* %arg, i32 1
				store float 1.000000e+01, float addrspace(5)* %gep, align 4
				ret void
				}

				define amdgpu_kernel void @zero_init_small_offset_kernel() {
				; GFX9-LABEL: zero_init_small_offset_kernel:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:284
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:280
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:276
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:272
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:300
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:296
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:292
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:288
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:316
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:312
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:308
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:304
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:332
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:328
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:324
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:320
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: zero_init_small_offset_kernel:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:284
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:280
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:276
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:272
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:300
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:296
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:292
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:288
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:316
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:312
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:308
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:304
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:332
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:328
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:324
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:320
				; GFX10-NEXT: s_endpgm
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define void @zero_init_small_offset_foo() {
				; GFX9-LABEL: zero_init_small_offset_foo:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v0, off, s32
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:268
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:264
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:260
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:256
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:284
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:280
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:276
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:272
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:300
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:296
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:292
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:288
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:316
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:312
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:308
				; GFX9-NEXT: scratch_store_dword off, v0, s32 offset:304
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: zero_init_small_offset_foo:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: scratch_load_dword v0, off, s32
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:268
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:264
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:260
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:256
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:284
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:280
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:276
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:272
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:300
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:296
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:292
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:288
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:316
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:312
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:308
				; GFX10-NEXT: scratch_store_dword off, v0, s32 offset:304
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define amdgpu_kernel void @store_load_sindex_small_offset_kernel(i32 %idx) {
				; GFX9-LABEL: store_load_sindex_small_offset_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 0x104, s1
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 0x104, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_small_offset_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 0x104, s0
				; GFX10-NEXT: s_add_u32 s1, 0x104, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_ps void @store_load_sindex_small_offset_foo(i32 inreg %idx) {
				; GFX9-LABEL: store_load_sindex_small_offset_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 0x104, s1
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 0x104, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_small_offset_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 0x104, s0
				; GFX10-NEXT: s_add_u32 s1, 0x104, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_kernel void @store_load_vindex_small_offset_kernel() {
				; GFX9-LABEL: store_load_vindex_small_offset_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_load_dword v1, off, vcc_hi offset:4
				; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x104
				; GFX9-NEXT: v_add_u32_e32 v2, v1, v0
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_sub_u32_e32 v0, v1, v0
				; GFX9-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_vindex_small_offset_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: v_mov_b32_e32 v1, 0x104
				; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX10-NEXT: v_mov_b32_e32 v3, 15
				; GFX10-NEXT: v_add_nc_u32_e32 v2, v1, v0
				; GFX10-NEXT: v_sub_nc_u32_e32 v0, v1, v0
				; GFX10-NEXT: scratch_load_dword v1, off, null offset:4
				; GFX10-NEXT: scratch_store_dword v3, v2, off
				; GFX10-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i2 = tail call i32 @llvm.amdgcn.workitem.id.x()
				%i3 = zext i32 %i2 to i64
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i2
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = sub nsw i32 31, %i2
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define void @store_load_vindex_small_offset_foo(i32 %idx) {
				; GFX9-LABEL: store_load_vindex_small_offset_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v1, off, s32
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x100
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, vcc_hi
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: v_lshl_add_u32 v2, v0, 2, v1
				; GFX9-NEXT: v_and_b32_e32 v0, v0, v3
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_lshl_add_u32 v0, v0, 2, v1
				; GFX9-NEXT: scratch_load_dword v0, v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_vindex_small_offset_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x100
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: v_mov_b32_e32 v2, vcc_lo
				; GFX10-NEXT: v_and_b32_e32 v3, v0, v1
				; GFX10-NEXT: v_lshl_add_u32 v0, v0, 2, v2
				; GFX10-NEXT: v_lshl_add_u32 v2, v3, 2, v2
				; GFX10-NEXT: scratch_load_dword v3, off, s32
				; GFX10-NEXT: scratch_store_dword v1, v0, off
				; GFX10-NEXT: scratch_load_dword v0, v2, off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				%padding = alloca [64 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [64 x i32], [64 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_kernel void @zero_init_large_offset_kernel() {
				; GFX9-LABEL: zero_init_large_offset_kernel:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:12
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:8
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:4
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:28
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:24
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:20
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:16
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:44
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:40
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:36
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:32
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:60
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:56
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:52
				; GFX9-NEXT: s_movk_i32 vcc_hi, 0x4010
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:48
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: zero_init_large_offset_kernel:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:12
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:8
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:4
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:28
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:24
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:20
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:16
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:44
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:40
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:36
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:32
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:60
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:56
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:52
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_movk_i32 vcc_lo, 0x4010
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:48
				; GFX10-NEXT: s_endpgm
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define void @zero_init_large_offset_foo() {
				; GFX9-LABEL: zero_init_large_offset_foo:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v0, off, s32
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 0
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:12
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:8
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:4
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:28
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:24
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:20
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:16
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:44
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:40
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:36
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:32
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:60
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:56
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:52
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:48
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: zero_init_large_offset_foo:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: scratch_load_dword v0, off, s32
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 0
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:12
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:8
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:4
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:28
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:24
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:20
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:16
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:44
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:40
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:36
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:32
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:60
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:56
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:52
				; GFX10-NEXT: s_waitcnt_depctr 0xffe3
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: scratch_store_dword off, v0, vcc_lo offset:48
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%alloca = alloca [32 x i16], align 2, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%cast = bitcast [32 x i16] addrspace(5)* %alloca to i8 addrspace(5)*
				call void @llvm.memset.p5i8.i64(i8 addrspace(5)* align 2 dereferenceable(64) %cast, i8 0, i64 64, i1 false)
				ret void
				}

				define amdgpu_kernel void @store_load_sindex_large_offset_kernel(i32 %idx) {
				; GFX9-LABEL: store_load_sindex_large_offset_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 0x4004, s1
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 0x4004, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_large_offset_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 0x4004, s0
				; GFX10-NEXT: s_add_u32 s1, 0x4004, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_ps void @store_load_sindex_large_offset_foo(i32 inreg %idx) {
				; GFX9-LABEL: store_load_sindex_large_offset_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: s_lshl_b32 s1, s0, 2
				; GFX9-NEXT: s_and_b32 s0, s0, 15
				; GFX9-NEXT: scratch_load_dword v0, off, vcc_hi offset:4
				; GFX9-NEXT: s_lshl_b32 s0, s0, 2
				; GFX9-NEXT: s_add_u32 s1, 0x4004, s1
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: scratch_store_dword off, v0, s1
				; GFX9-NEXT: s_add_u32 s0, 0x4004, s0
				; GFX9-NEXT: scratch_load_dword v0, off, s0
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_sindex_large_offset_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: scratch_load_dword v0, off, null offset:4
				; GFX10-NEXT: s_and_b32 s1, s0, 15
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: v_mov_b32_e32 v0, 15
				; GFX10-NEXT: s_lshl_b32 s0, s0, 2
				; GFX10-NEXT: s_lshl_b32 s1, s1, 2
				; GFX10-NEXT: s_add_u32 s0, 0x4004, s0
				; GFX10-NEXT: s_add_u32 s1, 0x4004, s1
				; GFX10-NEXT: scratch_store_dword off, v0, s0
				; GFX10-NEXT: scratch_load_dword v0, off, s1
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_kernel void @store_load_vindex_large_offset_kernel() {
				; GFX9-LABEL: store_load_vindex_large_offset_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_load_dword v1, off, vcc_hi offset:4
				; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x4004
				; GFX9-NEXT: v_add_u32_e32 v2, v1, v0
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_sub_u32_e32 v0, v1, v0
				; GFX9-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_vindex_large_offset_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: v_mov_b32_e32 v1, 0x4004
				; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX10-NEXT: v_mov_b32_e32 v3, 15
				; GFX10-NEXT: v_add_nc_u32_e32 v2, v1, v0
				; GFX10-NEXT: v_sub_nc_u32_e32 v0, v1, v0
				; GFX10-NEXT: scratch_load_dword v1, off, null offset:4
				; GFX10-NEXT: scratch_store_dword v3, v2, off
				; GFX10-NEXT: scratch_load_dword v0, v0, off offset:124
				; GFX10-NEXT: s_endpgm
				bb:
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i2 = tail call i32 @llvm.amdgcn.workitem.id.x()
				%i3 = zext i32 %i2 to i64
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i2
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = sub nsw i32 31, %i2
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define void @store_load_vindex_large_offset_foo(i32 %idx) {
				; GFX9-LABEL: store_load_vindex_large_offset_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: scratch_load_dword v1, off, s32
				; GFX9-NEXT: s_add_u32 vcc_hi, s32, 0x4000
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, vcc_hi
				; GFX9-NEXT: v_mov_b32_e32 v3, 15
				; GFX9-NEXT: v_lshl_add_u32 v2, v0, 2, v1
				; GFX9-NEXT: v_and_b32_e32 v0, v0, v3
				; GFX9-NEXT: scratch_store_dword v3, v2, off
				; GFX9-NEXT: v_lshl_add_u32 v0, v0, 2, v1
				; GFX9-NEXT: scratch_load_dword v0, v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_vindex_large_offset_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: s_add_u32 vcc_lo, s32, 0x4000
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: v_mov_b32_e32 v2, vcc_lo
				; GFX10-NEXT: v_and_b32_e32 v3, v0, v1
				; GFX10-NEXT: v_lshl_add_u32 v0, v0, 2, v2
				; GFX10-NEXT: v_lshl_add_u32 v2, v3, 2, v2
				; GFX10-NEXT: scratch_load_dword v3, off, s32
				; GFX10-NEXT: scratch_store_dword v1, v0, off
				; GFX10-NEXT: scratch_load_dword v0, v2, off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				%padding = alloca [4096 x i32], align 4, addrspace(5)
				%i = alloca [32 x float], align 4, addrspace(5)
				%pad_gep = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %padding, i32 0, i32 undef
				%pad_load = load volatile i32, i32 addrspace(5)* %pad_gep, align 4
				%i1 = bitcast [32 x float] addrspace(5)* %i to i8 addrspace(5)*
				%i7 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %idx
				%i8 = bitcast float addrspace(5)* %i7 to i32 addrspace(5)*
				store volatile i32 15, i32 addrspace(5)* %i8, align 4
				%i9 = and i32 %idx, 15
				%i10 = getelementptr inbounds [32 x float], [32 x float] addrspace(5)* %i, i32 0, i32 %i9
				%i11 = bitcast float addrspace(5)* %i10 to i32 addrspace(5)*
				%i12 = load volatile i32, i32 addrspace(5)* %i11, align 4
				ret void
				}

				define amdgpu_kernel void @store_load_large_imm_offset_kernel() {
				; GFX9-LABEL: store_load_large_imm_offset_kernel:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_movk_i32 s0, 0x3000
				; GFX9-NEXT: v_mov_b32_e32 v0, 13
				; GFX9-NEXT: s_mov_b32 vcc_hi, 0
				; GFX9-NEXT: scratch_store_dword off, v0, vcc_hi offset:4
				; GFX9-NEXT: s_add_u32 s0, 4, s0
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: scratch_store_dword off, v0, s0 offset:3712
				; GFX9-NEXT: scratch_load_dword v0, off, s0 offset:3712
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_large_imm_offset_kernel:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: v_mov_b32_e32 v0, 13
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: s_movk_i32 s0, 0x3800
				; GFX10-NEXT: s_add_u32 s0, 4, s0
				; GFX10-NEXT: scratch_store_dword off, v0, null offset:4
				; GFX10-NEXT: scratch_store_dword off, v1, s0 offset:1664
				; GFX10-NEXT: scratch_load_dword v0, off, s0 offset:1664
				; GFX10-NEXT: s_endpgm
				bb:
				%i = alloca [4096 x i32], align 4, addrspace(5)
				%i1 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 undef
				store volatile i32 13, i32 addrspace(5)* %i1, align 4
				%i7 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 4000
				store volatile i32 15, i32 addrspace(5)* %i7, align 4
				%i10 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 4000
				%i12 = load volatile i32, i32 addrspace(5)* %i10, align 4
				ret void
				}

				define void @store_load_large_imm_offset_foo() {
				; GFX9-LABEL: store_load_large_imm_offset_foo:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_movk_i32 s4, 0x3000
				; GFX9-NEXT: v_mov_b32_e32 v0, 13
				; GFX9-NEXT: scratch_store_dword off, v0, s32
				; GFX9-NEXT: s_add_u32 s4, s32, s4
				; GFX9-NEXT: v_mov_b32_e32 v0, 15
				; GFX9-NEXT: scratch_store_dword off, v0, s4 offset:3712
				; GFX9-NEXT: scratch_load_dword v0, off, s4 offset:3712
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_large_imm_offset_foo:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v0, 13
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: s_movk_i32 s4, 0x3800
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: s_add_u32 s4, s32, s4
				; GFX10-NEXT: scratch_store_dword off, v0, s32
				; GFX10-NEXT: scratch_store_dword off, v1, s4 offset:1664
				; GFX10-NEXT: scratch_load_dword v0, off, s4 offset:1664
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				%i = alloca [4096 x i32], align 4, addrspace(5)
				%i1 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 undef
				store volatile i32 13, i32 addrspace(5)* %i1, align 4
				%i7 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 4000
				store volatile i32 15, i32 addrspace(5)* %i7, align 4
				%i10 = getelementptr inbounds [4096 x i32], [4096 x i32] addrspace(5)* %i, i32 0, i32 4000
				%i12 = load volatile i32, i32 addrspace(5)* %i10, align 4
				ret void
				}

				define amdgpu_kernel void @store_load_vidx_sidx_offset(i32 %sidx) {
				; GFX9-LABEL: store_load_vidx_sidx_offset:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX9-NEXT: v_mov_b32_e32 v1, 4
				; GFX9-NEXT: s_waitcnt lgkmcnt(0)
				; GFX9-NEXT: v_add_u32_e32 v0, s0, v0
				; GFX9-NEXT: v_lshl_add_u32 v0, v0, 2, v1
				; GFX9-NEXT: v_mov_b32_e32 v1, 15
				; GFX9-NEXT: scratch_store_dword v1, v0, off offset:1024
				; GFX9-NEXT: scratch_load_dword v0, v0, off offset:1024
				; GFX9-NEXT: s_endpgm
				;
				; GFX10-LABEL: store_load_vidx_sidx_offset:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_load_dword s0, s[0:1], 0x24
				; GFX10-NEXT: v_mov_b32_e32 v1, 15
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: v_add_nc_u32_e32 v0, s0, v0
				; GFX10-NEXT: v_lshl_add_u32 v0, v0, 2, 4
				; GFX10-NEXT: scratch_store_dword v1, v0, off offset:1024
				; GFX10-NEXT: scratch_load_dword v0, v0, off offset:1024
				; GFX10-NEXT: s_endpgm
				bb:
				%alloca = alloca [32 x i32], align 4, addrspace(5)
				%vidx = tail call i32 @llvm.amdgcn.workitem.id.x()
				%add1 = add nsw i32 %sidx, %vidx
				%add2 = add nsw i32 %add1, 256
				%gep = getelementptr inbounds [32 x i32], [32 x i32] addrspace(5)* %alloca, i32 0, i32 %add2
				store volatile i32 15, i32 addrspace(5)* %gep, align 4
				%load = load volatile i32, i32 addrspace(5)* %gep, align 4
				ret void
				}

				; FIXME: Multi-DWORD scratch shall be supported
				define void @store_load_i64_aligned(i64 addrspace(5)* nocapture %arg) {
				; GFX9-LABEL: store_load_i64_aligned:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, 0
				; GFX9-NEXT: scratch_store_dword v1, v0, off offset:4
				; GFX9-NEXT: v_mov_b32_e32 v1, 15
				; GFX9-NEXT: scratch_store_dword v1, v0, off
				; GFX9-NEXT: scratch_load_dword v1, v0, off offset:4
				; GFX9-NEXT: scratch_load_dword v0, v0, off
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_i64_aligned:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 0
				; GFX10-NEXT: v_mov_b32_e32 v2, 15
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_dword v1, v0, off offset:4
				; GFX10-NEXT: scratch_store_dword v2, v0, off
				; GFX10-NEXT: s_clause 0x1
				; GFX10-NEXT: scratch_load_dword v1, v0, off offset:4
				; GFX10-NEXT: scratch_load_dword v0, v0, off
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				store volatile i64 15, i64 addrspace(5)* %arg, align 8
				%load = load volatile i64, i64 addrspace(5)* %arg, align 8
				ret void
				}

				; FIXME: Multi-DWORD unaligned scratch shall be supported
				define void @store_load_i64_unaligned(i64 addrspace(5)* nocapture %arg) {
				; GFX9-LABEL: store_load_i64_unaligned:
				; GFX9: ; %bb.0: ; %bb
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_mov_b32_e32 v1, 0
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:7
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:6
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:5
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:4
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:3
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:2
				; GFX9-NEXT: scratch_store_byte v1, v0, off offset:1
				; GFX9-NEXT: v_mov_b32_e32 v1, 15
				; GFX9-NEXT: scratch_store_byte v1, v0, off
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:6
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:7
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:4
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:5
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:2
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off offset:3
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: scratch_load_ubyte v1, v0, off
				; GFX9-NEXT: scratch_load_ubyte v0, v0, off offset:1
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: store_load_i64_unaligned:
				; GFX10: ; %bb.0: ; %bb
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_mov_b32_e32 v1, 0
				; GFX10-NEXT: v_mov_b32_e32 v2, 15
				; GFX10-NEXT: ; implicit-def: $vcc_hi
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:7
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:6
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:5
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:4
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:3
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:2
				; GFX10-NEXT: scratch_store_byte v1, v0, off offset:1
				; GFX10-NEXT: scratch_store_byte v2, v0, off
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:6
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:7
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:4
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:5
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:2
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off offset:3
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_clause 0x1
				; GFX10-NEXT: scratch_load_ubyte v1, v0, off
				; GFX10-NEXT: scratch_load_ubyte v0, v0, off offset:1
				; GFX10-NEXT: s_waitcnt vmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				bb:
				store volatile i64 15, i64 addrspace(5)* %arg, align 1
				%load = load volatile i64, i64 addrspace(5)* %arg, align 1
				ret void
				}

				declare void @llvm.memset.p5i8.i64(i8 addrspace(5)* nocapture writeonly, i8, i64, i1 immarg)
				declare i32 @llvm.amdgcn.workitem.id.x()

llvm/test/CodeGen/AMDGPU/frame-index-elimination.ll

	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=kaveri -mattr=-promote-alloca -amdgpu-sroa=0 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,CI %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=kaveri -mattr=-promote-alloca -amdgpu-sroa=0 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,CI,MUBUF %s
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=-promote-alloca -amdgpu-sroa=0 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=-promote-alloca -amdgpu-sroa=0 -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GFX9,GFX9-MUBUF,MUBUF %s
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=-promote-alloca -amdgpu-sroa=0 -amdgpu-enable-flat-scratch -verify-machineinstrs < %s \| FileCheck -enable-var-scope -check-prefixes=GCN,GFX9,GFX9-FLATSCR %s
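Note on reading the check lines in this file: each RUN line passes its own `-check-prefixes` list to FileCheck, and a directive is only evaluated when its prefix appears in that list. A directive with an unlisted or misspelled prefix is silently skipped rather than reported. The sketch below is a simplified Python model of that selection, not FileCheck's actual implementation. The helper name `active_directive` and the reduced suffix list are assumptions made for illustration.

```python
import re

# Simplified subset of FileCheck directive suffixes (assumption: real
# FileCheck accepts more forms, e.g. -COUNT-<n>).
SUFFIXES = ("NEXT", "NOT", "DAG", "LABEL", "SAME", "EMPTY")


def active_directive(line, prefixes):
    """Return (prefix, kind) if line holds a directive for one of prefixes.

    Returns None when the line would be ignored for this prefix list.
    """
    # Try longer prefixes first so GFX9-FLATSCR is preferred over GFX9.
    ordered = sorted(prefixes, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    suffixes = "|".join(SUFFIXES)
    # A prefix must not be the tail of a longer word, and must be followed
    # by an optional known suffix and then a colon.
    pattern = rf"(?<![A-Za-z0-9_-])({alternation})(?:-({suffixes}))?:"
    match = re.search(pattern, line)
    if match is None:
        return None
    return match.group(1), match.group(2) or "CHECK"


flat_run = ["GCN", "GFX9", "GFX9-FLATSCR"]
print(active_directive("; GFX9-FLATSCR-NEXT: v_mov_b32_e32 v0, [[ADD]]", flat_run))
print(active_directive("; GFX9-MUBUF-NEXT: v_add_u32_e32 v0, 4, [[SCALED]]", flat_run))
print(active_directive("; FLATSCR: scratch_store_dword v0, off, s33 offset:", flat_run))
```

Under this model, with the prefix lists in the RUN lines above, a bare `FLATSCR:` directive is not selected by any run, and neither is a line spelled `GCN_NEXT:`.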

	; Test that non-entry function frame indices are expanded properly to			; Test that non-entry function frame indices are expanded properly to
	; give an index relative to the scratch wave offset register			; give an index relative to the scratch wave offset register

	; Materialize into a mov. Make sure there isn't an unnecessary copy.			; Materialize into a mov. Make sure there isn't an unnecessary copy.
	; GCN-LABEL: {{^}}func_mov_fi_i32:			; GCN-LABEL: {{^}}func_mov_fi_i32:
	; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

	; CI-NEXT: v_lshr_b32_e64 v0, s32, 6			; CI-NEXT: v_lshr_b32_e64 v0, s32, 6
	; GFX9-NEXT: v_lshrrev_b32_e64 v0, 6, s32			; GFX9-MUBUF-NEXT: v_lshrrev_b32_e64 v0, 6, s32

				; GFX9-FLATSCR: v_mov_b32_e32 v0, s32
				; GFX9-FLATSCR-NOT: v_lshrrev_b32_e64

				; MUBUF-NOT: v_mov

	; GCN-NOT: v_mov
	; GCN: ds_write_b32 v0, v0			; GCN: ds_write_b32 v0, v0
	define void @func_mov_fi_i32() #0 {			define void @func_mov_fi_i32() #0 {
	%alloca = alloca i32, addrspace(5)			%alloca = alloca i32, addrspace(5)
	store volatile i32 addrspace(5)* %alloca, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %alloca, i32 addrspace(5)* addrspace(3)* undef
	ret void			ret void
	}			}
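The MUBUF and flat-scratch expectations differ here because of how the stack pointer is scaled. With MUBUF the SP in `s32` is unswizzled (scaled by the wave size), so materializing a frame index needs a right shift by 6. With flat scratch the SP is swizzled (already a per-lane value), so a plain `v_mov_b32` is enough. The following is a minimal sketch of that arithmetic, assuming wave64 as implied by the shift amount in the checks. The function name and the sample SP values are hypothetical.

```python
# Assumption: wave64, matching the shift by 6 in the MUBUF check lines.
WAVE_SIZE_LOG2 = 6


def frame_index_value(sp, object_offset, flat_scratch):
    """Model the per-lane value a frame index is materialized to."""
    if flat_scratch:
        # Swizzled SP: already a per-lane offset, used as is.
        return sp + object_offset
    # Unswizzled SP: scaled by the wave size, so shift it down first.
    return (sp >> WAVE_SIZE_LOG2) + object_offset


# The same stack object at per-lane offset 16 + 4 under both schemes.
print(frame_index_value(16 << WAVE_SIZE_LOG2, 4, flat_scratch=False))  # MUBUF
print(frame_index_value(16, 4, flat_scratch=True))                     # flat scratch
```

Both calls yield the same per-lane value. Only the representation of the SP differs, which is why the two access schemes cannot be mixed within one stack.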

	; Offset due to different objects			; Offset due to different objects
	; GCN-LABEL: {{^}}func_mov_fi_i32_offset:			; GCN-LABEL: {{^}}func_mov_fi_i32_offset:
	; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

	; CI-DAG: v_lshr_b32_e64 v0, s32, 6			; CI-DAG: v_lshr_b32_e64 v0, s32, 6
	; CI-NOT: v_mov			; CI-NOT: v_mov
	; CI: ds_write_b32 v0, v0			; CI: ds_write_b32 v0, v0
	; CI-NEXT: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6			; CI-NEXT: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6
	; CI-NEXT: v_add_i32_e{{32\|64}} v0, {{s\[[0-9]+:[0-9]+\]\|vcc}}, 4, [[SCALED]]			; CI-NEXT: v_add_i32_e{{32\|64}} v0, {{s\[[0-9]+:[0-9]+\]\|vcc}}, 4, [[SCALED]]
	; CI-NEXT: ds_write_b32 v0, v0			; CI-NEXT: ds_write_b32 v0, v0

	; GFX9: v_lshrrev_b32_e64 v0, 6, s32			; GFX9-MUBUF-NEXT: v_lshrrev_b32_e64 v0, 6, s32
				; GFX9-FLATSCR: v_mov_b32_e32 v0, s32
				; GFX9-FLATSCR: s_add_u32 [[ADD:[^,]+]], s32, 4
	; GFX9-NEXT: ds_write_b32 v0, v0			; GFX9-NEXT: ds_write_b32 v0, v0
	; GFX9-NEXT: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32			; GFX9-MUBUF-NEXT: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32
	; GFX9-NEXT: v_add_u32_e32 v0, 4, [[SCALED]]			; GFX9-MUBUF-NEXT: v_add_u32_e32 v0, 4, [[SCALED]]
				; GFX9-FLATSCR-NEXT: v_mov_b32_e32 v0, [[ADD]]
	; GFX9-NEXT: ds_write_b32 v0, v0			; GFX9-NEXT: ds_write_b32 v0, v0
	define void @func_mov_fi_i32_offset() #0 {			define void @func_mov_fi_i32_offset() #0 {
	%alloca0 = alloca i32, addrspace(5)			%alloca0 = alloca i32, addrspace(5)
	%alloca1 = alloca i32, addrspace(5)			%alloca1 = alloca i32, addrspace(5)
	store volatile i32 addrspace(5)* %alloca0, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %alloca0, i32 addrspace(5)* addrspace(3)* undef
	store volatile i32 addrspace(5)* %alloca1, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %alloca1, i32 addrspace(5)* addrspace(3)* undef
	ret void			ret void
	}			}

	; Materialize into an add of a constant offset from the FI.			; Materialize into an add of a constant offset from the FI.
	; FIXME: Should be able to merge adds			; FIXME: Should be able to merge adds

	; GCN-LABEL: {{^}}func_add_constant_to_fi_i32:			; GCN-LABEL: {{^}}func_add_constant_to_fi_i32:
	; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)

	; CI: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6			; CI: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6
	; CI-NEXT: v_add_i32_e32 v0, vcc, 4, [[SCALED]]			; CI-NEXT: v_add_i32_e32 v0, vcc, 4, [[SCALED]]

	; GFX9: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32			; GFX9-MUBUF: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32
	; GFX9-NEXT: v_add_u32_e32 v0, 4, [[SCALED]]			; GFX9-MUBUF-NEXT: v_add_u32_e32 v0, 4, [[SCALED]]

				; GFX9-FLATSCR: v_mov_b32_e32 [[ADD:v[0-9]+]], s32
				; GFX9-FLATSCR-NEXT: v_add_u32_e32 v0, 4, [[ADD]]

	; GCN-NOT: v_mov			; GCN-NOT: v_mov
	; GCN: ds_write_b32 v0, v0			; GCN: ds_write_b32 v0, v0
	define void @func_add_constant_to_fi_i32() #0 {			define void @func_add_constant_to_fi_i32() #0 {
	%alloca = alloca [2 x i32], align 4, addrspace(5)			%alloca = alloca [2 x i32], align 4, addrspace(5)
	%gep0 = getelementptr inbounds [2 x i32], [2 x i32] addrspace(5)* %alloca, i32 0, i32 1			%gep0 = getelementptr inbounds [2 x i32], [2 x i32] addrspace(5)* %alloca, i32 0, i32 1
	store volatile i32 addrspace(5)* %gep0, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %gep0, i32 addrspace(5)* addrspace(3)* undef
	ret void			ret void
	}			}

	; A user the materialized frame index can't be meaningfully folded			; A user the materialized frame index can't be meaningfully folded
	; into.			; into.

	; GCN-LABEL: {{^}}func_other_fi_user_i32:			; GCN-LABEL: {{^}}func_other_fi_user_i32:

	; CI: v_lshr_b32_e64 v0, s32, 6			; CI: v_lshr_b32_e64 v0, s32, 6

	; GFX9: v_lshrrev_b32_e64 v0, 6, s32			; GFX9-MUBUF: v_lshrrev_b32_e64 v0, 6, s32
				; GFX9-FLATSCR: v_mov_b32_e32 v0, s32

	; GCN-NEXT: v_mul_u32_u24_e32 v0, 9, v0			; GCN-NEXT: v_mul_u32_u24_e32 v0, 9, v0
	; GCN-NOT: v_mov			; GCN-NOT: v_mov
	; GCN: ds_write_b32 v0, v0			; GCN: ds_write_b32 v0, v0
	define void @func_other_fi_user_i32() #0 {			define void @func_other_fi_user_i32() #0 {
	%alloca = alloca [2 x i32], align 4, addrspace(5)			%alloca = alloca [2 x i32], align 4, addrspace(5)
	%ptrtoint = ptrtoint [2 x i32] addrspace(5)* %alloca to i32			%ptrtoint = ptrtoint [2 x i32] addrspace(5)* %alloca to i32
	%mul = mul i32 %ptrtoint, 9			%mul = mul i32 %ptrtoint, 9
	store volatile i32 %mul, i32 addrspace(3)* undef			store volatile i32 %mul, i32 addrspace(3)* undef
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}func_store_private_arg_i32_ptr:			; GCN-LABEL: {{^}}func_store_private_arg_i32_ptr:
	; GCN: v_mov_b32_e32 v1, 15{{$}}			; GCN: v_mov_b32_e32 v1, 15{{$}}
	; GCN: buffer_store_dword v1, v0, s[0:3], 0 offen{{$}}			; MUBUF: buffer_store_dword v1, v0, s[0:3], 0 offen{{$}}
				; GFX9-FLATSCR: scratch_store_dword v1, v0, off{{$}}
	define void @func_store_private_arg_i32_ptr(i32 addrspace(5)* %ptr) #0 {			define void @func_store_private_arg_i32_ptr(i32 addrspace(5)* %ptr) #0 {
	store volatile i32 15, i32 addrspace(5)* %ptr			store volatile i32 15, i32 addrspace(5)* %ptr
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}func_load_private_arg_i32_ptr:			; GCN-LABEL: {{^}}func_load_private_arg_i32_ptr:
	; GCN: s_waitcnt			; GCN: s_waitcnt
	; GCN-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen{{$}}			; MUBUF-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen{{$}}
				; GFX9-FLATSCR-NEXT: scratch_load_dword v0, v0, off{{$}}
	define void @func_load_private_arg_i32_ptr(i32 addrspace(5)* %ptr) #0 {			define void @func_load_private_arg_i32_ptr(i32 addrspace(5)* %ptr) #0 {
	%val = load volatile i32, i32 addrspace(5)* %ptr			%val = load volatile i32, i32 addrspace(5)* %ptr
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr:			; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr:
	; GCN: s_waitcnt			; GCN: s_waitcnt

	; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6			; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6
	; CI-NEXT: v_or_b32_e32 v0, 4, [[SHIFT]]			; CI-NEXT: v_or_b32_e32 v0, 4, [[SHIFT]]

	; GFX9: v_lshrrev_b32_e64 [[SHIFT:v[0-9]+]], 6, s32			; GFX9-MUBUF: v_lshrrev_b32_e64 [[SHIFT:v[0-9]+]], 6, s32
	; GFX9-NEXT: v_or_b32_e32 v0, 4, [[SHIFT]]			; GFX9-MUBUF-NEXT: v_or_b32_e32 v0, 4, [[SHIFT]]

				; GFX9-FLATSCR: v_mov_b32_e32 [[SP:v[0-9]+]], s32
				; GFX9-FLATSCR-NEXT: v_or_b32_e32 v0, 4, [[SP]]

	; GCN-NOT: v_mov			; GCN-NOT: v_mov
	; GCN: ds_write_b32 v0, v0			; GCN: ds_write_b32 v0, v0
	define void @void_func_byval_struct_i8_i32_ptr({ i8, i32 } addrspace(5)* byval %arg0) #0 {			define void @void_func_byval_struct_i8_i32_ptr({ i8, i32 } addrspace(5)* byval %arg0) #0 {
	%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0			%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0
	%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1			%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
	%load1 = load i32, i32 addrspace(5)* %gep1			%load1 = load i32, i32 addrspace(5)* %gep1
	store volatile i32 addrspace(5)* %gep1, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %gep1, i32 addrspace(5)* addrspace(3)* undef
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr_value:			; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr_value:
	; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GCN: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: buffer_load_ubyte v0, off, s[0:3], s32			; MUBUF-NEXT: buffer_load_ubyte v0, off, s[0:3], s32
	; GCN_NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:4			; MUBUF-NEXT: buffer_load_dword v1, off, s[0:3], s32 offset:4
				; GFX9-FLATSCR-NEXT: scratch_load_ubyte v0, off, s32
				; GFX9-FLATSCR-NEXT: scratch_load_dword v1, off, s32 offset:4
	define void @void_func_byval_struct_i8_i32_ptr_value({ i8, i32 } addrspace(5)* byval %arg0) #0 {			define void @void_func_byval_struct_i8_i32_ptr_value({ i8, i32 } addrspace(5)* byval %arg0) #0 {
	%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0			%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0
	%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1			%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
	%load0 = load i8, i8 addrspace(5)* %gep0			%load0 = load i8, i8 addrspace(5)* %gep0
	%load1 = load i32, i32 addrspace(5)* %gep1			%load1 = load i32, i32 addrspace(5)* %gep1
	store volatile i8 %load0, i8 addrspace(3)* undef			store volatile i8 %load0, i8 addrspace(3)* undef
	store volatile i32 %load1, i32 addrspace(3)* undef			store volatile i32 %load1, i32 addrspace(3)* undef
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr_nonentry_block:			; GCN-LABEL: {{^}}void_func_byval_struct_i8_i32_ptr_nonentry_block:

	; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6			; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6

	; GFX9: v_lshrrev_b32_e64 [[SHIFT:v[0-9]+]], 6, s32			; GFX9-MUBUF: v_lshrrev_b32_e64 [[SP:v[0-9]+]], 6, s32
				; GFX9-FLATSCR: v_mov_b32_e32 [[SP:v[0-9]+]], s32

	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64

	; CI: v_add_i32_e32 [[GEP:v[0-9]+]], vcc, 4, [[SHIFT]]			; CI: v_add_i32_e32 [[GEP:v[0-9]+]], vcc, 4, [[SHIFT]]
	; CI: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4{{$}}			; CI: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4{{$}}

	; GFX9: v_add_u32_e32 [[GEP:v[0-9]+]], 4, [[SHIFT]]			; GFX9: v_add_u32_e32 [[GEP:v[0-9]+]], 4, [[SP]]
	; GFX9: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4{{$}}			; GFX9-MUBUF: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4{{$}}
				; GFX9-FLATSCR: scratch_load_dword v{{[0-9]+}}, [[SP]], off offset:4{{$}}

	; GCN: ds_write_b32 v{{[0-9]+}}, [[GEP]]			; GCN: ds_write_b32 v{{[0-9]+}}, [[GEP]]
	define void @void_func_byval_struct_i8_i32_ptr_nonentry_block({ i8, i32 } addrspace(5)* byval %arg0, i32 %arg2) #0 {			define void @void_func_byval_struct_i8_i32_ptr_nonentry_block({ i8, i32 } addrspace(5)* byval %arg0, i32 %arg2) #0 {
	%cmp = icmp eq i32 %arg2, 0			%cmp = icmp eq i32 %arg2, 0
	br i1 %cmp, label %bb, label %ret			br i1 %cmp, label %bb, label %ret

	bb:			bb:
	%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0			%gep0 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 0
	%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1			%gep1 = getelementptr inbounds { i8, i32 }, { i8, i32 } addrspace(5)* %arg0, i32 0, i32 1
	%load1 = load volatile i32, i32 addrspace(5)* %gep1			%load1 = load volatile i32, i32 addrspace(5)* %gep1
	store volatile i32 addrspace(5)* %gep1, i32 addrspace(5)* addrspace(3)* undef			store volatile i32 addrspace(5)* %gep1, i32 addrspace(5)* addrspace(3)* undef
	br label %ret			br label %ret

	ret:			ret:
	ret void			ret void
	}			}

	; Added offset can't be used with VOP3 add			; Added offset can't be used with VOP3 add
	; GCN-LABEL: {{^}}func_other_fi_user_non_inline_imm_offset_i32:			; GCN-LABEL: {{^}}func_other_fi_user_non_inline_imm_offset_i32:

	; CI-DAG: s_movk_i32 [[K:s[0-9]+\|vcc_lo\|vcc_hi]], 0x200			; CI-DAG: s_movk_i32 [[K:s[0-9]+\|vcc_lo\|vcc_hi]], 0x200
	; CI-DAG: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6			; CI-DAG: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6
	; CI: v_add_i32_e32 [[VZ:v[0-9]+]], vcc, [[K]], [[SCALED]]			; CI: v_add_i32_e32 [[VZ:v[0-9]+]], vcc, [[K]], [[SCALED]]

	; GFX9-DAG: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32			; GFX9-MUBUF-DAG: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32
	; GFX9: v_add_u32_e32 [[VZ:v[0-9]+]], 0x200, [[SCALED]]			; GFX9-MUBUF: v_add_u32_e32 [[VZ:v[0-9]+]], 0x200, [[SCALED]]

				; GFX9-FLATSCR-DAG: s_add_u32 [[SZ:[^,]+]], s32, 0x200
				; GFX9-FLATSCR: v_mov_b32_e32 [[VZ:v[0-9]+]], [[SZ]]

	; GCN: v_mul_u32_u24_e32 [[VZ]], 9, [[VZ]]			; GCN: v_mul_u32_u24_e32 [[VZ]], 9, [[VZ]]
	; GCN: ds_write_b32 v0, [[VZ]]			; GCN: ds_write_b32 v0, [[VZ]]
	define void @func_other_fi_user_non_inline_imm_offset_i32() #0 {			define void @func_other_fi_user_non_inline_imm_offset_i32() #0 {
	%alloca0 = alloca [128 x i32], align 4, addrspace(5)			%alloca0 = alloca [128 x i32], align 4, addrspace(5)
	%alloca1 = alloca [8 x i32], align 4, addrspace(5)			%alloca1 = alloca [8 x i32], align 4, addrspace(5)
	%gep0 = getelementptr inbounds [128 x i32], [128 x i32] addrspace(5)* %alloca0, i32 0, i32 65			%gep0 = getelementptr inbounds [128 x i32], [128 x i32] addrspace(5)* %alloca0, i32 0, i32 65
	%gep1 = getelementptr inbounds [8 x i32], [8 x i32] addrspace(5)* %alloca1, i32 0, i32 0			%gep1 = getelementptr inbounds [8 x i32], [8 x i32] addrspace(5)* %alloca1, i32 0, i32 0
	store volatile i32 7, i32 addrspace(5)* %gep0			store volatile i32 7, i32 addrspace(5)* %gep0
	%ptrtoint = ptrtoint i32 addrspace(5)* %gep1 to i32			%ptrtoint = ptrtoint i32 addrspace(5)* %gep1 to i32
	%mul = mul i32 %ptrtoint, 9			%mul = mul i32 %ptrtoint, 9
	store volatile i32 %mul, i32 addrspace(3)* undef			store volatile i32 %mul, i32 addrspace(3)* undef
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}func_other_fi_user_non_inline_imm_offset_i32_vcc_live:			; GCN-LABEL: {{^}}func_other_fi_user_non_inline_imm_offset_i32_vcc_live:

	; CI-DAG: s_movk_i32 [[OFFSET:s[0-9]+]], 0x200			; CI-DAG: s_movk_i32 [[OFFSET:s[0-9]+]], 0x200
	; CI-DAG: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6			; CI-DAG: v_lshr_b32_e64 [[SCALED:v[0-9]+]], s32, 6
	; CI: v_add_i32_e64 [[VZ:v[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, [[OFFSET]], [[SCALED]]			; CI: v_add_i32_e64 [[VZ:v[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, [[OFFSET]], [[SCALED]]

	; GFX9-DAG: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32			; GFX9-MUBUF-DAG: v_lshrrev_b32_e64 [[SCALED:v[0-9]+]], 6, s32
	; GFX9: v_add_u32_e32 [[VZ:v[0-9]+]], 0x200, [[SCALED]]			; GFX9-MUBUF: v_add_u32_e32 [[VZ:v[0-9]+]], 0x200, [[SCALED]]

				; GFX9-FLATSCR-DAG: s_add_u32 [[SZ:[^,]+]], s32, 0x200
				; GFX9-FLATSCR: v_mov_b32_e32 [[VZ:v[0-9]+]], [[SZ]]

	; GCN: v_mul_u32_u24_e32 [[VZ]], 9, [[VZ]]			; GCN: v_mul_u32_u24_e32 [[VZ]], 9, [[VZ]]
	; GCN: ds_write_b32 v0, [[VZ]]			; GCN: ds_write_b32 v0, [[VZ]]
	define void @func_other_fi_user_non_inline_imm_offset_i32_vcc_live() #0 {			define void @func_other_fi_user_non_inline_imm_offset_i32_vcc_live() #0 {
	%alloca0 = alloca [128 x i32], align 4, addrspace(5)			%alloca0 = alloca [128 x i32], align 4, addrspace(5)
	%alloca1 = alloca [8 x i32], align 4, addrspace(5)			%alloca1 = alloca [8 x i32], align 4, addrspace(5)
	%vcc = call i64 asm sideeffect "; def $0", "={vcc}"()			%vcc = call i64 asm sideeffect "; def $0", "={vcc}"()
	%gep0 = getelementptr inbounds [128 x i32], [128 x i32] addrspace(5)* %alloca0, i32 0, i32 65			%gep0 = getelementptr inbounds [128 x i32], [128 x i32] addrspace(5)* %alloca0, i32 0, i32 65
	%gep1 = getelementptr inbounds [8 x i32], [8 x i32] addrspace(5)* %alloca1, i32 0, i32 0			%gep1 = getelementptr inbounds [8 x i32], [8 x i32] addrspace(5)* %alloca1, i32 0, i32 0
	store volatile i32 7, i32 addrspace(5)* %gep0			store volatile i32 7, i32 addrspace(5)* %gep0
	call void asm sideeffect "; use $0", "{vcc}"(i64 %vcc)			call void asm sideeffect "; use $0", "{vcc}"(i64 %vcc)
	%ptrtoint = ptrtoint i32 addrspace(5)* %gep1 to i32			%ptrtoint = ptrtoint i32 addrspace(5)* %gep1 to i32
	%mul = mul i32 %ptrtoint, 9			%mul = mul i32 %ptrtoint, 9
	store volatile i32 %mul, i32 addrspace(3)* undef			store volatile i32 %mul, i32 addrspace(3)* undef
	ret void			ret void
	}			}

	declare void @func(<4 x float> addrspace(5)* nocapture) #0			declare void @func(<4 x float> addrspace(5)* nocapture) #0

	; undef flag not preserved in eliminateFrameIndex when handling the			; undef flag not preserved in eliminateFrameIndex when handling the
	; stores in the middle block.			; stores in the middle block.

	; GCN-LABEL: {{^}}undefined_stack_store_reg:			; GCN-LABEL: {{^}}undefined_stack_store_reg:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: buffer_store_dword v0, off, s[0:3], s33 offset:			; MUBUF: buffer_store_dword v0, off, s[0:3], s33 offset:
	; GCN: buffer_store_dword v0, off, s[0:3], s33 offset:			; MUBUF: buffer_store_dword v0, off, s[0:3], s33 offset:
	; GCN: buffer_store_dword v0, off, s[0:3], s33 offset:			; MUBUF: buffer_store_dword v0, off, s[0:3], s33 offset:
	; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33 offset:			; MUBUF: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s33 offset:
				; FLATSCR: scratch_store_dword v0, off, s33 offset:
				; FLATSCR: scratch_store_dword v0, off, s33 offset:
				; FLATSCR: scratch_store_dword v0, off, s33 offset:
				; FLATSCR: scratch_store_dword v{{[0-9]+}}, off, s33 offset:
	define void @undefined_stack_store_reg(float %arg, i32 %arg1) #0 {			define void @undefined_stack_store_reg(float %arg, i32 %arg1) #0 {
	bb:			bb:
	%tmp = alloca <4 x float>, align 16, addrspace(5)			%tmp = alloca <4 x float>, align 16, addrspace(5)
	%tmp2 = insertelement <4 x float> undef, float %arg, i32 0			%tmp2 = insertelement <4 x float> undef, float %arg, i32 0
	store <4 x float> %tmp2, <4 x float> addrspace(5)* undef			store <4 x float> %tmp2, <4 x float> addrspace(5)* undef
	%tmp3 = icmp eq i32 %arg1, 0			%tmp3 = icmp eq i32 %arg1, 0
	br i1 %tmp3, label %bb4, label %bb5			br i1 %tmp3, label %bb4, label %bb5

	bb4:			bb4:
	call void @func(<4 x float> addrspace(5)* nonnull undef)			call void @func(<4 x float> addrspace(5)* nonnull undef)
	store <4 x float> %tmp2, <4 x float> addrspace(5)* %tmp, align 16			store <4 x float> %tmp2, <4 x float> addrspace(5)* %tmp, align 16
	call void @func(<4 x float> addrspace(5)* nonnull %tmp)			call void @func(<4 x float> addrspace(5)* nonnull %tmp)
	br label %bb5			br label %bb5

	bb5:			bb5:
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}alloca_ptr_nonentry_block:			; GCN-LABEL: {{^}}alloca_ptr_nonentry_block:
	; GCN: s_and_saveexec_b64			; GCN: s_and_saveexec_b64
	; GCN: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4			; MUBUF: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s32 offset:4
				; FLATSCR: scratch_load_dword v{{[0-9]+}}, off, s32 offset:4

	; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6			; CI: v_lshr_b32_e64 [[SHIFT:v[0-9]+]], s32, 6
	; CI-NEXT: v_or_b32_e32 [[PTR:v[0-9]+]], 4, [[SHIFT]]			; CI-NEXT: v_or_b32_e32 [[PTR:v[0-9]+]], 4, [[SHIFT]]

	; GFX9: v_lshrrev_b32_e64 [[SHIFT:v[0-9]+]], 6, s32			; GFX9-MUBUF: v_lshrrev_b32_e64 [[SHIFT:v[0-9]+]], 6, s32
	; GFX9-NEXT: v_or_b32_e32 [[PTR:v[0-9]+]], 4, [[SHIFT]]			; GFX9-MUBUF-NEXT: v_or_b32_e32 [[PTR:v[0-9]+]], 4, [[SHIFT]]

				; GFX9-FLATSCR: v_mov_b32_e32 [[SP:v[0-9]+]], s32
				; GFX9-FLATSCR-NEXT: v_or_b32_e32 [[PTR:v[0-9]+]], 4, [[SP]]

	; GCN: ds_write_b32 v{{[0-9]+}}, [[PTR]]			; GCN: ds_write_b32 v{{[0-9]+}}, [[PTR]]
	define void @alloca_ptr_nonentry_block(i32 %arg0) #0 {			define void @alloca_ptr_nonentry_block(i32 %arg0) #0 {
	%alloca0 = alloca { i8, i32 }, align 4, addrspace(5)			%alloca0 = alloca { i8, i32 }, align 4, addrspace(5)
	%cmp = icmp eq i32 %arg0, 0			%cmp = icmp eq i32 %arg0, 0
	br i1 %cmp, label %bb, label %ret			br i1 %cmp, label %bb, label %ret

	bb:			bb:
	[11 lines not shown]

llvm/test/CodeGen/AMDGPU/load-hi16.ll

; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX900 %s		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX900,GFX900-MUBUF %s
; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX906,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX906,NO-D16-HI %s
; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX803,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX803,NO-D16-HI %s
		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -amdgpu-enable-flat-scratch -verify-machineinstrs < %s \| FileCheck -check-prefixes=GCN,GFX900,GFX900-FLATSCR %s

; GCN-LABEL: {{^}}load_local_lo_hi_v2i16_multi_use_lo:		; GCN-LABEL: {{^}}load_local_lo_hi_v2i16_multi_use_lo:
; GFX900: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: ds_read_u16 v2, v0		; GFX900-NEXT: ds_read_u16 v2, v0
; GFX900-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0		; GFX900-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
; GFX900-DAG: s_waitcnt lgkmcnt(0)		; GFX900-DAG: s_waitcnt lgkmcnt(0)
; GFX900-DAG: v_mov_b32_e32 v1, v2		; GFX900-DAG: v_mov_b32_e32 v1, v2
; GFX900-DAG: ds_read_u16_d16_hi v1, v0 offset:16		; GFX900-DAG: ds_read_u16_d16_hi v1, v0 offset:16
[476 lines not shown]	entry:
%build0 = insertelement <2 x half> undef, half %reg, i32 0		%build0 = insertelement <2 x half> undef, half %reg, i32 0
%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1		%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}		; GFX900-MUBUF: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}
		; GFX900-FLATSCR: scratch_load_short_d16_hi v0, off, s32 offset:4094{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], s32 offset:4094{{$}}		; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], s32 offset:4094{{$}}
define void @load_private_hi_v2i16_reglo_vreg(i16 addrspace(5)* byval %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg(i16 addrspace(5)* byval %in, i16 %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047		%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047
%load = load i16, i16 addrspace(5)* %gep		%load = load i16, i16 addrspace(5)* %gep
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg:		; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}		; GFX900-MUBUF: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}
		; GFX900-FLATSCR: scratch_load_short_d16_hi v0, off, s32 offset:4094{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], s32 offset:4094{{$}}		; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], s32 offset:4094{{$}}
define void @load_private_hi_v2f16_reglo_vreg(half addrspace(5)* byval %in, half %reg) #0 {		define void @load_private_hi_v2f16_reglo_vreg(half addrspace(5)* byval %in, half %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds half, half addrspace(5)* %in, i64 2047		%gep = getelementptr inbounds half, half addrspace(5)* %in, i64 2047
%load = load half, half addrspace(5)* %gep		%load = load half, half addrspace(5)* %gep
%build0 = insertelement <2 x half> undef, half %reg, i32 0		%build0 = insertelement <2 x half> undef, half %reg, i32 0
%build1 = insertelement <2 x half> %build0, half %load, i32 1		%build1 = insertelement <2 x half> %build0, half %load, i32 1
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_short_d16_hi v0, off, s[0:3], 0 offset:4094{{$}}		; GFX900-MUBUF: buffer_load_short_d16_hi v0, off, s[0:3], 0 offset:4094{{$}}
		; GFX900-FLATSCR: s_movk_i32 [[SOFF:[^,]+]], 0xffe
		; GFX900-FLATSCR: scratch_load_short_d16_hi v0, off, [[SOFF]]{{$}}
; GFX900: s_waitcnt		; GFX900: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], 0 offset:4094{{$}}		; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], 0 offset:4094{{$}}
define void @load_private_hi_v2i16_reglo_vreg_nooff(i16 addrspace(5)* byval %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_nooff(i16 addrspace(5)* byval %in, i16 %reg) #0 {
entry:		entry:
%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)		%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_nooff:		; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_nooff:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900-NEXT: buffer_load_short_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}		; GFX900-MUBUF-NEXT: buffer_load_short_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}
		; GFX900-FLATSCR-NEXT: s_movk_i32 [[SOFF:[^,]+]], 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16_hi v1, off, [[SOFF]]{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], 0 offset:4094{{$}}		; NO-D16-HI: buffer_load_ushort v{{[0-9]+}}, off, s[0:3], 0 offset:4094{{$}}
define void @load_private_hi_v2f16_reglo_vreg_nooff(half addrspace(5)* %in, half %reg) #0 {		define void @load_private_hi_v2f16_reglo_vreg_nooff(half addrspace(5)* %in, half %reg) #0 {
entry:		entry:
%load = load volatile half, half addrspace(5)* inttoptr (i32 4094 to half addrspace(5)*)		%load = load volatile half, half addrspace(5)* inttoptr (i32 4094 to half addrspace(5)*)
%build0 = insertelement <2 x half> undef, half %reg, i32 0		%build0 = insertelement <2 x half> undef, half %reg, i32 0
%build1 = insertelement <2 x half> %build0, half %load, i32 1		%build1 = insertelement <2 x half> %build0, half %load, i32 1
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_zexti8:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_zexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_ubyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}		; GFX900-MUBUF: buffer_load_ubyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}
		; GFX900-FLATSCR: scratch_load_ubyte_d16_hi v0, off, s32 offset:4095{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ubyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}		; NO-D16-HI: buffer_load_ubyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}
define void @load_private_hi_v2i16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, i16 %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_zexti8:		; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_zexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_ubyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}		; GFX900-MUBUF: buffer_load_ubyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}
		; GFX900-FLATSCR: scratch_load_ubyte_d16_hi v0, off, s32 offset:4095{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ubyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}		; NO-D16-HI: buffer_load_ubyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}
define void @load_private_hi_v2f16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, half %reg) #0 {		define void @load_private_hi_v2f16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, half %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%bitcast = bitcast i16 %ext to half		%bitcast = bitcast i16 %ext to half
%build0 = insertelement <2 x half> undef, half %reg, i32 0		%build0 = insertelement <2 x half> undef, half %reg, i32 0
%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1		%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_sexti8:		; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_sexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_sbyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}		; GFX900-MUBUF: buffer_load_sbyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}
		; GFX900-FLATSCR: scratch_load_sbyte_d16_hi v0, off, s32 offset:4095{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_sbyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}		; NO-D16-HI: buffer_load_sbyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}
define void @load_private_hi_v2f16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, half %reg) #0 {		define void @load_private_hi_v2f16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, half %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%bitcast = bitcast i16 %ext to half		%bitcast = bitcast i16 %ext to half
%build0 = insertelement <2 x half> undef, half %reg, i32 0		%build0 = insertelement <2 x half> undef, half %reg, i32 0
%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1		%build1 = insertelement <2 x half> %build0, half %bitcast, i32 1
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_sexti8:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_sexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_sbyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}		; GFX900-MUBUF: buffer_load_sbyte_d16_hi v0, off, s[0:3], s32 offset:4095{{$}}
		; GFX900-FLATSCR: scratch_load_sbyte_d16_hi v0, off, s32 offset:4095{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v0
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_sbyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}		; NO-D16-HI: buffer_load_sbyte v{{[0-9]+}}, off, s[0:3], s32 offset:4095{{$}}
define void @load_private_hi_v2i16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, i16 %reg) #0 {
entry:		entry:
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff_zexti8:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff_zexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900-NEXT: buffer_load_ubyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}
		; GFX900-FLATSCR-NEXT: s_movk_i32 [[SOFF:[^,]+]], 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16_hi v1, off, [[SOFF]]{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094{{$}}		; NO-D16-HI: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094{{$}}
define void @load_private_hi_v2i16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i16 %reg) #0 {
entry:		entry:
%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)		%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff_sexti8:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_nooff_sexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900-NEXT: buffer_load_sbyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}
		; GFX900-FLATSCR-NEXT: s_movk_i32 [[SOFF:[^,]+]], 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16_hi v1, off, [[SOFF]]{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094{{$}}		; NO-D16-HI: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094{{$}}
define void @load_private_hi_v2i16_reglo_vreg_nooff_sexti8(i8 addrspace(5)* %in, i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_nooff_sexti8(i8 addrspace(5)* %in, i16 %reg) #0 {
entry:		entry:
%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)		%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_nooff_zexti8:		; GCN-LABEL: {{^}}load_private_hi_v2f16_reglo_vreg_nooff_zexti8:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900-NEXT: buffer_load_ubyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16_hi v1, off, s[0:3], 0 offset:4094{{$}}
		; GFX900-FLATSCR-NEXT: s_movk_i32 [[SOFF:[^,]+]], 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16_hi v1, off, [[SOFF]]{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1		; GFX900-NEXT: global_store_dword v{{\[[0-9]+:[0-9]+\]}}, v1
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64

; NO-D16-HI: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094{{$}}		; NO-D16-HI: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094{{$}}
define void @load_private_hi_v2f16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, half %reg) #0 {		define void @load_private_hi_v2f16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, half %reg) #0 {
entry:		entry:
[83 lines not shown]
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

; Local object gives known offset, so requires converting from offen		; Local object gives known offset, so requires converting from offen
; to offset variant.		; to offset variant.

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_to_offset:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_to_offset:
; GFX900: buffer_store_dword		; GFX900-MUBUF: buffer_store_dword
; GFX900-NEXT: buffer_load_short_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4094
		; GFX900-FLATSCR: scratch_store_dword
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16_hi v{{[0-9]+}}, off, s32 offset:4094
define void @load_private_hi_v2i16_reglo_vreg_to_offset(i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_to_offset(i16 %reg) #0 {
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i16], align 2, addrspace(5)		%obj1 = alloca [4096 x i16], align 2, addrspace(5)
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027		%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027
%load = load i16, i16 addrspace(5)* %gep		%load = load i16, i16 addrspace(5)* %gep
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %load, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_sexti8_to_offset:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_sexti8_to_offset:
; GFX900: buffer_store_dword		; GFX900-MUBUF: buffer_store_dword
; GFX900-NEXT: buffer_load_sbyte_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4095
		; GFX900-FLATSCR: scratch_store_dword
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16_hi v{{[0-9]+}}, off, s32 offset:4095
define void @load_private_hi_v2i16_reglo_vreg_sexti8_to_offset(i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_sexti8_to_offset(i16 %reg) #0 {
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 0
%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %ext, i32 1
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_zexti8_to_offset:		; GCN-LABEL: {{^}}load_private_hi_v2i16_reglo_vreg_zexti8_to_offset:
; GFX900: buffer_store_dword		; GFX900-MUBUF: buffer_store_dword
; GFX900-NEXT: buffer_load_ubyte_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16_hi v{{[0-9]+}}, off, s[0:3], s32 offset:4095
		; GFX900-FLATSCR: scratch_store_dword
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16_hi v{{[0-9]+}}, off, s32 offset:4095
define void @load_private_hi_v2i16_reglo_vreg_zexti8_to_offset(i16 %reg) #0 {		define void @load_private_hi_v2i16_reglo_vreg_zexti8_to_offset(i16 %reg) #0 {
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
[134 lines not shown]
%build1 = insertelement <2 x i16> %build0, i16 %load1, i32 1		%build1 = insertelement <2 x i16> %build0, i16 %load1, i32 1
ret <2 x i16> %build1		ret <2 x i16> %build1
}		}

; FIXME: Remove m0 init and waitcnt between reads		; FIXME: Remove m0 init and waitcnt between reads
; FIXME: Is there a cost to using the extload over not?		; FIXME: Is there a cost to using the extload over not?
; GCN-LABEL: {{^}}load_private_v2i16_split:		; GCN-LABEL: {{^}}load_private_v2i16_split:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_load_ushort v0, off, s[0:3], s32{{$}}		; GFX900-MUBUF: buffer_load_ushort v0, off, s[0:3], s32{{$}}
		; GFX900-FLATSCR: scratch_load_ushort v0, off, s32{{$}}
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:2		; GFX900-MUBUF-NEXT: buffer_load_short_d16_hi v0, off, s[0:3], s32 offset:2
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16_hi v0, off, s32 offset:2
; GFX900-NEXT: s_waitcnt		; GFX900-NEXT: s_waitcnt
; GFX900-NEXT: s_setpc_b64		; GFX900-NEXT: s_setpc_b64
define <2 x i16> @load_private_v2i16_split(i16 addrspace(5)* byval %in) #0 {		define <2 x i16> @load_private_v2i16_split(i16 addrspace(5)* byval %in) #0 {
entry:		entry:
%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i32 1		%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i32 1
%load0 = load volatile i16, i16 addrspace(5)* %in		%load0 = load volatile i16, i16 addrspace(5)* %in
%load1 = load volatile i16, i16 addrspace(5)* %gep		%load1 = load volatile i16, i16 addrspace(5)* %gep
%build0 = insertelement <2 x i16> undef, i16 %load0, i32 0		%build0 = insertelement <2 x i16> undef, i16 %load0, i32 0
[25 lines not shown]

llvm/test/CodeGen/AMDGPU/load-lo16.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX900 %s		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX900,GFX900-MUBUF %s
; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca,+sram-ecc -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX906,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca,+sram-ecc -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX906,NO-D16-HI %s
; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX803,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX803,NO-D16-HI %s
		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs --amdgpu-enable-flat-scratch < %s | FileCheck -check-prefixes=GCN,GFX900,GFX900-FLATSCR %s

define <2 x i16> @load_local_lo_v2i16_undeflo(i16 addrspace(3)* %in) #0 {		define <2 x i16> @load_local_lo_v2i16_undeflo(i16 addrspace(3)* %in) #0 {
; GFX900-LABEL: load_local_lo_v2i16_undeflo:		; GFX900-LABEL: load_local_lo_v2i16_undeflo:
; GFX900: ; %bb.0: ; %entry		; GFX900: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: ds_read_u16_d16 v0, v0		; GFX900-NEXT: ds_read_u16_d16 v0, v0
; GFX900-NEXT: s_waitcnt lgkmcnt(0)		; GFX900-NEXT: s_waitcnt lgkmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-NEXT: s_setpc_b64 s[30:31]
[1,159 lines not shown]
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%bitcast = bitcast i16 %ext to half		%bitcast = bitcast i16 %ext to half
%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg(i16 addrspace(5)* byval %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg(i16 addrspace(5)* byval %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0		; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v1, v0		; GFX803-NEXT: v_or_b32_e32 v0, v1, v0
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047		%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047
%load = load i16, i16 addrspace(5)* %gep		%load = load i16, i16 addrspace(5)* %gep
%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reghi_vreg(i16 addrspace(5)* byval %in, i16 %reg) #0 {		define void @load_private_lo_v2i16_reghi_vreg(i16 addrspace(5)* byval %in, i16 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reghi_vreg:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reghi_vreg:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: v_and_b32_e32 v1, 0xffff, v1		; GFX900-MUBUF-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX900-NEXT: v_lshl_or_b32 v0, v0, 16, v1		; GFX900-MUBUF-NEXT: v_lshl_or_b32 v0, v0, 16, v1
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reghi_vreg:		; GFX906-LABEL: load_private_lo_v2i16_reghi_vreg:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_and_b32_e32 v1, 0xffff, v1		; GFX906-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX906-NEXT: v_lshl_or_b32 v0, v0, 16, v1		; GFX906-NEXT: v_lshl_or_b32 v0, v0, 16, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg:		; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX803-NEXT: v_lshlrev_b32_e32 v0, 16, v0		; GFX803-NEXT: v_lshlrev_b32_e32 v0, 16, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v1, v0		; GFX803-NEXT: v_or_b32_e32 v0, v1, v0
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reghi_vreg:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: scratch_load_ushort v1, off, s32 offset:4094
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: v_and_b32_e32 v1, 0xffff, v1
		; GFX900-FLATSCR-NEXT: v_lshl_or_b32 v0, v0, 16, v1
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047		%gep = getelementptr inbounds i16, i16 addrspace(5)* %in, i64 2047
%load = load i16, i16 addrspace(5)* %gep		%load = load i16, i16 addrspace(5)* %gep
%build0 = insertelement <2 x i16> undef, i16 %reg, i32 1		%build0 = insertelement <2 x i16> undef, i16 %reg, i32 1
%build1 = insertelement <2 x i16> %build0, i16 %load, i32 0		%build1 = insertelement <2 x i16> %build0, i16 %load, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2f16_reglo_vreg(half addrspace(5)* byval %in, i32 %reg) #0 {		define void @load_private_lo_v2f16_reglo_vreg(half addrspace(5)* byval %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2f16_reglo_vreg:		; GFX900-MUBUF-LABEL: load_private_lo_v2f16_reglo_vreg:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg:		; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_and_b32_e32 v1, 0xffff, v1		; GFX906-NEXT: v_and_b32_e32 v1, 0xffff, v1
; GFX906-NEXT: v_lshl_or_b32 v0, v0, 16, v1		; GFX906-NEXT: v_lshl_or_b32 v0, v0, 16, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg:		; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v1, v0		; GFX803-NEXT: v_or_b32_e32 v0, v1, v0
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x half>		%reg.bc = bitcast i32 %reg to <2 x half>
%gep = getelementptr inbounds half, half addrspace(5)* %in, i64 2047		%gep = getelementptr inbounds half, half addrspace(5)* %in, i64 2047
%load = load half, half addrspace(5)* %gep		%load = load half, half addrspace(5)* %gep
%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_nooff(i16 addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_nooff(i16 addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1		; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1		; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v0, v1		; GFX803-NEXT: v_or_b32_e32 v0, v0, v1
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)		%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)
%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reghi_vreg_nooff(i16 addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reghi_vreg_nooff(i16 addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:		; GFX906-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1		; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:		; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1		; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v0, v1		; GFX803-NEXT: v_or_b32_e32 v0, v0, v1
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)		%load = load volatile i16, i16 addrspace(5)* inttoptr (i32 4094 to i16 addrspace(5)*)
%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2f16_reglo_vreg_nooff(half addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2f16_reglo_vreg_nooff(half addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:		; GFX900-MUBUF-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:		; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_lshrrev_b32_e32 v1, 16, v1		; GFX906-NEXT: v_lshrrev_b32_e32 v1, 16, v1
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_and_b32_e32 v0, 0xffff, v0		; GFX906-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX906-NEXT: v_lshl_or_b32 v0, v1, 16, v0		; GFX906-NEXT: v_lshl_or_b32 v0, v1, 16, v0
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:		; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094
; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1		; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v0, v1		; GFX803-NEXT: v_or_b32_e32 v0, v0, v1
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x half>		%reg.bc = bitcast i32 %reg to <2 x half>
%load = load volatile half, half addrspace(5)* inttoptr (i32 4094 to half addrspace(5)*)		%load = load volatile half, half addrspace(5)* inttoptr (i32 4094 to half addrspace(5)*)
%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_zexti8(i8 addrspace(5)* byval %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0		; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; GFX803-NEXT: s_mov_b32 s4, 0x5040c00		; GFX803-NEXT: s_mov_b32 s4, 0x5040c00
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4		; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_sexti8(i8 addrspace(5)* byval %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0		; GFX906-NEXT: v_bfi_b32 v0, v2, v1, v0
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD		; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095		%gep = getelementptr inbounds i8, i8 addrspace(5)* %in, i64 4095
%load = load i8, i8 addrspace(5)* %gep		%load = load i8, i8 addrspace(5)* %gep
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_ubyte_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1		; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v1		; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v1
; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], 0 offset:4094
; GFX803-NEXT: s_mov_b32 s4, 0x5040c00		; GFX803-NEXT: s_mov_b32 s4, 0x5040c00
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4		; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_zexti8:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)		%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_nooff_sexti8(i8 addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_nooff_sexti8(i8 addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_sbyte_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1		; GFX906-NEXT: v_bfi_b32 v0, v2, v0, v1
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:		; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_sbyte v0, off, s[0:3], 0 offset:4094
; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1		; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD		; GFX803-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_nooff_sexti8:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)		%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2f16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i32 %reg) #0 {		define void @load_private_lo_v2f16_reglo_vreg_nooff_zexti8(i8 addrspace(5)* %in, i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:		; GFX900-MUBUF-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: buffer_load_ubyte_d16 v1, off, s[0:3], 0 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16 v1, off, s[0:3], 0 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v1, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v1, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:		; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094		; GFX906-NEXT: buffer_load_ubyte v0, off, s[0:3], 0 offset:4094
; GFX906-NEXT: v_lshrrev_b32_e32 v1, 16, v1		; GFX906-NEXT: v_lshrrev_b32_e32 v1, 16, v1
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: v_and_b32_e32 v0, 0xffff, v0		; GFX906-NEXT: v_and_b32_e32 v0, 0xffff, v0
; GFX906-NEXT: v_lshl_or_b32 v0, v1, 16, v0		; GFX906-NEXT: v_lshl_or_b32 v0, v1, 16, v0
; GFX906-NEXT: global_store_dword v[0:1], v0, off		; GFX906-NEXT: global_store_dword v[0:1], v0, off
; GFX906-NEXT: s_waitcnt vmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0)
; GFX906-NEXT: s_setpc_b64 s[30:31]		; GFX906-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:		; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:
; GFX803: ; %bb.0: ; %entry		; GFX803: ; %bb.0: ; %entry
; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v1		; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v1
; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], 0 offset:4094		; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], 0 offset:4094
; GFX803-NEXT: s_mov_b32 s4, 0x5040c00		; GFX803-NEXT: s_mov_b32 s4, 0x5040c00
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4		; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_nooff_zexti8:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: s_movk_i32 s4, 0xffe
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16 v1, off, s4
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v1, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%reg.bc = bitcast i32 %reg to <2 x half>		%reg.bc = bitcast i32 %reg to <2 x half>
%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)		%load = load volatile i8, i8 addrspace(5)* inttoptr (i32 4094 to i8 addrspace(5)*)
%ext = zext i8 %load to i16		%ext = zext i8 %load to i16
%bc.ext = bitcast i16 %ext to half		%bc.ext = bitcast i16 %ext to half
%build1 = insertelement <2 x half> %reg.bc, half %bc.ext, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %bc.ext, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
; [171 lines not shown]		; [171 lines not shown]
%ext = sext i8 %load to i16		%ext = sext i8 %load to i16
%bitcast = bitcast i16 %ext to half		%bitcast = bitcast i16 %ext to half
%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_to_offset(i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_to_offset(i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX900-MUBUF-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX900-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX900-MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX900-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_load_short_d16 v0, off, s[0:3], s32 offset:4094
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX906-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; (10 lines not shown)
; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094		; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_e32 v0, v1, v0		; GFX803-NEXT: v_or_b32_e32 v0, v1, v0
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
		; GFX900-FLATSCR-NEXT: scratch_store_dword off, v1, s32
		; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i16], align 2, addrspace(5)		%obj1 = alloca [4096 x i16], align 2, addrspace(5)
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027		%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027
%load = load volatile i16, i16 addrspace(5)* %gep		%load = load volatile i16, i16 addrspace(5)* %gep
%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_sexti8_to_offset(i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_sexti8_to_offset(i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8_to_offset:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8_to_offset:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX900-MUBUF-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX900-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX900-MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX900-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8_to_offset:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8_to_offset:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; (10 lines not shown)
; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD		; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_sexti8_to_offset:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
		; GFX900-FLATSCR-NEXT: scratch_store_dword off, v1, s32
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load volatile i8, i8 addrspace(5)* %gep		%load = load volatile i8, i8 addrspace(5)* %gep
%load.ext = sext i8 %load to i16		%load.ext = sext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %load.ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load.ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2i16_reglo_vreg_zexti8_to_offset(i32 %reg) #0 {		define void @load_private_lo_v2i16_reglo_vreg_zexti8_to_offset(i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8_to_offset:		; GFX900-MUBUF-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8_to_offset:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX900-MUBUF-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX900-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX900-MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX900-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8_to_offset:		; GFX906-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8_to_offset:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff		; GFX906-NEXT: v_mov_b32_e32 v2, 0xffff
; (11 lines not shown)
; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; GFX803-NEXT: s_mov_b32 s4, 0x5040c00		; GFX803-NEXT: s_mov_b32 s4, 0x5040c00
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4		; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_zexti8_to_offset:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
		; GFX900-FLATSCR-NEXT: scratch_store_dword off, v1, s32
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%reg.bc = bitcast i32 %reg to <2 x i16>		%reg.bc = bitcast i32 %reg to <2 x i16>
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load volatile i8, i8 addrspace(5)* %gep		%load = load volatile i8, i8 addrspace(5)* %gep
%load.ext = zext i8 %load to i16		%load.ext = zext i8 %load to i16
%build1 = insertelement <2 x i16> %reg.bc, i16 %load.ext, i32 0		%build1 = insertelement <2 x i16> %reg.bc, i16 %load.ext, i32 0
store <2 x i16> %build1, <2 x i16> addrspace(1)* undef		store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2f16_reglo_vreg_sexti8_to_offset(i32 %reg) #0 {		define void @load_private_lo_v2f16_reglo_vreg_sexti8_to_offset(i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2f16_reglo_vreg_sexti8_to_offset:		; GFX900-MUBUF-LABEL: load_private_lo_v2f16_reglo_vreg_sexti8_to_offset:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX900-MUBUF-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX900-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX900-MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX900-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_sbyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_sexti8_to_offset:		; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_sexti8_to_offset:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; (11 lines not shown)
; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_sbyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0		; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD		; GFX803-NEXT: v_or_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_sexti8_to_offset:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
		; GFX900-FLATSCR-NEXT: scratch_store_dword off, v1, s32
		; GFX900-FLATSCR-NEXT: scratch_load_sbyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%reg.bc = bitcast i32 %reg to <2 x half>		%reg.bc = bitcast i32 %reg to <2 x half>
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load volatile i8, i8 addrspace(5)* %gep		%load = load volatile i8, i8 addrspace(5)* %gep
%load.ext = sext i8 %load to i16		%load.ext = sext i8 %load to i16
%bitcast = bitcast i16 %load.ext to half		%bitcast = bitcast i16 %load.ext to half
%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

define void @load_private_lo_v2f16_reglo_vreg_zexti8_to_offset(i32 %reg) #0 {		define void @load_private_lo_v2f16_reglo_vreg_zexti8_to_offset(i32 %reg) #0 {
; GFX900-LABEL: load_private_lo_v2f16_reglo_vreg_zexti8_to_offset:		; GFX900-MUBUF-LABEL: load_private_lo_v2f16_reglo_vreg_zexti8_to_offset:
; GFX900: ; %bb.0: ; %entry		; GFX900-MUBUF: ; %bb.0: ; %entry
; GFX900-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX900-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX900-MUBUF-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX900-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX900-MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX900-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_load_ubyte_d16 v0, off, s[0:3], s32 offset:4095
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: global_store_dword v[0:1], v0, off		; GFX900-MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GFX900-NEXT: s_waitcnt vmcnt(0)		; GFX900-MUBUF-NEXT: s_waitcnt vmcnt(0)
; GFX900-NEXT: s_setpc_b64 s[30:31]		; GFX900-MUBUF-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_zexti8_to_offset:		; GFX906-LABEL: load_private_lo_v2f16_reglo_vreg_zexti8_to_offset:
; GFX906: ; %bb.0: ; %entry		; GFX906: ; %bb.0: ; %entry
; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX906-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b		; GFX906-NEXT: v_mov_b32_e32 v1, 0x7b
; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32		; GFX906-NEXT: buffer_store_dword v1, off, s[0:3], s32
; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX906-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX906-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; (12 lines not shown)
; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095		; GFX803-NEXT: buffer_load_ubyte v1, off, s[0:3], s32 offset:4095
; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; GFX803-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; GFX803-NEXT: s_mov_b32 s4, 0x5040c00		; GFX803-NEXT: s_mov_b32 s4, 0x5040c00
; GFX803-NEXT: s_waitcnt vmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0)
; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4		; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
; GFX803-NEXT: flat_store_dword v[0:1], v0		; GFX803-NEXT: flat_store_dword v[0:1], v0
; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX803-NEXT: s_setpc_b64 s[30:31]		; GFX803-NEXT: s_setpc_b64 s[30:31]
		;
		; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_zexti8_to_offset:
		; GFX900-FLATSCR: ; %bb.0: ; %entry
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
		; GFX900-FLATSCR-NEXT: scratch_store_dword off, v1, s32
		; GFX900-FLATSCR-NEXT: scratch_load_ubyte_d16 v0, off, s32 offset:4095
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; GFX900-FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%reg.bc = bitcast i32 %reg to <2 x half>		%reg.bc = bitcast i32 %reg to <2 x half>
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%load = load volatile i8, i8 addrspace(5)* %gep		%load = load volatile i8, i8 addrspace(5)* %gep
%load.ext = zext i8 %load to i16		%load.ext = zext i8 %load to i16
%bitcast = bitcast i16 %load.ext to half		%bitcast = bitcast i16 %load.ext to half
%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0		%build1 = insertelement <2 x half> %reg.bc, half %bitcast, i32 0
store <2 x half> %build1, <2 x half> addrspace(1)* undef		store <2 x half> %build1, <2 x half> addrspace(1)* undef
ret void		ret void
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }
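
The immediate offsets in the `*_to_offset` checks above (`offset:4094` and `offset:4095`) can be reproduced from the IR. This is a minimal sketch, assuming `%obj0` sits at offset 0 from `s32` and `%obj1` follows it directly; that layout is inferred from the check lines rather than stated in the test, and `element_offset` is a hypothetical helper.

```python
# Hypothetical helper: byte offset of element `index` in an array that
# starts `base` bytes into the frame.
def element_offset(base, elem_size, index):
    return base + elem_size * index

# %obj0 is [10 x i32]; assumed to be placed at offset 0.
OBJ0_SIZE = 10 * 4

# Assumption: %obj1 starts right after %obj0 (offset 40 already
# satisfies its align 2, so no padding is needed).
i16_offset = element_offset(OBJ0_SIZE, 2, 2027)  # [4096 x i16], index 2027
i8_offset = element_offset(OBJ0_SIZE, 1, 4055)   # [4096 x i8], index 4055

print(i16_offset, i8_offset)
```

Under these assumptions the two results are 4094 and 4095, the same immediates the MUBUF and FLATSCR checks carry; 4095 is the largest value a 12-bit unsigned MUBUF offset field can hold.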

llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 < %s | FileCheck -check-prefixes=GCN,GFX9 %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 < %s | FileCheck -check-prefixes=GCN,GFX9,MUBUF %s
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx906 --amdgpu-enable-flat-scratch < %s | FileCheck -check-prefixes=GCN,GFX9,FLATSCR %s

	; Make sure the correct frame offset is used with the local			; Make sure the correct frame offset is used with the local
	; frame area.			; frame area.
	;			;
	; %pin.low is allocated to offset 0.			; %pin.low is allocated to offset 0.
	;			;
	; %local.area is assigned to the local frame offset by the			; %local.area is assigned to the local frame offset by the
	; LocalStackSlotAllocation pass at offset 4096.			; LocalStackSlotAllocation pass at offset 4096.
	;			;
	; The %load0 access to %gep.large.offset initially used the stack			; The %load0 access to %gep.large.offset initially used the stack
	; pointer register and directly referenced the frame index. After			; pointer register and directly referenced the frame index. After
	; LocalStackSlotAllocation, it would no longer refer to a frame index			; LocalStackSlotAllocation, it would no longer refer to a frame index
	; so eliminateFrameIndex would not adjust the access to use the			; so eliminateFrameIndex would not adjust the access to use the
	; correct FP offset.			; correct FP offset.
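
The offsets named in the comment above, and the constants that appear in the check lines of this test, follow from the alloca sizes and alignments in the IR. This is a rough sketch, assuming the two objects are laid out in declaration order with simple round-up alignment; the real frame layout code may order or pad objects differently, and `align_up` is a hypothetical helper.

```python
# Hypothetical helper: round `value` up to a power-of-two `alignment`.
def align_up(value, alignment):
    return (value + alignment - 1) & ~(alignment - 1)

# %pin.low: i32 (4 bytes); the comment places it at offset 0.
pin_low_offset = 0

# %local.area: [1060 x i64], align 4096, assumed to follow %pin.low.
local_area_offset = align_up(pin_low_offset + 4, 4096)

# Byte offsets of the two GEPs inside %local.area (8-byte elements).
large_offset = 1050 * 8  # %gep.large.offset
small_offset = 8 * 8     # %gep.small.offset

# Length passed to llvm.memset, equal to the size of %local.area.
memset_len = 1060 * 8

print(local_area_offset, hex(large_offset), small_offset, hex(memset_len))
```

The values 0x20d0, 64 and 0x2120 are the ones the check lines use for the large-offset load, the small-offset load and the loop bound of the expanded memset.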

	define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {			define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {
	; GCN-LABEL: local_stack_offset_uses_sp:			; MUBUF-LABEL: local_stack_offset_uses_sp:
	; GCN: ; %bb.0: ; %entry			; MUBUF: ; %bb.0: ; %entry
	; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0			; MUBUF-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
	; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9			; MUBUF-NEXT: s_add_u32 flat_scratch_lo, s6, s9
	; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0			; MUBUF-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
	; GCN-NEXT: s_add_u32 s0, s0, s9			; MUBUF-NEXT: s_add_u32 s0, s0, s9
	; GCN-NEXT: v_mov_b32_e32 v1, 0x3000			; MUBUF-NEXT: v_mov_b32_e32 v1, 0x3000
	; GCN-NEXT: s_addc_u32 s1, s1, 0			; MUBUF-NEXT: s_addc_u32 s1, s1, 0
	; GCN-NEXT: v_add_u32_e32 v0, 64, v1			; MUBUF-NEXT: v_add_u32_e32 v0, 64, v1
	; GCN-NEXT: v_mov_b32_e32 v2, 0			; MUBUF-NEXT: v_mov_b32_e32 v2, 0
	; GCN-NEXT: v_mov_b32_e32 v3, 0x2000			; MUBUF-NEXT: v_mov_b32_e32 v3, 0x2000
	; GCN-NEXT: s_mov_b32 s6, 0			; MUBUF-NEXT: s_mov_b32 s6, 0
	; GCN-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen			; MUBUF-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen
	; GCN-NEXT: BB0_1: ; %loadstoreloop			; MUBUF-NEXT: BB0_1: ; %loadstoreloop
	; GCN-NEXT: ; =>This Inner Loop Header: Depth=1			; MUBUF-NEXT: ; =>This Inner Loop Header: Depth=1
	; GCN-NEXT: v_add_u32_e32 v3, s6, v1			; MUBUF-NEXT: v_add_u32_e32 v3, s6, v1
	; GCN-NEXT: s_add_i32 s6, s6, 1			; MUBUF-NEXT: s_add_i32 s6, s6, 1
	; GCN-NEXT: s_cmpk_lt_u32 s6, 0x2120			; MUBUF-NEXT: s_cmpk_lt_u32 s6, 0x2120
	; GCN-NEXT: buffer_store_byte v2, v3, s[0:3], 0 offen			; MUBUF-NEXT: buffer_store_byte v2, v3, s[0:3], 0 offen
	; GCN-NEXT: s_cbranch_scc1 BB0_1			; MUBUF-NEXT: s_cbranch_scc1 BB0_1
	; GCN-NEXT: ; %bb.2: ; %split			; MUBUF-NEXT: ; %bb.2: ; %split
	; GCN-NEXT: v_mov_b32_e32 v1, 0x3000			; MUBUF-NEXT: v_mov_b32_e32 v1, 0x3000
	; GCN-NEXT: v_add_u32_e32 v1, 0x20d0, v1			; MUBUF-NEXT: v_add_u32_e32 v1, 0x20d0, v1
	; GCN-NEXT: buffer_load_dword v2, v1, s[0:3], 0 offen			; MUBUF-NEXT: buffer_load_dword v2, v1, s[0:3], 0 offen
	; GCN-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen offset:4			; MUBUF-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen offset:4
	; GCN-NEXT: buffer_load_dword v3, v0, s[0:3], 0 offen			; MUBUF-NEXT: buffer_load_dword v3, v0, s[0:3], 0 offen
	; GCN-NEXT: buffer_load_dword v4, v0, s[0:3], 0 offen offset:4			; MUBUF-NEXT: buffer_load_dword v4, v0, s[0:3], 0 offen offset:4
	; GCN-NEXT: s_waitcnt vmcnt(1)			; MUBUF-NEXT: s_waitcnt vmcnt(1)
	; GCN-NEXT: v_add_co_u32_e32 v0, vcc, v2, v3			; MUBUF-NEXT: v_add_co_u32_e32 v0, vcc, v2, v3
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v2, s4			; MUBUF-NEXT: v_mov_b32_e32 v2, s4
	; GCN-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v4, vcc			; MUBUF-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v4, vcc
	; GCN-NEXT: v_mov_b32_e32 v3, s5			; MUBUF-NEXT: v_mov_b32_e32 v3, s5
	; GCN-NEXT: global_store_dwordx2 v[2:3], v[0:1], off			; MUBUF-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
	; GCN-NEXT: s_endpgm			; MUBUF-NEXT: s_endpgm
				;
				; FLATSCR-LABEL: local_stack_offset_uses_sp:
				; FLATSCR: ; %bb.0: ; %entry
				; FLATSCR-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
				; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s6, s9
				; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
				; FLATSCR-NEXT: v_mov_b32_e32 v0, 0
				; FLATSCR-NEXT: s_movk_i32 vcc_hi, 0x2000
				; FLATSCR-NEXT: s_mov_b32 s6, 0
				; FLATSCR-NEXT: scratch_store_dword off, v0, vcc_hi
				; FLATSCR-NEXT: BB0_1: ; %loadstoreloop
				; FLATSCR-NEXT: ; =>This Inner Loop Header: Depth=1
				; FLATSCR-NEXT: s_add_u32 s7, 0x3000, s6
				; FLATSCR-NEXT: s_add_i32 s6, s6, 1
				; FLATSCR-NEXT: s_cmpk_lt_u32 s6, 0x2120
				; FLATSCR-NEXT: scratch_store_byte off, v0, s7
				; FLATSCR-NEXT: s_cbranch_scc1 BB0_1
				; FLATSCR-NEXT: ; %bb.2: ; %split
				; FLATSCR-NEXT: s_movk_i32 s6, 0x20d0
				; FLATSCR-NEXT: s_add_u32 s6, 0x3000, s6
				; FLATSCR-NEXT: scratch_load_dword v1, off, s6 offset:4
				; FLATSCR-NEXT: s_movk_i32 s6, 0x2000
				; FLATSCR-NEXT: s_add_u32 s6, 0x3000, s6
				; FLATSCR-NEXT: scratch_load_dword v0, off, s6 offset:208
				; FLATSCR-NEXT: s_movk_i32 s6, 0x3000
				; FLATSCR-NEXT: scratch_load_dword v2, off, s6 offset:68
				; FLATSCR-NEXT: s_movk_i32 s6, 0x3000
				; FLATSCR-NEXT: scratch_load_dword v3, off, s6 offset:64
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: v_add_co_u32_e32 v0, vcc, v0, v3
				; FLATSCR-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v2, vcc
				; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
				; FLATSCR-NEXT: v_mov_b32_e32 v2, s4
				; FLATSCR-NEXT: v_mov_b32_e32 v3, s5
				; FLATSCR-NEXT: global_store_dwordx2 v[2:3], v[0:1], off
				; FLATSCR-NEXT: s_endpgm
	entry:			entry:
	%pin.low = alloca i32, align 8192, addrspace(5)			%pin.low = alloca i32, align 8192, addrspace(5)
	%local.area = alloca [1060 x i64], align 4096, addrspace(5)			%local.area = alloca [1060 x i64], align 4096, addrspace(5)
	store volatile i32 0, i32 addrspace(5)* %pin.low			store volatile i32 0, i32 addrspace(5)* %pin.low
	%local.area.cast = bitcast [1060 x i64] addrspace(5)* %local.area to i8 addrspace(5)*			%local.area.cast = bitcast [1060 x i64] addrspace(5)* %local.area to i8 addrspace(5)*
	call void @llvm.memset.p5i8.i32(i8 addrspace(5)* align 4 %local.area.cast, i8 0, i32 8480, i1 true)			call void @llvm.memset.p5i8.i32(i8 addrspace(5)* align 4 %local.area.cast, i8 0, i32 8480, i1 true)
	%gep.large.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 1050			%gep.large.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 1050
	%gep.small.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 8			%gep.small.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 8
	%load0 = load volatile i64, i64 addrspace(5)* %gep.large.offset			%load0 = load volatile i64, i64 addrspace(5)* %gep.large.offset
	%load1 = load volatile i64, i64 addrspace(5)* %gep.small.offset			%load1 = load volatile i64, i64 addrspace(5)* %gep.small.offset
	%add0 = add i64 %load0, %load1			%add0 = add i64 %load0, %load1
	store volatile i64 %add0, i64 addrspace(1)* %out			store volatile i64 %add0, i64 addrspace(1)* %out
	ret void			ret void
	}			}
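
In the function variant that follows, the MUBUF and FLATSCR checks use different constants for the same frame setup, which reflects the change of SP from unswizzled to swizzled described in the summary. The sketch below relates the two sets of constants, assuming a wavefront size of 64 (consistent with the `v_lshrrev_b32_e64 v3, 6, s33` in the MUBUF output); the helper names are hypothetical.

```python
WAVE_SIZE_LOG2 = 6  # assumption: wave64 on gfx906

# Hypothetical helper: convert an unswizzled (whole-wave) byte amount
# into the swizzled (per-lane) amount.
def to_swizzled(unswizzled):
    return unswizzled >> WAVE_SIZE_LOG2

# Hypothetical helper: 32-bit mask that clears the low bits of an
# address for a power-of-two alignment.
def align_mask(alignment):
    return (-alignment) & 0xFFFFFFFF

# Constants taken from the MUBUF and FLATSCR check lines.
mubuf = {"round_up_add": 0x7FFC0, "sp_bump": 0x180000}
flatscr = {"round_up_add": 0x1FFF, "sp_bump": 0x6000}

scaled = {name: to_swizzled(value) for name, value in mubuf.items()}
print(scaled == flatscr)

# The s_and_b32 masks correspond to the 8192-byte alignment of %pin.low,
# scaled by the wave size in the MUBUF case.
print(hex(align_mask(8192 << WAVE_SIZE_LOG2)), hex(align_mask(8192)))
```

Read this way, the FLATSCR prologue performs the same realignment and stack bump as the MUBUF one, with every amount divided by the assumed wave size.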

	define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {			define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {
	; GCN-LABEL: func_local_stack_offset_uses_sp:			; MUBUF-LABEL: func_local_stack_offset_uses_sp:
	; GCN: ; %bb.0: ; %entry			; MUBUF: ; %bb.0: ; %entry
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: s_add_u32 s4, s32, 0x7ffc0			; MUBUF-NEXT: s_add_u32 s4, s32, 0x7ffc0
	; GCN-NEXT: s_mov_b32 s5, s33			; MUBUF-NEXT: s_mov_b32 s5, s33
	; GCN-NEXT: s_and_b32 s33, s4, 0xfff80000			; MUBUF-NEXT: s_and_b32 s33, s4, 0xfff80000
	; GCN-NEXT: v_lshrrev_b32_e64 v3, 6, s33			; MUBUF-NEXT: v_lshrrev_b32_e64 v3, 6, s33
	; GCN-NEXT: v_add_u32_e32 v3, 0x1000, v3			; MUBUF-NEXT: v_add_u32_e32 v3, 0x1000, v3
	; GCN-NEXT: v_mov_b32_e32 v4, 0			; MUBUF-NEXT: v_mov_b32_e32 v4, 0
	; GCN-NEXT: v_add_u32_e32 v2, 64, v3			; MUBUF-NEXT: v_add_u32_e32 v2, 64, v3
	; GCN-NEXT: s_mov_b32 s4, 0			; MUBUF-NEXT: s_mov_b32 s4, 0
	; GCN-NEXT: s_add_u32 s32, s32, 0x180000			; MUBUF-NEXT: s_add_u32 s32, s32, 0x180000
	; GCN-NEXT: buffer_store_dword v4, off, s[0:3], s33			; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s33
	; GCN-NEXT: BB1_1: ; %loadstoreloop			; MUBUF-NEXT: BB1_1: ; %loadstoreloop
	; GCN-NEXT: ; =>This Inner Loop Header: Depth=1			; MUBUF-NEXT: ; =>This Inner Loop Header: Depth=1
	; GCN-NEXT: v_add_u32_e32 v5, s4, v3			; MUBUF-NEXT: v_add_u32_e32 v5, s4, v3
	; GCN-NEXT: s_add_i32 s4, s4, 1			; MUBUF-NEXT: s_add_i32 s4, s4, 1
	; GCN-NEXT: s_cmpk_lt_u32 s4, 0x2120			; MUBUF-NEXT: s_cmpk_lt_u32 s4, 0x2120
	; GCN-NEXT: buffer_store_byte v4, v5, s[0:3], 0 offen			; MUBUF-NEXT: buffer_store_byte v4, v5, s[0:3], 0 offen
	; GCN-NEXT: s_cbranch_scc1 BB1_1			; MUBUF-NEXT: s_cbranch_scc1 BB1_1
	; GCN-NEXT: ; %bb.2: ; %split			; MUBUF-NEXT: ; %bb.2: ; %split
	; GCN-NEXT: v_lshrrev_b32_e64 v3, 6, s33			; MUBUF-NEXT: v_lshrrev_b32_e64 v3, 6, s33
	; GCN-NEXT: v_add_u32_e32 v3, 0x1000, v3			; MUBUF-NEXT: v_add_u32_e32 v3, 0x1000, v3
	; GCN-NEXT: v_add_u32_e32 v3, 0x20d0, v3			; MUBUF-NEXT: v_add_u32_e32 v3, 0x20d0, v3
	; GCN-NEXT: buffer_load_dword v4, v3, s[0:3], 0 offen			; MUBUF-NEXT: buffer_load_dword v4, v3, s[0:3], 0 offen
	; GCN-NEXT: buffer_load_dword v3, v3, s[0:3], 0 offen offset:4			; MUBUF-NEXT: buffer_load_dword v3, v3, s[0:3], 0 offen offset:4
	; GCN-NEXT: buffer_load_dword v5, v2, s[0:3], 0 offen			; MUBUF-NEXT: buffer_load_dword v5, v2, s[0:3], 0 offen
	; GCN-NEXT: buffer_load_dword v6, v2, s[0:3], 0 offen offset:4			; MUBUF-NEXT: buffer_load_dword v6, v2, s[0:3], 0 offen offset:4
	; GCN-NEXT: s_sub_u32 s32, s32, 0x180000			; MUBUF-NEXT: s_sub_u32 s32, s32, 0x180000
	; GCN-NEXT: s_mov_b32 s33, s5			; MUBUF-NEXT: s_mov_b32 s33, s5
	; GCN-NEXT: s_waitcnt vmcnt(1)			; MUBUF-NEXT: s_waitcnt vmcnt(1)
	; GCN-NEXT: v_add_co_u32_e32 v2, vcc, v4, v5			; MUBUF-NEXT: v_add_co_u32_e32 v2, vcc, v4, v5
	; GCN-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_addc_co_u32_e32 v3, vcc, v3, v6, vcc			; MUBUF-NEXT: v_addc_co_u32_e32 v3, vcc, v3, v6, vcc
	; GCN-NEXT: global_store_dwordx2 v[0:1], v[2:3], off			; MUBUF-NEXT: global_store_dwordx2 v[0:1], v[2:3], off
	; GCN-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: s_setpc_b64 s[30:31]			; MUBUF-NEXT: s_setpc_b64 s[30:31]
				;
				; FLATSCR-LABEL: func_local_stack_offset_uses_sp:
				; FLATSCR: ; %bb.0: ; %entry
				; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; FLATSCR-NEXT: s_add_u32 s4, s32, 0x1fff
				; FLATSCR-NEXT: s_mov_b32 s6, s33
				; FLATSCR-NEXT: s_and_b32 s33, s4, 0xffffe000
				; FLATSCR-NEXT: v_mov_b32_e32 v2, 0
				; FLATSCR-NEXT: s_mov_b32 s4, 0
				; FLATSCR-NEXT: s_add_u32 s32, s32, 0x6000
				; FLATSCR-NEXT: scratch_store_dword off, v2, s33
				; FLATSCR-NEXT: BB1_1: ; %loadstoreloop
				; FLATSCR-NEXT: ; =>This Inner Loop Header: Depth=1
				; FLATSCR-NEXT: s_add_u32 vcc_hi, s33, 0x1000
				; FLATSCR-NEXT: s_add_u32 s5, vcc_hi, s4
				; FLATSCR-NEXT: s_add_i32 s4, s4, 1
				; FLATSCR-NEXT: s_cmpk_lt_u32 s4, 0x2120
				; FLATSCR-NEXT: scratch_store_byte off, v2, s5
				; FLATSCR-NEXT: s_cbranch_scc1 BB1_1
				; FLATSCR-NEXT: ; %bb.2: ; %split
				; FLATSCR-NEXT: s_movk_i32 s4, 0x20d0
				; FLATSCR-NEXT: s_add_u32 s5, s33, 0x1000
				; FLATSCR-NEXT: s_add_u32 s4, s5, s4
				; FLATSCR-NEXT: scratch_load_dword v3, off, s4 offset:4
				; FLATSCR-NEXT: s_movk_i32 s4, 0x2000
				; FLATSCR-NEXT: s_add_u32 s5, s33, 0x1000
				; FLATSCR-NEXT: s_add_u32 s4, s5, s4
				; FLATSCR-NEXT: scratch_load_dword v2, off, s4 offset:208
				; FLATSCR-NEXT: s_add_u32 s4, s33, 0x1000
				; FLATSCR-NEXT: scratch_load_dword v4, off, s4 offset:68
				; FLATSCR-NEXT: s_add_u32 s4, s33, 0x1000
				; FLATSCR-NEXT: scratch_load_dword v5, off, s4 offset:64
				; FLATSCR-NEXT: s_sub_u32 s32, s32, 0x6000
				; FLATSCR-NEXT: s_mov_b32 s33, s6
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: v_add_co_u32_e32 v2, vcc, v2, v5
				; FLATSCR-NEXT: v_addc_co_u32_e32 v3, vcc, v3, v4, vcc
				; FLATSCR-NEXT: global_store_dwordx2 v[0:1], v[2:3], off
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%pin.low = alloca i32, align 8192, addrspace(5)			%pin.low = alloca i32, align 8192, addrspace(5)
	%local.area = alloca [1060 x i64], align 4096, addrspace(5)			%local.area = alloca [1060 x i64], align 4096, addrspace(5)
	store volatile i32 0, i32 addrspace(5)* %pin.low			store volatile i32 0, i32 addrspace(5)* %pin.low
	%local.area.cast = bitcast [1060 x i64] addrspace(5)* %local.area to i8 addrspace(5)*			%local.area.cast = bitcast [1060 x i64] addrspace(5)* %local.area to i8 addrspace(5)*
	call void @llvm.memset.p5i8.i32(i8 addrspace(5)* align 4 %local.area.cast, i8 0, i32 8480, i1 true)			call void @llvm.memset.p5i8.i32(i8 addrspace(5)* align 4 %local.area.cast, i8 0, i32 8480, i1 true)
	%gep.large.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 1050			%gep.large.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 1050
	%gep.small.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 8			%gep.small.offset = getelementptr inbounds [1060 x i64], [1060 x i64] addrspace(5)* %local.area, i64 0, i64 8
	[10 unchanged lines not shown]
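
The FLATSCR offsets in `func_local_stack_offset_uses_sp` can be cross-checked against the IR by hand. The following is a minimal sketch of that arithmetic, not part of the patch: the constants are copied from the check lines and IR above, while the helper name `byte_offset` and the dictionary layout are hypothetical and exist only for illustration. It assumes, as the check lines suggest, that `%local.area` sits at `s33 + 0x1000`.

```python
# Sanity check of the scratch offsets seen in the FLATSCR check lines.
# All constants are copied from the test; byte_offset() is a hypothetical helper.

ELEM_SIZE = 8         # bytes per i64 element of [1060 x i64]
NUM_ELEMS = 1060      # array length of %local.area
LARGE_INDEX = 1050    # index used by %gep.large.offset
SMALL_INDEX = 8       # index used by %gep.small.offset
LOCAL_AREA_BASE = 0x1000  # assumption: %local.area is at s33 + 0x1000


def byte_offset(index, elem_size=ELEM_SIZE):
    """Byte offset of element `index` from the start of the array."""
    return index * elem_size


offsets = {
    # s_cmpk_lt_u32 s4, 0x2120 bounds the byte-wise memset loop.
    "memset_bytes": NUM_ELEMS * ELEM_SIZE,
    # Low dword of the large element: s_movk_i32 s4, 0x2000 ... offset:208
    "large_lo": 0x2000 + 208,
    # High dword of the large element: s_movk_i32 s4, 0x20d0 ... offset:4
    "large_hi": 0x20D0 + 4,
    # Small element: offset:64 (low dword) and offset:68 (high dword)
    "small_lo": 64,
    "small_hi": 68,
}

if __name__ == "__main__":
    for name, value in offsets.items():
        print(f"{name}: {value:#x} (s33-relative: {LOCAL_AREA_BASE + value:#x})")
```

Both encodings of the large offset (`0x20d0` with `offset:4`, and `0x2000` with `offset:208`) address the same 8-byte element; the sketch only confirms the arithmetic and does not explain why the two loads split the constant differently.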

llvm/test/CodeGen/AMDGPU/memcpy-fixed-align.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s			; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck %s -check-prefix=MUBUF
				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -amdgpu-enable-flat-scratch < %s | FileCheck %s -check-prefix=FLATSCR

	; Make sure there's no assertion from passing a 0 alignment value			; Make sure there's no assertion from passing a 0 alignment value
	define void @memcpy_fixed_align(i8 addrspace(5)* %dst, i8 addrspace(1)* %src) {			define void @memcpy_fixed_align(i8 addrspace(5)* %dst, i8 addrspace(1)* %src) {
	; CHECK-LABEL: memcpy_fixed_align:			; MUBUF-LABEL: memcpy_fixed_align:
	; CHECK: ; %bb.0:			; MUBUF: ; %bb.0:
	; CHECK-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; CHECK-NEXT: global_load_dword v0, v[1:2], off offset:36			; MUBUF-NEXT: global_load_dword v0, v[1:2], off offset:36
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:36			; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:36
	; CHECK-NEXT: global_load_dword v0, v[1:2], off offset:32			; MUBUF-NEXT: global_load_dword v0, v[1:2], off offset:32
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:32			; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:32
	; CHECK-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16			; MUBUF-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:28			; MUBUF-NEXT: buffer_store_dword v6, off, s[0:3], s32 offset:28
	; CHECK-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:24			; MUBUF-NEXT: buffer_store_dword v5, off, s[0:3], s32 offset:24
	; CHECK-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:20			; MUBUF-NEXT: buffer_store_dword v4, off, s[0:3], s32 offset:20
	; CHECK-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:16			; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:16
	; CHECK-NEXT: global_load_dwordx4 v[0:3], v[1:2], off			; MUBUF-NEXT: global_load_dwordx4 v[0:3], v[1:2], off
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:12			; MUBUF-NEXT: buffer_store_dword v3, off, s[0:3], s32 offset:12
	; CHECK-NEXT: buffer_store_dword v2, off, s[0:3], s32 offset:8			; MUBUF-NEXT: buffer_store_dword v2, off, s[0:3], s32 offset:8
	; CHECK-NEXT: buffer_store_dword v1, off, s[0:3], s32 offset:4			; MUBUF-NEXT: buffer_store_dword v1, off, s[0:3], s32 offset:4
	; CHECK-NEXT: buffer_store_dword v0, off, s[0:3], s32			; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32
	; CHECK-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; CHECK-NEXT: s_setpc_b64 s[30:31]			; MUBUF-NEXT: s_setpc_b64 s[30:31]
				;
				; FLATSCR-LABEL: memcpy_fixed_align:
				; FLATSCR: ; %bb.0:
				; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; FLATSCR-NEXT: global_load_dword v0, v[1:2], off offset:36
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: scratch_store_dword off, v0, s32 offset:36
				; FLATSCR-NEXT: global_load_dword v0, v[1:2], off offset:32
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: scratch_store_dword off, v0, s32 offset:32
				; FLATSCR-NEXT: global_load_dwordx4 v[3:6], v[1:2], off offset:16
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: scratch_store_dword off, v6, s32 offset:28
				; FLATSCR-NEXT: scratch_store_dword off, v5, s32 offset:24
				; FLATSCR-NEXT: scratch_store_dword off, v4, s32 offset:20
				; FLATSCR-NEXT: scratch_store_dword off, v3, s32 offset:16
				; FLATSCR-NEXT: global_load_dwordx4 v[0:3], v[1:2], off
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: scratch_store_dword off, v3, s32 offset:12
				; FLATSCR-NEXT: scratch_store_dword off, v2, s32 offset:8
				; FLATSCR-NEXT: scratch_store_dword off, v1, s32 offset:4
				; FLATSCR-NEXT: scratch_store_dword off, v0, s32
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: s_setpc_b64 s[30:31]
	%alloca = alloca [40 x i8], addrspace(5)			%alloca = alloca [40 x i8], addrspace(5)
	%cast = bitcast [40 x i8] addrspace(5)* %alloca to i8 addrspace(5)*			%cast = bitcast [40 x i8] addrspace(5)* %alloca to i8 addrspace(5)*
	call void @llvm.memcpy.p5i8.p1i8.i64(i8 addrspace(5)* align 4 dereferenceable(40) %cast, i8 addrspace(1)* align 4 dereferenceable(40) %src, i64 40, i1 false)			call void @llvm.memcpy.p5i8.p1i8.i64(i8 addrspace(5)* align 4 dereferenceable(40) %cast, i8 addrspace(1)* align 4 dereferenceable(40) %src, i64 40, i1 false)
	ret void			ret void
	}			}

	declare void @llvm.memcpy.p5i8.p1i8.i64(i8 addrspace(5)* noalias nocapture writeonly, i8 addrspace(1)* noalias nocapture readonly, i64, i1 immarg) #0			declare void @llvm.memcpy.p5i8.p1i8.i64(i8 addrspace(5)* noalias nocapture writeonly, i8 addrspace(1)* noalias nocapture readonly, i64, i1 immarg) #0

	attributes #0 = { argmemonly nounwind willreturn }			attributes #0 = { argmemonly nounwind willreturn }
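
In this test the single `CHECK` prefix is split into `MUBUF` and `FLATSCR`, one per RUN line. As a rough illustration of how a prefix selects check lines, here is a simplified sketch; it is not how FileCheck is implemented, and the names `DIRECTIVE` and `active_checks` are hypothetical. It only recognizes directives that follow a leading `;` and handles a handful of suffixes.

```python
import re

# Simplified model of prefix selection; real FileCheck is more general.
SUFFIXES = ("NEXT", "LABEL", "NOT", "SAME", "DAG", "EMPTY")
DIRECTIVE = re.compile(
    r"\s*;\s*([A-Za-z0-9_-]+?)(?:-(%s))?:" % "|".join(SUFFIXES)
)


def active_checks(lines, prefixes):
    """Return the check lines whose prefix is in `prefixes`."""
    selected = []
    for line in lines:
        match = DIRECTIVE.match(line)
        if match and match.group(1) in prefixes:
            selected.append(line.strip())
    return selected


# Lines copied from memcpy-fixed-align.ll above.
sample = [
    "; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py",
    "; MUBUF-LABEL: memcpy_fixed_align:",
    "; MUBUF-NEXT: buffer_store_dword v0, off, s[0:3], s32 offset:36",
    "; FLATSCR-LABEL: memcpy_fixed_align:",
    "; FLATSCR-NEXT: scratch_store_dword off, v0, s32 offset:36",
]

if __name__ == "__main__":
    for line in active_checks(sample, {"FLATSCR"}):
        print(line)
```

Under this model, the run with `-amdgpu-enable-flat-scratch` only sees the `FLATSCR` lines, so the `buffer_store_dword` expectations are ignored there.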

llvm/test/CodeGen/AMDGPU/non-entry-alloca.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DEFAULTSIZE %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,DEFAULTSIZE,MUBUF %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-assume-dynamic-stack-object-size=1024 < %s | FileCheck -check-prefixes=GCN,ASSUME1024 %s		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-assume-dynamic-stack-object-size=1024 < %s | FileCheck -check-prefixes=GCN,ASSUME1024,MUBUF %s
		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-enable-flat-scratch < %s | FileCheck -check-prefixes=GCN,DEFAULTSIZE,FLATSCR %s
		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs -amdgpu-enable-flat-scratch -amdgpu-assume-dynamic-stack-object-size=1024 < %s | FileCheck -check-prefixes=GCN,ASSUME1024,FLATSCR %s

; FIXME: Generated test checks do not check metadata at the end of the		; FIXME: Generated test checks do not check metadata at the end of the
; function, so this also includes manually added checks.		; function, so this also includes manually added checks.

; Test that we can select a statically sized alloca outside of the		; Test that we can select a statically sized alloca outside of the
; entry block.		; entry block.

; FIXME: FunctionLoweringInfo unhelpfully doesn't preserve an		; FIXME: FunctionLoweringInfo unhelpfully doesn't preserve an
; alignment less than the stack alignment.		; alignment less than the stack alignment.
define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {		define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {
; GCN-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align4:		; MUBUF-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align4:
; GCN: ; %bb.0: ; %entry		; MUBUF: ; %bb.0: ; %entry
; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9		; MUBUF-NEXT: s_add_u32 flat_scratch_lo, s6, s9
; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0		; MUBUF-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
; GCN-NEXT: s_add_u32 s0, s0, s9		; MUBUF-NEXT: s_add_u32 s0, s0, s9
; GCN-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x8		; MUBUF-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x8
; GCN-NEXT: s_addc_u32 s1, s1, 0		; MUBUF-NEXT: s_addc_u32 s1, s1, 0
; GCN-NEXT: s_movk_i32 s32, 0x400		; MUBUF-NEXT: s_movk_i32 s32, 0x400
; GCN-NEXT: s_mov_b32 s33, 0		; MUBUF-NEXT: s_mov_b32 s33, 0
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: s_cmp_lg_u32 s8, 0		; MUBUF-NEXT: s_cmp_lg_u32 s8, 0
; GCN-NEXT: s_cbranch_scc1 BB0_3		; MUBUF-NEXT: s_cbranch_scc1 BB0_3
; GCN-NEXT: ; %bb.1: ; %bb.0		; MUBUF-NEXT: ; %bb.1: ; %bb.0
; GCN-NEXT: s_cmp_lg_u32 s9, 0		; MUBUF-NEXT: s_cmp_lg_u32 s9, 0
; GCN-NEXT: s_cbranch_scc1 BB0_3		; MUBUF-NEXT: s_cbranch_scc1 BB0_3
; GCN-NEXT: ; %bb.2: ; %bb.1		; MUBUF-NEXT: ; %bb.2: ; %bb.1
; GCN-NEXT: s_add_i32 s6, s32, 0x1000		; MUBUF-NEXT: s_add_i32 s6, s32, 0x1000
; GCN-NEXT: v_mov_b32_e32 v1, 0		; MUBUF-NEXT: v_mov_b32_e32 v1, 0
; GCN-NEXT: v_mov_b32_e32 v2, s6		; MUBUF-NEXT: v_mov_b32_e32 v2, s6
; GCN-NEXT: s_lshl_b32 s7, s10, 2		; MUBUF-NEXT: s_lshl_b32 s7, s10, 2
; GCN-NEXT: s_mov_b32 s32, s6		; MUBUF-NEXT: s_mov_b32 s32, s6
; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen		; MUBUF-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen
; GCN-NEXT: v_mov_b32_e32 v1, 1		; MUBUF-NEXT: v_mov_b32_e32 v1, 1
; GCN-NEXT: s_add_i32 s6, s6, s7		; MUBUF-NEXT: s_add_i32 s6, s6, s7
; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen offset:4		; MUBUF-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen offset:4
; GCN-NEXT: v_mov_b32_e32 v1, s6		; MUBUF-NEXT: v_mov_b32_e32 v1, s6
; GCN-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen		; MUBUF-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen
; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0		; MUBUF-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_add_u32_e32 v2, v1, v0		; MUBUF-NEXT: v_add_u32_e32 v2, v1, v0
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, s4		; MUBUF-NEXT: v_mov_b32_e32 v0, s4
; GCN-NEXT: v_mov_b32_e32 v1, s5		; MUBUF-NEXT: v_mov_b32_e32 v1, s5
; GCN-NEXT: global_store_dword v[0:1], v2, off		; MUBUF-NEXT: global_store_dword v[0:1], v2, off
; GCN-NEXT: BB0_3: ; %bb.2		; MUBUF-NEXT: BB0_3: ; %bb.2
; GCN-NEXT: v_mov_b32_e32 v0, 0		; MUBUF-NEXT: v_mov_b32_e32 v0, 0
; GCN-NEXT: global_store_dword v[0:1], v0, off		; MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_endpgm		; MUBUF-NEXT: s_endpgm
		;
		; FLATSCR-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align4:
		; FLATSCR: ; %bb.0: ; %entry
		; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s6, s9
		; FLATSCR-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x8
		; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
		; FLATSCR-NEXT: s_mov_b32 s32, 16
		; FLATSCR-NEXT: s_mov_b32 s33, 0
		; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
		; FLATSCR-NEXT: s_cmp_lg_u32 s8, 0
		; FLATSCR-NEXT: s_cbranch_scc1 BB0_3
		; FLATSCR-NEXT: ; %bb.1: ; %bb.0
		; FLATSCR-NEXT: s_cmp_lg_u32 s9, 0
		; FLATSCR-NEXT: s_cbranch_scc1 BB0_3
		; FLATSCR-NEXT: ; %bb.2: ; %bb.1
		; FLATSCR-NEXT: s_mov_b32 s6, s32
		; FLATSCR-NEXT: s_movk_i32 s7, 0x1000
		; FLATSCR-NEXT: s_add_i32 s8, s6, s7
		; FLATSCR-NEXT: s_add_u32 s6, s6, s7
		; FLATSCR-NEXT: v_mov_b32_e32 v1, 0
		; FLATSCR-NEXT: scratch_store_dword off, v1, s6
		; FLATSCR-NEXT: v_mov_b32_e32 v1, 1
		; FLATSCR-NEXT: s_lshl_b32 s6, s10, 2
		; FLATSCR-NEXT: s_mov_b32 s32, s8
		; FLATSCR-NEXT: scratch_store_dword off, v1, s8 offset:4
		; FLATSCR-NEXT: s_add_i32 s8, s8, s6
		; FLATSCR-NEXT: scratch_load_dword v1, off, s8
		; FLATSCR-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_add_u32_e32 v2, v1, v0
		; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v0, s4
		; FLATSCR-NEXT: v_mov_b32_e32 v1, s5
		; FLATSCR-NEXT: global_store_dword v[0:1], v2, off
		; FLATSCR-NEXT: BB0_3: ; %bb.2
		; FLATSCR-NEXT: v_mov_b32_e32 v0, 0
		; FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; FLATSCR-NEXT: s_endpgm

entry:		entry:
%cond0 = icmp eq i32 %arg.cond0, 0		%cond0 = icmp eq i32 %arg.cond0, 0
br i1 %cond0, label %bb.0, label %bb.2		br i1 %cond0, label %bb.0, label %bb.2

bb.0:		bb.0:
%alloca = alloca [16 x i32], align 4, addrspace(5)		%alloca = alloca [16 x i32], align 4, addrspace(5)
%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0		%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
[18 unchanged lines not shown]
}		}
; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4112		; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4112
; DEFAULTSIZE: ; ScratchSize: 4112		; DEFAULTSIZE: ; ScratchSize: 4112

; ASSUME1024: .amdhsa_private_segment_fixed_size 1040		; ASSUME1024: .amdhsa_private_segment_fixed_size 1040
; ASSUME1024: ; ScratchSize: 1040		; ASSUME1024: ; ScratchSize: 1040

define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {		define amdgpu_kernel void @kernel_non_entry_block_static_alloca_uniformly_reached_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {
; GCN-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align64:		; MUBUF-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align64:
; GCN: ; %bb.0: ; %entry		; MUBUF: ; %bb.0: ; %entry
; GCN-NEXT: s_add_u32 flat_scratch_lo, s6, s9		; MUBUF-NEXT: s_add_u32 flat_scratch_lo, s6, s9
; GCN-NEXT: s_addc_u32 flat_scratch_hi, s7, 0		; MUBUF-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
; GCN-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x8		; MUBUF-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x8
; GCN-NEXT: s_add_u32 s0, s0, s9		; MUBUF-NEXT: s_add_u32 s0, s0, s9
; GCN-NEXT: s_addc_u32 s1, s1, 0		; MUBUF-NEXT: s_addc_u32 s1, s1, 0
; GCN-NEXT: s_movk_i32 s32, 0x1000		; MUBUF-NEXT: s_movk_i32 s32, 0x1000
; GCN-NEXT: s_mov_b32 s33, 0		; MUBUF-NEXT: s_mov_b32 s33, 0
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: s_cmp_lg_u32 s6, 0		; MUBUF-NEXT: s_cmp_lg_u32 s6, 0
; GCN-NEXT: s_cbranch_scc1 BB1_2		; MUBUF-NEXT: s_cbranch_scc1 BB1_2
; GCN-NEXT: ; %bb.1: ; %bb.0		; MUBUF-NEXT: ; %bb.1: ; %bb.0
; GCN-NEXT: s_add_i32 s6, s32, 0x1000		; MUBUF-NEXT: s_add_i32 s6, s32, 0x1000
; GCN-NEXT: s_and_b32 s6, s6, 0xfffff000		; MUBUF-NEXT: s_and_b32 s6, s6, 0xfffff000
; GCN-NEXT: v_mov_b32_e32 v1, 0		; MUBUF-NEXT: v_mov_b32_e32 v1, 0
; GCN-NEXT: v_mov_b32_e32 v2, s6		; MUBUF-NEXT: v_mov_b32_e32 v2, s6
; GCN-NEXT: s_lshl_b32 s7, s7, 2		; MUBUF-NEXT: s_lshl_b32 s7, s7, 2
; GCN-NEXT: s_mov_b32 s32, s6		; MUBUF-NEXT: s_mov_b32 s32, s6
; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen		; MUBUF-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen
; GCN-NEXT: v_mov_b32_e32 v1, 1		; MUBUF-NEXT: v_mov_b32_e32 v1, 1
; GCN-NEXT: s_add_i32 s6, s6, s7		; MUBUF-NEXT: s_add_i32 s6, s6, s7
; GCN-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen offset:4		; MUBUF-NEXT: buffer_store_dword v1, v2, s[0:3], 0 offen offset:4
; GCN-NEXT: v_mov_b32_e32 v1, s6		; MUBUF-NEXT: v_mov_b32_e32 v1, s6
; GCN-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen		; MUBUF-NEXT: buffer_load_dword v1, v1, s[0:3], 0 offen
; GCN-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0		; MUBUF-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_add_u32_e32 v2, v1, v0		; MUBUF-NEXT: v_add_u32_e32 v2, v1, v0
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, s4		; MUBUF-NEXT: v_mov_b32_e32 v0, s4
; GCN-NEXT: v_mov_b32_e32 v1, s5		; MUBUF-NEXT: v_mov_b32_e32 v1, s5
; GCN-NEXT: global_store_dword v[0:1], v2, off		; MUBUF-NEXT: global_store_dword v[0:1], v2, off
; GCN-NEXT: BB1_2: ; %bb.1		; MUBUF-NEXT: BB1_2: ; %bb.1
; GCN-NEXT: v_mov_b32_e32 v0, 0		; MUBUF-NEXT: v_mov_b32_e32 v0, 0
; GCN-NEXT: global_store_dword v[0:1], v0, off		; MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_endpgm		; MUBUF-NEXT: s_endpgm
		;
		; FLATSCR-LABEL: kernel_non_entry_block_static_alloca_uniformly_reached_align64:
		; FLATSCR: ; %bb.0: ; %entry
		; FLATSCR-NEXT: s_add_u32 flat_scratch_lo, s6, s9
		; FLATSCR-NEXT: s_addc_u32 flat_scratch_hi, s7, 0
		; FLATSCR-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x8
		; FLATSCR-NEXT: s_mov_b32 s32, 64
		; FLATSCR-NEXT: s_mov_b32 s33, 0
		; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
		; FLATSCR-NEXT: s_cmp_lg_u32 s6, 0
		; FLATSCR-NEXT: s_cbranch_scc1 BB1_2
		; FLATSCR-NEXT: ; %bb.1: ; %bb.0
		; FLATSCR-NEXT: s_add_i32 s6, s32, 0x1000
		; FLATSCR-NEXT: s_and_b32 s6, s6, 0xfffff000
		; FLATSCR-NEXT: v_mov_b32_e32 v1, 0
		; FLATSCR-NEXT: scratch_store_dword off, v1, s6
		; FLATSCR-NEXT: v_mov_b32_e32 v1, 1
		; FLATSCR-NEXT: s_lshl_b32 s7, s7, 2
		; FLATSCR-NEXT: s_mov_b32 s32, s6
		; FLATSCR-NEXT: scratch_store_dword off, v1, s6 offset:4
		; FLATSCR-NEXT: s_add_i32 s6, s6, s7
		; FLATSCR-NEXT: scratch_load_dword v1, off, s6
		; FLATSCR-NEXT: s_load_dwordx2 s[4:5], s[4:5], 0x0
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_add_u32_e32 v2, v1, v0
		; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
		; FLATSCR-NEXT: v_mov_b32_e32 v0, s4
		; FLATSCR-NEXT: v_mov_b32_e32 v1, s5
		; FLATSCR-NEXT: global_store_dword v[0:1], v2, off
		; FLATSCR-NEXT: BB1_2: ; %bb.1
		; FLATSCR-NEXT: v_mov_b32_e32 v0, 0
		; FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; FLATSCR-NEXT: s_endpgm
entry:		entry:
%cond = icmp eq i32 %arg.cond, 0		%cond = icmp eq i32 %arg.cond, 0
br i1 %cond, label %bb.0, label %bb.1		br i1 %cond, label %bb.0, label %bb.1

bb.0:		bb.0:
%alloca = alloca [16 x i32], align 64, addrspace(5)		%alloca = alloca [16 x i32], align 64, addrspace(5)
%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0		%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1		%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
[14 unchanged lines not shown]
; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4160		; DEFAULTSIZE: .amdhsa_private_segment_fixed_size 4160
; DEFAULTSIZE: ; ScratchSize: 4160		; DEFAULTSIZE: ; ScratchSize: 4160

; ASSUME1024: .amdhsa_private_segment_fixed_size 1088		; ASSUME1024: .amdhsa_private_segment_fixed_size 1088
; ASSUME1024: ; ScratchSize: 1088		; ASSUME1024: ; ScratchSize: 1088


define void @func_non_entry_block_static_alloca_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {		define void @func_non_entry_block_static_alloca_align4(i32 addrspace(1)* %out, i32 %arg.cond0, i32 %arg.cond1, i32 %in) {
; GCN-LABEL: func_non_entry_block_static_alloca_align4:		; MUBUF-LABEL: func_non_entry_block_static_alloca_align4:
; GCN: ; %bb.0: ; %entry		; MUBUF: ; %bb.0: ; %entry
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: s_mov_b32 s7, s33		; MUBUF-NEXT: s_mov_b32 s7, s33
; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2		; MUBUF-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
; GCN-NEXT: s_mov_b32 s33, s32		; MUBUF-NEXT: s_mov_b32 s33, s32
; GCN-NEXT: s_add_u32 s32, s32, 0x400		; MUBUF-NEXT: s_add_u32 s32, s32, 0x400
; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc		; MUBUF-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GCN-NEXT: s_cbranch_execz BB2_3		; MUBUF-NEXT: s_cbranch_execz BB2_3
; GCN-NEXT: ; %bb.1: ; %bb.0		; MUBUF-NEXT: ; %bb.1: ; %bb.0
; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v3		; MUBUF-NEXT: v_cmp_eq_u32_e32 vcc, 0, v3
; GCN-NEXT: s_and_b64 exec, exec, vcc		; MUBUF-NEXT: s_and_b64 exec, exec, vcc
; GCN-NEXT: s_cbranch_execz BB2_3		; MUBUF-NEXT: s_cbranch_execz BB2_3
; GCN-NEXT: ; %bb.2: ; %bb.1		; MUBUF-NEXT: ; %bb.2: ; %bb.1
; GCN-NEXT: s_add_i32 s6, s32, 0x1000		; MUBUF-NEXT: s_add_i32 s6, s32, 0x1000
; GCN-NEXT: v_mov_b32_e32 v2, 0		; MUBUF-NEXT: v_mov_b32_e32 v2, 0
; GCN-NEXT: v_mov_b32_e32 v3, s6		; MUBUF-NEXT: v_mov_b32_e32 v3, s6
; GCN-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen		; MUBUF-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen
; GCN-NEXT: v_mov_b32_e32 v2, 1		; MUBUF-NEXT: v_mov_b32_e32 v2, 1
; GCN-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen offset:4		; MUBUF-NEXT: buffer_store_dword v2, v3, s[0:3], 0 offen offset:4
; GCN-NEXT: v_lshl_add_u32 v2, v4, 2, s6		; MUBUF-NEXT: v_lshl_add_u32 v2, v4, 2, s6
; GCN-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen		; MUBUF-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen
; GCN-NEXT: v_and_b32_e32 v3, 0x3ff, v5		; MUBUF-NEXT: v_and_b32_e32 v3, 0x3ff, v5
; GCN-NEXT: s_mov_b32 s32, s6		; MUBUF-NEXT: s_mov_b32 s32, s6
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_add_u32_e32 v2, v2, v3		; MUBUF-NEXT: v_add_u32_e32 v2, v2, v3
; GCN-NEXT: global_store_dword v[0:1], v2, off		; MUBUF-NEXT: global_store_dword v[0:1], v2, off
; GCN-NEXT: BB2_3: ; %bb.2		; MUBUF-NEXT: BB2_3: ; %bb.2
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]		; MUBUF-NEXT: s_or_b64 exec, exec, s[4:5]
; GCN-NEXT: v_mov_b32_e32 v0, 0		; MUBUF-NEXT: v_mov_b32_e32 v0, 0
; GCN-NEXT: global_store_dword v[0:1], v0, off		; MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_sub_u32 s32, s32, 0x400		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x400
; GCN-NEXT: s_mov_b32 s33, s7		; MUBUF-NEXT: s_mov_b32 s33, s7
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64 s[30:31]		; MUBUF-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: func_non_entry_block_static_alloca_align4:
		; FLATSCR: ; %bb.0: ; %entry
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: s_mov_b32 s9, s33
		; FLATSCR-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
		; FLATSCR-NEXT: s_mov_b32 s33, s32
		; FLATSCR-NEXT: s_add_u32 s32, s32, 16
		; FLATSCR-NEXT: s_and_saveexec_b64 s[4:5], vcc
		; FLATSCR-NEXT: s_cbranch_execz BB2_3
		; FLATSCR-NEXT: ; %bb.1: ; %bb.0
		; FLATSCR-NEXT: v_cmp_eq_u32_e32 vcc, 0, v3
		; FLATSCR-NEXT: s_and_b64 exec, exec, vcc
		; FLATSCR-NEXT: s_cbranch_execz BB2_3
		; FLATSCR-NEXT: ; %bb.2: ; %bb.1
		; FLATSCR-NEXT: s_mov_b32 s6, s32
		; FLATSCR-NEXT: s_movk_i32 s7, 0x1000
		; FLATSCR-NEXT: s_add_i32 s8, s6, s7
		; FLATSCR-NEXT: s_add_u32 s6, s6, s7
		; FLATSCR-NEXT: v_mov_b32_e32 v2, 0
		; FLATSCR-NEXT: scratch_store_dword off, v2, s6
		; FLATSCR-NEXT: v_mov_b32_e32 v2, 1
		; FLATSCR-NEXT: scratch_store_dword off, v2, s8 offset:4
		; FLATSCR-NEXT: v_lshl_add_u32 v2, v4, 2, s8
		; FLATSCR-NEXT: scratch_load_dword v2, v2, off
		; FLATSCR-NEXT: v_and_b32_e32 v3, 0x3ff, v5
		; FLATSCR-NEXT: s_mov_b32 s32, s8
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_add_u32_e32 v2, v2, v3
		; FLATSCR-NEXT: global_store_dword v[0:1], v2, off
		; FLATSCR-NEXT: BB2_3: ; %bb.2
		; FLATSCR-NEXT: s_or_b64 exec, exec, s[4:5]
		; FLATSCR-NEXT: v_mov_b32_e32 v0, 0
		; FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 16
		; FLATSCR-NEXT: s_mov_b32 s33, s9
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]

entry:		entry:
%cond0 = icmp eq i32 %arg.cond0, 0		%cond0 = icmp eq i32 %arg.cond0, 0
br i1 %cond0, label %bb.0, label %bb.2		br i1 %cond0, label %bb.0, label %bb.2

bb.0:		bb.0:
%alloca = alloca [16 x i32], align 4, addrspace(5)		%alloca = alloca [16 x i32], align 4, addrspace(5)
%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0		%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
[13 unchanged lines not shown]	bb.1:
br label %bb.2		br label %bb.2

bb.2:		bb.2:
store volatile i32 0, i32 addrspace(1)* undef		store volatile i32 0, i32 addrspace(1)* undef
ret void		ret void
}		}

define void @func_non_entry_block_static_alloca_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {		define void @func_non_entry_block_static_alloca_align64(i32 addrspace(1)* %out, i32 %arg.cond, i32 %in) {
; GCN-LABEL: func_non_entry_block_static_alloca_align64:		; MUBUF-LABEL: func_non_entry_block_static_alloca_align64:
; GCN: ; %bb.0: ; %entry		; MUBUF: ; %bb.0: ; %entry
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: s_add_u32 s4, s32, 0xfc0		; MUBUF-NEXT: s_add_u32 s4, s32, 0xfc0
; GCN-NEXT: s_mov_b32 s7, s33		; MUBUF-NEXT: s_mov_b32 s7, s33
; GCN-NEXT: s_and_b32 s33, s4, 0xfffff000		; MUBUF-NEXT: s_and_b32 s33, s4, 0xfffff000
; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2		; MUBUF-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
; GCN-NEXT: s_add_u32 s32, s32, 0x2000		; MUBUF-NEXT: s_add_u32 s32, s32, 0x2000
; GCN-NEXT: s_and_saveexec_b64 s[4:5], vcc		; MUBUF-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GCN-NEXT: s_cbranch_execz BB3_2		; MUBUF-NEXT: s_cbranch_execz BB3_2
; GCN-NEXT: ; %bb.1: ; %bb.0		; MUBUF-NEXT: ; %bb.1: ; %bb.0
; GCN-NEXT: s_add_i32 s6, s32, 0x1000		; MUBUF-NEXT: s_add_i32 s6, s32, 0x1000
; GCN-NEXT: s_and_b32 s6, s6, 0xfffff000		; MUBUF-NEXT: s_and_b32 s6, s6, 0xfffff000
; GCN-NEXT: v_mov_b32_e32 v2, 0		; MUBUF-NEXT: v_mov_b32_e32 v2, 0
; GCN-NEXT: v_mov_b32_e32 v5, s6		; MUBUF-NEXT: v_mov_b32_e32 v5, s6
; GCN-NEXT: buffer_store_dword v2, v5, s[0:3], 0 offen		; MUBUF-NEXT: buffer_store_dword v2, v5, s[0:3], 0 offen
; GCN-NEXT: v_mov_b32_e32 v2, 1		; MUBUF-NEXT: v_mov_b32_e32 v2, 1
; GCN-NEXT: buffer_store_dword v2, v5, s[0:3], 0 offen offset:4		; MUBUF-NEXT: buffer_store_dword v2, v5, s[0:3], 0 offen offset:4
; GCN-NEXT: v_lshl_add_u32 v2, v3, 2, s6		; MUBUF-NEXT: v_lshl_add_u32 v2, v3, 2, s6
; GCN-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen		; MUBUF-NEXT: buffer_load_dword v2, v2, s[0:3], 0 offen
; GCN-NEXT: v_and_b32_e32 v3, 0x3ff, v4		; MUBUF-NEXT: v_and_b32_e32 v3, 0x3ff, v4
; GCN-NEXT: s_mov_b32 s32, s6		; MUBUF-NEXT: s_mov_b32 s32, s6
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_add_u32_e32 v2, v2, v3		; MUBUF-NEXT: v_add_u32_e32 v2, v2, v3
; GCN-NEXT: global_store_dword v[0:1], v2, off		; MUBUF-NEXT: global_store_dword v[0:1], v2, off
; GCN-NEXT: BB3_2: ; %bb.1		; MUBUF-NEXT: BB3_2: ; %bb.1
; GCN-NEXT: s_or_b64 exec, exec, s[4:5]		; MUBUF-NEXT: s_or_b64 exec, exec, s[4:5]
; GCN-NEXT: v_mov_b32_e32 v0, 0		; MUBUF-NEXT: v_mov_b32_e32 v0, 0
; GCN-NEXT: global_store_dword v[0:1], v0, off		; MUBUF-NEXT: global_store_dword v[0:1], v0, off
; GCN-NEXT: s_sub_u32 s32, s32, 0x2000		; MUBUF-NEXT: s_sub_u32 s32, s32, 0x2000
; GCN-NEXT: s_mov_b32 s33, s7		; MUBUF-NEXT: s_mov_b32 s33, s7
; GCN-NEXT: s_waitcnt vmcnt(0)		; MUBUF-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: s_setpc_b64 s[30:31]		; MUBUF-NEXT: s_setpc_b64 s[30:31]
		;
		; FLATSCR-LABEL: func_non_entry_block_static_alloca_align64:
		; FLATSCR: ; %bb.0: ; %entry
		; FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
		; FLATSCR-NEXT: s_add_u32 s4, s32, 63
		; FLATSCR-NEXT: s_mov_b32 s7, s33
		; FLATSCR-NEXT: s_and_b32 s33, s4, 0xffffffc0
		; FLATSCR-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
		; FLATSCR-NEXT: s_add_u32 s32, s32, 0x80
		; FLATSCR-NEXT: s_and_saveexec_b64 s[4:5], vcc
		; FLATSCR-NEXT: s_cbranch_execz BB3_2
		; FLATSCR-NEXT: ; %bb.1: ; %bb.0
		; FLATSCR-NEXT: s_add_i32 s6, s32, 0x1000
		; FLATSCR-NEXT: s_and_b32 s6, s6, 0xfffff000
		; FLATSCR-NEXT: v_mov_b32_e32 v2, 0
		; FLATSCR-NEXT: scratch_store_dword off, v2, s6
		; FLATSCR-NEXT: v_mov_b32_e32 v2, 1
		; FLATSCR-NEXT: scratch_store_dword off, v2, s6 offset:4
		; FLATSCR-NEXT: v_lshl_add_u32 v2, v3, 2, s6
		; FLATSCR-NEXT: scratch_load_dword v2, v2, off
		; FLATSCR-NEXT: v_and_b32_e32 v3, 0x3ff, v4
		; FLATSCR-NEXT: s_mov_b32 s32, s6
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: v_add_u32_e32 v2, v2, v3
		; FLATSCR-NEXT: global_store_dword v[0:1], v2, off
		; FLATSCR-NEXT: BB3_2: ; %bb.1
		; FLATSCR-NEXT: s_or_b64 exec, exec, s[4:5]
		; FLATSCR-NEXT: v_mov_b32_e32 v0, 0
		; FLATSCR-NEXT: global_store_dword v[0:1], v0, off
		; FLATSCR-NEXT: s_sub_u32 s32, s32, 0x80
		; FLATSCR-NEXT: s_mov_b32 s33, s7
		; FLATSCR-NEXT: s_waitcnt vmcnt(0)
		; FLATSCR-NEXT: s_setpc_b64 s[30:31]
entry:		entry:
%cond = icmp eq i32 %arg.cond, 0		%cond = icmp eq i32 %arg.cond, 0
br i1 %cond, label %bb.0, label %bb.1		br i1 %cond, label %bb.0, label %bb.1

bb.0:		bb.0:
%alloca = alloca [16 x i32], align 64, addrspace(5)		%alloca = alloca [16 x i32], align 64, addrspace(5)
%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0		%gep0 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 0
%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1		%gep1 = getelementptr [16 x i32], [16 x i32] addrspace(5)* %alloca, i32 0, i32 1
[17 unchanged lines not shown]
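
The stack constants in the MUBUF and FLATSCR columns of `non-entry-alloca.ll` differ by a constant factor, which fits the summary's note that SP changes from unswizzled to swizzled. The sketch below reproduces that relationship numerically. It assumes a wave size of 64 (an inference from the numbers in the check lines, not something stated in this diff), and the helper names `unswizzled` and `align_up` are hypothetical.

```python
# Relating swizzled (per-lane) stack constants to unswizzled (per-wave) ones.
WAVE_SIZE = 64  # assumption: wave64, inferred from the check-line constants
MASK32 = 0xFFFFFFFF


def unswizzled(per_lane_bytes, wave_size=WAVE_SIZE):
    """Scale a per-lane byte count to the whole-wave value MUBUF uses."""
    return per_lane_bytes * wave_size


def align_up(value, alignment):
    """Round `value` up to a power-of-two `alignment` (32-bit result)."""
    return (value + alignment - 1) & ~(alignment - 1) & MASK32


# (FLATSCR constant, MUBUF constant) pairs copied from the check lines.
pairs = [
    (16, 0x400),      # kernel align4: s_mov_b32 s32, 16 vs s_movk_i32 s32, 0x400
    (64, 0x1000),     # kernel align64: s_mov_b32 s32, 64 vs s_movk_i32 s32, 0x1000
    (0x80, 0x2000),   # func align64: s_add_u32 s32, s32, 0x80 vs 0x2000
    (63, 0xFC0),      # func align64 realignment bias: 63 vs 0xfc0
]

if __name__ == "__main__":
    for flat, mubuf in pairs:
        print(f"{flat:#x} * {WAVE_SIZE} = {unswizzled(flat):#x} (MUBUF has {mubuf:#x})")
    # The realignment masks follow the same pattern: 64-byte per-lane
    # alignment corresponds to 4096-byte alignment of the unswizzled SP.
    print(f"{~(64 - 1) & MASK32:#x} vs {~(unswizzled(64) - 1) & MASK32:#x}")
```

For example, `align_up(s32, 64)` models the FLATSCR sequence `s_add_u32 s4, s32, 63` followed by `s_and_b32 s33, s4, 0xffffffc0`, while the MUBUF column performs the same rounding on values scaled by the wave size.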

llvm/test/CodeGen/AMDGPU/scratch-simple.ll

	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=verde -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,SI,SIVI %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=verde -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,SI,SIVI,MUBUF %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx803 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,VI,SIVI %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx803 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,VI,SIVI,MUBUF %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX9,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX9,GFX9_10,MUBUF,GFX9-MUBUF,GFX9_10-MUBUF %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -filetype=obj -amdgpu-use-divergent-register-indexing < %s | llvm-readobj -r - | FileCheck --check-prefix=RELS %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -filetype=obj -amdgpu-use-divergent-register-indexing < %s | llvm-readobj -r - | FileCheck --check-prefix=RELS %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W32,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W32,GFX9_10,MUBUF,GFX10_W32-MUBUF,GFX9_10-MUBUF %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global,+wavefrontsize64 -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W64,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global,+wavefrontsize64 -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W64,GFX9_10,MUBUF,GFX10_W64-MUBUF,GFX9_10-MUBUF %s
				; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX9,GFX9_10,FLATSCR,GFX9-FLATSCR %s
				; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W32,GFX9_10,FLATSCR,GFX10_W32-FLATSCR,GFX9_10-FLATSCR %s

	; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD0 0x0			; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD0 0x0
	; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD1 0x0			; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD1 0x0

	; This used to fail due to a v_add_i32 instruction with an illegal immediate			; This used to fail due to a v_add_i32 instruction with an illegal immediate
	; operand that was created during Local Stack Slot Allocation. Test case derived			; operand that was created during Local Stack Slot Allocation. Test case derived
	; from https://bugs.freedesktop.org/show_bug.cgi?id=96602			; from https://bugs.freedesktop.org/show_bug.cgi?id=96602
	;			;
	; GCN-LABEL: {{^}}ps_main:			; GCN-LABEL: {{^}}ps_main:

	; GCN-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; MUBUF-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; GCN-DAG: s_mov_b32 s1, SCRATCH_RSRC_DWORD1			; MUBUF-DAG: s_mov_b32 s1, SCRATCH_RSRC_DWORD1
	; GCN-DAG: s_mov_b32 s2, -1			; MUBUF-DAG: s_mov_b32 s2, -1
	; SI-DAG: s_mov_b32 s3, 0xe8f000			; SI-DAG: s_mov_b32 s3, 0xe8f000
	; VI-DAG: s_mov_b32 s3, 0xe80000			; VI-DAG: s_mov_b32 s3, 0xe80000
	; GFX9-DAG: s_mov_b32 s3, 0xe00000			; GFX9-MUBUF-DAG: s_mov_b32 s3, 0xe00000
	; GFX10_W32-DAG: s_mov_b32 s3, 0x31c16000			; GFX10_W32-MUBUF-DAG: s_mov_b32 s3, 0x31c16000
	; GFX10_W64-DAG: s_mov_b32 s3, 0x31e16000			; GFX10_W64-MUBUF-DAG: s_mov_b32 s3, 0x31e16000

				; FLATSCR-NOT: SCRATCH_RSRC_DWORD

				; GFX9-FLATSCR: s_mov_b32 [[SP:[^,]+]], 0
				; GFX9-FLATSCR: scratch_store_dword off, v2, [[SP]] offset:
				; GFX9-FLATSCR: s_mov_b32 [[SP:[^,]+]], 0
				; GFX9-FLATSCR: scratch_store_dword off, v2, [[SP]] offset:

				; GFX10-FLATSCR: scratch_store_dword off, v2, null offset:
				; GFX10-FLATSCR: scratch_store_dword off, v2, null offset:

	; GCN-DAG: v_lshlrev_b32_e32 [[BYTES:v[0-9]+]], 2, v0			; GCN-DAG: v_lshlrev_b32_e32 [[BYTES:v[0-9]+]], 2, v0
	; GCN-DAG: v_and_b32_e32 [[CLAMP_IDX:v[0-9]+]], 0x1fc, [[BYTES]]			; GCN-DAG: v_and_b32_e32 [[CLAMP_IDX:v[0-9]+]], 0x1fc, [[BYTES]]
	; GCN-NOT: s_mov_b32 s0			; GCN-NOT: s_mov_b32 s0

	; GCN-DAG: v_add{{_|_nc_}}{{i|u}}32_e32 [[HI_OFF:v[0-9]+]],{{.*}} 0x280, [[CLAMP_IDX]]			; GCN-DAG: v_add{{_|_nc_}}{{i|u}}32_e32 [[HI_OFF:v[0-9]+]],{{.*}} 0x280, [[CLAMP_IDX]]
	; GCN-DAG: v_add{{_|_nc_}}{{i|u}}32_e32 [[LO_OFF:v[0-9]+]],{{.*}} {{v2|0x80}}, [[CLAMP_IDX]]			; GCN-DAG: v_add{{_|_nc_}}{{i|u}}32_e32 [[LO_OFF:v[0-9]+]],{{.*}} {{v2|0x80}}, [[CLAMP_IDX]]

	; GCN: buffer_load_dword {{v[0-9]+}}, [[LO_OFF]], {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; MUBUF: buffer_load_dword {{v[0-9]+}}, [[LO_OFF]], {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GCN: buffer_load_dword {{v[0-9]+}}, [[HI_OFF]], {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; MUBUF: buffer_load_dword {{v[0-9]+}}, [[HI_OFF]], {{s\[[0-9]+:[0-9]+\]}}, 0 offen
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, [[LO_OFF]], off
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, [[HI_OFF]], off
	define amdgpu_ps float @ps_main(i32 %idx) {			define amdgpu_ps float @ps_main(i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

	; GCN-LABEL: {{^}}vs_main:			; GCN-LABEL: {{^}}vs_main:
	; GCN-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; MUBUF-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; GCN-NOT: s_mov_b32 s0			; GCN-NOT: s_mov_b32 s0
	; GCN: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GCN: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; FLATSCR-NOT: SCRATCH_RSRC_DWORD

				; MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
				; MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

				; GFX9-FLATSCR: s_mov_b32 [[SP:[^,]+]], 0
				; GFX9-FLATSCR: scratch_store_dword off, v2, [[SP]] offset:
				; GFX9-FLATSCR: s_mov_b32 [[SP:[^,]+]], 0
				; GFX9-FLATSCR: scratch_store_dword off, v2, [[SP]] offset:

				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off

	define amdgpu_vs float @vs_main(i32 %idx) {			define amdgpu_vs float @vs_main(i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

	; GCN-LABEL: {{^}}cs_main:			; GCN-LABEL: {{^}}cs_main:
	; GCN-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; MUBUF-DAG: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; GCN: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GCN: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; FLATSCR-NOT: SCRATCH_RSRC_DWORD

				; MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
				; MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
	define amdgpu_cs float @cs_main(i32 %idx) {			define amdgpu_cs float @cs_main(i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

	; GCN-LABEL: {{^}}hs_main:			; GCN-LABEL: {{^}}hs_main:
	; SIVI: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; SIVI: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; SIVI-NOT: s_mov_b32 s0			; SIVI-NOT: s_mov_b32 s0
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GFX9_10: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; GFX9_10-MUBUF: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; GFX9_10-NOT: s_mov_b32 s5			; GFX9_10-NOT: s_mov_b32 s5
	; GFX9_10: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GFX9_10: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

				; FLATSCR-NOT: SCRATCH_RSRC_DWORD
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
	define amdgpu_hs float @hs_main(i32 %idx) {			define amdgpu_hs float @hs_main(i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

	; GCN-LABEL: {{^}}gs_main:			; GCN-LABEL: {{^}}gs_main:
	; SIVI: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; SIVI: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GFX9_10: s_mov_b32 s0, SCRATCH_RSRC_DWORD0			; GFX9_10-MUBUF: s_mov_b32 s0, SCRATCH_RSRC_DWORD0
	; GFX9_10: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GFX9_10: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

				; FLATSCR-NOT: SCRATCH_RSRC_DWORD
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
	define amdgpu_gs float @gs_main(i32 %idx) {			define amdgpu_gs float @gs_main(i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

	; Mesa GS and HS shaders have the preloaded scratch wave offset SGPR fixed at			; Mesa GS and HS shaders have the preloaded scratch wave offset SGPR fixed at
	; SGPR5, and the inreg implementation is used to reference it in the IR. The			; SGPR5, and the inreg implementation is used to reference it in the IR. The
	; following tests confirm the shader and anything inserted after the return			; following tests confirm the shader and anything inserted after the return
	; (i.e. SI_RETURN_TO_EPILOG) can access the scratch wave offset.			; (i.e. SI_RETURN_TO_EPILOG) can access the scratch wave offset.

	; GCN-LABEL: {{^}}hs_ir_uses_scratch_offset:			; GCN-LABEL: {{^}}hs_ir_uses_scratch_offset:
	; GCN: s_mov_b32 s8, SCRATCH_RSRC_DWORD0			; MUBUF: s_mov_b32 s8, SCRATCH_RSRC_DWORD0
				; FLATSCR-NOT: SCRATCH_RSRC_DWORD

	; SIVI-NOT: s_mov_b32 s6			; SIVI-NOT: s_mov_b32 s6
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GFX9_10-NOT: s_mov_b32 s5			; GFX9_10-NOT: s_mov_b32 s5
	; GFX9_10-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GFX9_10-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GCN-DAG: s_mov_b32 s2, s5			; GCN-DAG: s_mov_b32 s2, s5

				; FLATSCR-DAG: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR-DAG: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
	define amdgpu_hs <{i32, i32, i32, float}> @hs_ir_uses_scratch_offset(i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg %swo, i32 %idx) {			define amdgpu_hs <{i32, i32, i32, float}> @hs_ir_uses_scratch_offset(i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg %swo, i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%f = fadd float %v1, %v2			%f = fadd float %v1, %v2
	%r1 = insertvalue <{i32, i32, i32, float}> undef, i32 %swo, 2			%r1 = insertvalue <{i32, i32, i32, float}> undef, i32 %swo, 2
	%r2 = insertvalue <{i32, i32, i32, float}> %r1, float %f, 3			%r2 = insertvalue <{i32, i32, i32, float}> %r1, float %f, 3
	ret <{i32, i32, i32, float}> %r2			ret <{i32, i32, i32, float}> %r2
	}			}

	; GCN-LABEL: {{^}}gs_ir_uses_scratch_offset:			; GCN-LABEL: {{^}}gs_ir_uses_scratch_offset:
	; GCN: s_mov_b32 s8, SCRATCH_RSRC_DWORD0			; MUBUF: s_mov_b32 s8, SCRATCH_RSRC_DWORD0
				; FLATSCR-NOT: SCRATCH_RSRC_DWORD

	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; SIVI: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GFX9_10-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen
	; GFX9_10-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen			; GFX9_10-MUBUF-DAG: buffer_load_dword {{v[0-9]+}}, {{v[0-9]+}}, {{s\[[0-9]+:[0-9]+\]}}, 0 offen

	; GCN-DAG: s_mov_b32 s2, s5			; GCN-DAG: s_mov_b32 s2, s5

				; FLATSCR-DAG: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
				; FLATSCR-DAG: scratch_load_dword {{v[0-9]+}}, {{v[0-9]+}}, off
	define amdgpu_gs <{i32, i32, i32, float}> @gs_ir_uses_scratch_offset(i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg %swo, i32 %idx) {			define amdgpu_gs <{i32, i32, i32, float}> @gs_ir_uses_scratch_offset(i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg, i32 inreg %swo, i32 %idx) {
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%f = fadd float %v1, %v2			%f = fadd float %v1, %v2
	%r1 = insertvalue <{i32, i32, i32, float}> undef, i32 %swo, 2			%r1 = insertvalue <{i32, i32, i32, float}> undef, i32 %swo, 2
	%r2 = insertvalue <{i32, i32, i32, float}> %r1, float %f, 3			%r2 = insertvalue <{i32, i32, i32, float}> %r1, float %f, 3
	ret <{i32, i32, i32, float}> %r2			ret <{i32, i32, i32, float}> %r2
	}			}

llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | FileCheck -check-prefix=GCN %s			; RUN: llc < %s -march=amdgcn -mcpu=gfx1010 -verify-machineinstrs | FileCheck -check-prefix=MUBUF %s
				; RUN: llc < %s -march=amdgcn -mcpu=gfx1010 -amdgpu-enable-flat-scratch -verify-machineinstrs | FileCheck -check-prefix=FLATSCR %s

	; FIXME: The MUBUF loads in this test output are incorrect, their SOffset			; FIXME: The MUBUF loads in this test output are incorrect, their SOffset
	; should use the frame offset register, not the ABI stack pointer register. We			; should use the frame offset register, not the ABI stack pointer register. We
	; rely on the frame index argument of MUBUF stack accesses to survive until PEI			; rely on the frame index argument of MUBUF stack accesses to survive until PEI
	; so we can fix up the SOffset to use the correct frame register in			; so we can fix up the SOffset to use the correct frame register in
	; eliminateFrameIndex. Some things like LocalStackSlotAllocation can lift the			; eliminateFrameIndex. Some things like LocalStackSlotAllocation can lift the
	; frame index up into something (e.g. `v_add_nc_u32`) that we cannot fold back			; frame index up into something (e.g. `v_add_nc_u32`) that we cannot fold back
	; into the MUBUF instruction, and so we end up emitting an incorrect offset.			; into the MUBUF instruction, and so we end up emitting an incorrect offset.
	; Fixing this may involve adding stack access pseudos so that we don't have to			; Fixing this may involve adding stack access pseudos so that we don't have to
	; speculatively refer to the ABI stack pointer register at all.			; speculatively refer to the ABI stack pointer register at all.

	; An assert was hit when frame offset register was used to address FrameIndex.			; An assert was hit when frame offset register was used to address FrameIndex.
	define amdgpu_kernel void @kernel_background_evaluate(float addrspace(5)* %kg, <4 x i32> addrspace(1)* %input, <4 x float> addrspace(1)* %output, i32 %i) {			define amdgpu_kernel void @kernel_background_evaluate(float addrspace(5)* %kg, <4 x i32> addrspace(1)* %input, <4 x float> addrspace(1)* %output, i32 %i) {
	; GCN-LABEL: kernel_background_evaluate:			; MUBUF-LABEL: kernel_background_evaluate:
	; GCN: ; %bb.0: ; %entry			; MUBUF: ; %bb.0: ; %entry
	; GCN-NEXT: s_load_dword s0, s[0:1], 0x24			; MUBUF-NEXT: s_load_dword s0, s[0:1], 0x24
	; GCN-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0			; MUBUF-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0
	; GCN-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1			; MUBUF-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1
	; GCN-NEXT: s_mov_b32 s38, -1			; MUBUF-NEXT: s_mov_b32 s38, -1
	; GCN-NEXT: s_mov_b32 s39, 0x31c16000			; MUBUF-NEXT: s_mov_b32 s39, 0x31c16000
	; GCN-NEXT: s_add_u32 s36, s36, s3			; MUBUF-NEXT: s_add_u32 s36, s36, s3
	; GCN-NEXT: s_addc_u32 s37, s37, 0			; MUBUF-NEXT: s_addc_u32 s37, s37, 0
	; GCN-NEXT: v_mov_b32_e32 v1, 0x2000			; MUBUF-NEXT: v_mov_b32_e32 v1, 0x2000
	; GCN-NEXT: v_mov_b32_e32 v2, 0x4000			; MUBUF-NEXT: v_mov_b32_e32 v2, 0x4000
	; GCN-NEXT: v_mov_b32_e32 v3, 0			; MUBUF-NEXT: v_mov_b32_e32 v3, 0
	; GCN-NEXT: v_mov_b32_e32 v4, 0x400000			; MUBUF-NEXT: v_mov_b32_e32 v4, 0x400000
	; GCN-NEXT: s_mov_b32 s32, 0xc0000			; MUBUF-NEXT: s_mov_b32 s32, 0xc0000
	; GCN-NEXT: v_add_nc_u32_e64 v40, 4, 0x4000			; MUBUF-NEXT: v_add_nc_u32_e64 v40, 4, 0x4000
	; GCN-NEXT: ; implicit-def: $vcc_hi			; MUBUF-NEXT: ; implicit-def: $vcc_hi
	; GCN-NEXT: s_getpc_b64 s[4:5]			; MUBUF-NEXT: s_getpc_b64 s[4:5]
	; GCN-NEXT: s_add_u32 s4, s4, svm_eval_nodes@rel32@lo+4			; MUBUF-NEXT: s_add_u32 s4, s4, svm_eval_nodes@rel32@lo+4
	; GCN-NEXT: s_addc_u32 s5, s5, svm_eval_nodes@rel32@hi+12			; MUBUF-NEXT: s_addc_u32 s5, s5, svm_eval_nodes@rel32@hi+12
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; MUBUF-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v0, s0			; MUBUF-NEXT: v_mov_b32_e32 v0, s0
	; GCN-NEXT: s_mov_b64 s[0:1], s[36:37]			; MUBUF-NEXT: s_mov_b64 s[0:1], s[36:37]
	; GCN-NEXT: s_mov_b64 s[2:3], s[38:39]			; MUBUF-NEXT: s_mov_b64 s[2:3], s[38:39]
	; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]			; MUBUF-NEXT: s_swappc_b64 s[30:31], s[4:5]
	; GCN-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0			; MUBUF-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0
	; GCN-NEXT: s_and_saveexec_b32 s0, vcc_lo			; MUBUF-NEXT: s_and_saveexec_b32 s0, vcc_lo
	; GCN-NEXT: s_cbranch_execz BB0_2			; MUBUF-NEXT: s_cbranch_execz BB0_2
	; GCN-NEXT: ; %bb.1: ; %if.then4.i			; MUBUF-NEXT: ; %bb.1: ; %if.then4.i
	; GCN-NEXT: s_clause 0x1			; MUBUF-NEXT: s_clause 0x1
	; GCN-NEXT: buffer_load_dword v0, v40, s[36:39], 0 offen			; MUBUF-NEXT: buffer_load_dword v0, v40, s[36:39], 0 offen
	; GCN-NEXT: buffer_load_dword v1, v40, s[36:39], 0 offen offset:4			; MUBUF-NEXT: buffer_load_dword v1, v40, s[36:39], 0 offen offset:4
	; GCN-NEXT: s_waitcnt vmcnt(0)			; MUBUF-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_add_nc_u32_e32 v0, v1, v0			; MUBUF-NEXT: v_add_nc_u32_e32 v0, v1, v0
	; GCN-NEXT: v_mul_lo_u32 v0, 0x41c64e6d, v0			; MUBUF-NEXT: v_mul_lo_u32 v0, 0x41c64e6d, v0
	; GCN-NEXT: v_add_nc_u32_e32 v0, 0x3039, v0			; MUBUF-NEXT: v_add_nc_u32_e32 v0, 0x3039, v0
	; GCN-NEXT: buffer_store_dword v0, v0, s[36:39], 0 offen			; MUBUF-NEXT: buffer_store_dword v0, v0, s[36:39], 0 offen
	; GCN-NEXT: BB0_2: ; %shader_eval_surface.exit			; MUBUF-NEXT: BB0_2: ; %shader_eval_surface.exit
	; GCN-NEXT: s_endpgm			; MUBUF-NEXT: s_endpgm
				;
				; FLATSCR-LABEL: kernel_background_evaluate:
				; FLATSCR: ; %bb.0: ; %entry
				; FLATSCR-NEXT: s_load_dword s0, s[0:1], 0x24
				; FLATSCR-NEXT: s_mov_b32 s36, SCRATCH_RSRC_DWORD0
				; FLATSCR-NEXT: s_mov_b32 s37, SCRATCH_RSRC_DWORD1
				; FLATSCR-NEXT: s_mov_b32 s38, -1
				; FLATSCR-NEXT: s_mov_b32 s39, 0x31c16000
				; FLATSCR-NEXT: s_add_u32 s36, s36, s3
				; FLATSCR-NEXT: s_addc_u32 s37, s37, 0
				; FLATSCR-NEXT: v_mov_b32_e32 v1, 0x2000
				; FLATSCR-NEXT: v_mov_b32_e32 v2, 0x4000
				; FLATSCR-NEXT: v_mov_b32_e32 v3, 0
				; FLATSCR-NEXT: v_mov_b32_e32 v4, 0x400000
				; FLATSCR-NEXT: s_movk_i32 s32, 0x6000
				; FLATSCR-NEXT: ; implicit-def: $vcc_hi
				; FLATSCR-NEXT: s_getpc_b64 s[4:5]
				; FLATSCR-NEXT: s_add_u32 s4, s4, svm_eval_nodes@rel32@lo+4
				; FLATSCR-NEXT: s_addc_u32 s5, s5, svm_eval_nodes@rel32@hi+12
				; FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
				; FLATSCR-NEXT: v_mov_b32_e32 v0, s0
				; FLATSCR-NEXT: s_mov_b64 s[0:1], s[36:37]
				; FLATSCR-NEXT: s_mov_b64 s[2:3], s[38:39]
				; FLATSCR-NEXT: s_swappc_b64 s[30:31], s[4:5]
				; FLATSCR-NEXT: v_cmp_ne_u32_e32 vcc_lo, 0, v0
				; FLATSCR-NEXT: s_and_saveexec_b32 s0, vcc_lo
				; FLATSCR-NEXT: s_cbranch_execz BB0_2
				; FLATSCR-NEXT: ; %bb.1: ; %if.then4.i
				; FLATSCR-NEXT: s_movk_i32 vcc_lo, 0x4000
				; FLATSCR-NEXT: s_nop 0
				; FLATSCR-NEXT: s_nop 0
				; FLATSCR-NEXT: scratch_load_dword v0, off, vcc_lo offset:4
				; FLATSCR-NEXT: s_waitcnt_depctr 0xffe3
				; FLATSCR-NEXT: s_movk_i32 vcc_lo, 0x4000
				; FLATSCR-NEXT: scratch_load_dword v1, off, vcc_lo offset:8
				; FLATSCR-NEXT: s_waitcnt vmcnt(0)
				; FLATSCR-NEXT: v_add_nc_u32_e32 v0, v1, v0
				; FLATSCR-NEXT: v_mul_lo_u32 v0, 0x41c64e6d, v0
				; FLATSCR-NEXT: v_add_nc_u32_e32 v0, 0x3039, v0
				; FLATSCR-NEXT: scratch_store_dword off, v0, s0
				; FLATSCR-NEXT: BB0_2: ; %shader_eval_surface.exit
				; FLATSCR-NEXT: s_endpgm
	entry:			entry:
	%sd = alloca < 1339 x i32>, align 8192, addrspace(5)			%sd = alloca < 1339 x i32>, align 8192, addrspace(5)
	%state = alloca <4 x i32>, align 16, addrspace(5)			%state = alloca <4 x i32>, align 16, addrspace(5)
	%rslt = call i32 @svm_eval_nodes(float addrspace(5)* %kg, <1339 x i32> addrspace(5)* %sd, <4 x i32> addrspace(5)* %state, i32 0, i32 4194304)			%rslt = call i32 @svm_eval_nodes(float addrspace(5)* %kg, <1339 x i32> addrspace(5)* %sd, <4 x i32> addrspace(5)* %state, i32 0, i32 4194304)
	%cmp = icmp eq i32 %rslt, 0			%cmp = icmp eq i32 %rslt, 0
	br i1 %cmp, label %shader_eval_surface.exit, label %if.then4.i			br i1 %cmp, label %shader_eval_surface.exit, label %if.then4.i

	if.then4.i: ; preds = %entry			if.then4.i: ; preds = %entry
	[16 lines not shown]

llvm/test/CodeGen/AMDGPU/store-hi16.ll

; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX900,GFX9 %s		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX900,GFX9,GFX900-MUBUF %s
; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca,+sram-ecc -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX906,GFX9,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=gfx906 -amdgpu-sroa=0 -mattr=-promote-alloca,+sram-ecc -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX906,GFX9,NO-D16-HI %s
; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX803,NO-D16-HI %s		; RUN: llc -march=amdgcn -mcpu=fiji -amdgpu-sroa=0 -mattr=-promote-alloca -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX803,NO-D16-HI %s
		; RUN: llc -march=amdgcn -mcpu=gfx900 -amdgpu-sroa=0 -mattr=-promote-alloca -amdgpu-enable-flat-scratch -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap -check-prefixes=GCN,GFX900,GFX9,GFX900-FLATSCR %s

; GCN-LABEL: {{^}}store_global_hi_v2i16:		; GCN-LABEL: {{^}}store_global_hi_v2i16:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: global_store_short_d16_hi v[0:1], v2, off		; GFX900-NEXT: global_store_short_d16_hi v[0:1], v2, off

; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v2, 16, v2		; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v2, 16, v2
; GFX803-NEXT: flat_store_short v[0:1], v2		; GFX803-NEXT: flat_store_short v[0:1], v2
[372 lines not shown]
%gep = getelementptr inbounds i8, i8* %out, i64 -4095		%gep = getelementptr inbounds i8, i8* %out, i64 -4095
store i8 %trunc, i8* %gep		store i8 %trunc, i8* %gep
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2i16:		; GCN-LABEL: {{^}}store_private_hi_v2i16:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}		; GFX900-MUBUF-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}
		; GFX900-FLATSCR-NEXT: scratch_store_short_d16_hi v1, v0, off

; NO-D16-HI: v_lshrrev_b32_e32 v1, 16, v1		; NO-D16-HI: v_lshrrev_b32_e32 v1, 16, v1
; NO-D16-HI: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}		; NO-D16-HI: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2i16(i16 addrspace(5)* %out, i32 %arg) #0 {		define void @store_private_hi_v2i16(i16 addrspace(5)* %out, i32 %arg) #0 {
entry:		entry:
; FIXME: ABI for pre-gfx9		; FIXME: ABI for pre-gfx9
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
store i16 %hi, i16 addrspace(5)* %out		store i16 %hi, i16 addrspace(5)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2f16:		; GCN-LABEL: {{^}}store_private_hi_v2f16:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}		; GFX900-MUBUF-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}
		; GFX900-FLATSCR-NEXT: scratch_store_short_d16_hi v1, v0, off{{$}}

; NO-D16-HI: v_lshrrev_b32_e32 v1, 16, v1		; NO-D16-HI: v_lshrrev_b32_e32 v1, 16, v1
; NO-D16-HI: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}		; NO-D16-HI: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2f16(half addrspace(5)* %out, i32 %arg) #0 {		define void @store_private_hi_v2f16(half addrspace(5)* %out, i32 %arg) #0 {
entry:		entry:
; FIXME: ABI for pre-gfx9		; FIXME: ABI for pre-gfx9
%value = bitcast i32 %arg to <2 x half>		%value = bitcast i32 %arg to <2 x half>
%hi = extractelement <2 x half> %value, i32 1		%hi = extractelement <2 x half> %value, i32 1
store half %hi, half addrspace(5)* %out		store half %hi, half addrspace(5)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_i32_shift:		; GCN-LABEL: {{^}}store_private_hi_i32_shift:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}		; GFX900-MUBUF-NEXT: buffer_store_short_d16_hi v1, v0, s[0:3], 0 offen{{$}}
		; GFX900-FLATSCR-NEXT: scratch_store_short_d16_hi v1, v0, off{{$}}

; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1		; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1
; NO-D16-HI-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}		; NO-D16-HI-NEXT: buffer_store_short v1, v0, s[0:3], 0 offen{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_i32_shift(i16 addrspace(5)* %out, i32 %value) #0 {		define void @store_private_hi_i32_shift(i16 addrspace(5)* %out, i32 %value) #0 {
entry:		entry:
%hi32 = lshr i32 %value, 16		%hi32 = lshr i32 %value, 16
%hi = trunc i32 %hi32 to i16		%hi = trunc i32 %hi32 to i16
store i16 %hi, i16 addrspace(5)* %out		store i16 %hi, i16 addrspace(5)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2i16_i8:		; GCN-LABEL: {{^}}store_private_hi_v2i16_i8:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_byte_d16_hi v1, v0, s[0:3], 0 offen{{$}}		; GFX900-MUBUF-NEXT: buffer_store_byte_d16_hi v1, v0, s[0:3], 0 offen{{$}}
		; GFX900-FLATSCR-NEXT: scratch_store_byte_d16_hi v1, v0, off{{$}}

; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1		; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1
; NO-D16-HI-NEXT: buffer_store_byte v1, v0, s[0:3], 0 offen{{$}}		; NO-D16-HI-NEXT: buffer_store_byte v1, v0, s[0:3], 0 offen{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2i16_i8(i8 addrspace(5)* %out, i32 %arg) #0 {		define void @store_private_hi_v2i16_i8(i8 addrspace(5)* %out, i32 %arg) #0 {
entry:		entry:
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
%trunc = trunc i16 %hi to i8		%trunc = trunc i16 %hi to i8
store i8 %trunc, i8 addrspace(5)* %out		store i8 %trunc, i8 addrspace(5)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_i8_shift:		; GCN-LABEL: {{^}}store_private_hi_i8_shift:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_byte_d16_hi v1, v0, s[0:3], 0 offen{{$}}		; GFX900-MUBUF-NEXT: buffer_store_byte_d16_hi v1, v0, s[0:3], 0 offen{{$}}
		; GFX900-FLATSCR-NEXT: scratch_store_byte_d16_hi v1, v0, off{{$}}

; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1		; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v1, 16, v1
; NO-D16-HI-NEXT: buffer_store_byte v1, v0, s[0:3], 0 offen{{$}}		; NO-D16-HI-NEXT: buffer_store_byte v1, v0, s[0:3], 0 offen{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_i8_shift(i8 addrspace(5)* %out, i32 %value) #0 {		define void @store_private_hi_i8_shift(i8 addrspace(5)* %out, i32 %value) #0 {
entry:		entry:
%hi32 = lshr i32 %value, 16		%hi32 = lshr i32 %value, 16
%hi = trunc i32 %hi32 to i8		%hi = trunc i32 %hi32 to i8
store i8 %hi, i8 addrspace(5)* %out		store i8 %hi, i8 addrspace(5)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2i16_max_offset:		; GCN-LABEL: {{^}}store_private_hi_v2i16_max_offset:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_store_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}		; GFX900-MUBUF: buffer_store_short_d16_hi v0, off, s[0:3], s32 offset:4094{{$}}
		; GFX900-FLATSCR: scratch_store_short_d16_hi off, v0, s32 offset:4094{{$}}

; NO-D16-HI: v_lshrrev_b32_e32 v0, 16, v0		; NO-D16-HI: v_lshrrev_b32_e32 v0, 16, v0
; NO-D16-HI-NEXT: buffer_store_short v0, off, s[0:3], s32 offset:4094{{$}}		; NO-D16-HI-NEXT: buffer_store_short v0, off, s[0:3], s32 offset:4094{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2i16_max_offset(i16 addrspace(5)* byval %out, i32 %arg) #0 {		define void @store_private_hi_v2i16_max_offset(i16 addrspace(5)* byval %out, i32 %arg) #0 {
entry:		entry:
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
%gep = getelementptr inbounds i16, i16 addrspace(5)* %out, i64 2047		%gep = getelementptr inbounds i16, i16 addrspace(5)* %out, i64 2047
store i16 %hi, i16 addrspace(5)* %gep		store i16 %hi, i16 addrspace(5)* %gep
ret void		ret void
}		}



; GCN-LABEL: {{^}}store_private_hi_v2i16_nooff:		; GCN-LABEL: {{^}}store_private_hi_v2i16_nooff:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], 0{{$}}		; GFX900-MUBUF-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], 0{{$}}
		; GFX900-FLATSCR-NEXT: s_mov_b32 [[SOFF:s[0-9]+]], 0
		; GFX900-FLATSCR-NEXT: scratch_store_short_d16_hi off, v0, [[SOFF]]{{$}}

; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v0, 16, v0		; NO-D16-HI-NEXT: v_lshrrev_b32_e32 v0, 16, v0
; NO-D16-HI-NEXT: buffer_store_short v0, off, s[0:3], 0{{$}}		; NO-D16-HI-NEXT: buffer_store_short v0, off, s[0:3], 0{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2i16_nooff(i32 %arg) #0 {		define void @store_private_hi_v2i16_nooff(i32 %arg) #0 {
entry:		entry:
; FIXME: ABI for pre-gfx9		; FIXME: ABI for pre-gfx9
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
store volatile i16 %hi, i16 addrspace(5)* null		store volatile i16 %hi, i16 addrspace(5)* null
ret void		ret void
}		}


; GCN-LABEL: {{^}}store_private_hi_v2i16_i8_nooff:		; GCN-LABEL: {{^}}store_private_hi_v2i16_i8_nooff:
; GCN: s_waitcnt		; GCN: s_waitcnt

; GFX900-NEXT: buffer_store_byte_d16_hi v0, off, s[0:3], 0{{$}}		; GFX900-MUBUF-NEXT: buffer_store_byte_d16_hi v0, off, s[0:3], 0{{$}}
		; GFX900-FLATSCR-NEXT: s_mov_b32 [[SOFF:s[0-9]+]], 0
		; GFX900-FLATSCR-NEXT: scratch_store_byte_d16_hi off, v0, [[SOFF]]{{$}}

; NO-D16-HI: v_lshrrev_b32_e32 v0, 16, v0		; NO-D16-HI: v_lshrrev_b32_e32 v0, 16, v0
; NO-D16-HI: buffer_store_byte v0, off, s[0:3], 0{{$}}		; NO-D16-HI: buffer_store_byte v0, off, s[0:3], 0{{$}}

; GCN-NEXT: s_waitcnt		; GCN-NEXT: s_waitcnt
; GCN-NEXT: s_setpc_b64		; GCN-NEXT: s_setpc_b64
define void @store_private_hi_v2i16_i8_nooff(i32 %arg) #0 {		define void @store_private_hi_v2i16_i8_nooff(i32 %arg) #0 {
entry:		entry:
[95 lines not shown]
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
%gep = getelementptr inbounds i16, i16 addrspace(3)* %out, i64 32767		%gep = getelementptr inbounds i16, i16 addrspace(3)* %out, i64 32767
store i16 %hi, i16 addrspace(3)* %gep		store i16 %hi, i16 addrspace(3)* %gep
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2i16_to_offset:		; GCN-LABEL: {{^}}store_private_hi_v2i16_to_offset:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_store_dword		; GFX900-MUBUF: buffer_store_dword
; GFX900-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32 offset:4094		; GFX900-MUBUF-NEXT: buffer_store_short_d16_hi v0, off, s[0:3], s32 offset:4094
		; GFX900-FLATSCR: scratch_store_dword
		; GFX900-FLATSCR-NEXT: scratch_store_short_d16_hi off, v0, s32 offset:4094
define void @store_private_hi_v2i16_to_offset(i32 %arg) #0 {		define void @store_private_hi_v2i16_to_offset(i32 %arg) #0 {
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i16], align 2, addrspace(5)		%obj1 = alloca [4096 x i16], align 2, addrspace(5)
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027		%gep = getelementptr inbounds [4096 x i16], [4096 x i16] addrspace(5)* %obj1, i32 0, i32 2027
store i16 %hi, i16 addrspace(5)* %gep		store i16 %hi, i16 addrspace(5)* %gep
ret void		ret void
}		}

; GCN-LABEL: {{^}}store_private_hi_v2i16_i8_to_offset:		; GCN-LABEL: {{^}}store_private_hi_v2i16_i8_to_offset:
; GCN: s_waitcnt		; GCN: s_waitcnt
; GFX900: buffer_store_dword		; GFX900-MUBUF: buffer_store_dword
; GFX900-NEXT: buffer_store_byte_d16_hi v0, off, s[0:3], s32 offset:4095		; GFX900-MUBUF-NEXT: buffer_store_byte_d16_hi v0, off, s[0:3], s32 offset:4095
		; GFX900-FLATSCR: scratch_store_dword
		; GFX900-FLATSCR-NEXT: scratch_store_byte_d16_hi off, v0, s32 offset:4095
define void @store_private_hi_v2i16_i8_to_offset(i32 %arg) #0 {		define void @store_private_hi_v2i16_i8_to_offset(i32 %arg) #0 {
entry:		entry:
%obj0 = alloca [10 x i32], align 4, addrspace(5)		%obj0 = alloca [10 x i32], align 4, addrspace(5)
%obj1 = alloca [4096 x i8], align 2, addrspace(5)		%obj1 = alloca [4096 x i8], align 2, addrspace(5)
%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*		%bc = bitcast [10 x i32] addrspace(5)* %obj0 to i32 addrspace(5)*
store volatile i32 123, i32 addrspace(5)* %bc		store volatile i32 123, i32 addrspace(5)* %bc
%value = bitcast i32 %arg to <2 x i16>		%value = bitcast i32 %arg to <2 x i16>
%hi = extractelement <2 x i16> %value, i32 1		%hi = extractelement <2 x i16> %value, i32 1
%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055		%gep = getelementptr inbounds [4096 x i8], [4096 x i8] addrspace(5)* %obj1, i32 0, i32 4055
%trunc = trunc i16 %hi to i8		%trunc = trunc i16 %hi to i8
store i8 %trunc, i8 addrspace(5)* %gep		store i8 %trunc, i8 addrspace(5)* %gep
ret void		ret void
}		}

attributes #0 = { nounwind }		attributes #0 = { nounwind }
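
The immediate offsets checked in the three offset tests above (`store_private_hi_v2i16_max_offset`, `store_private_hi_v2i16_to_offset`, `store_private_hi_v2i16_i8_to_offset`) can be reproduced by hand. The following is only a sketch of that arithmetic, not part of the patch. It assumes that the allocas are laid out in declaration order starting at frame offset 0, which the diff does not state but which is consistent with the expected `offset:` values. The names `byte_offset`, `OBJ0_SIZE` and `MAX_CHECKED_OFFSET` are illustrative.

```python
# Worked arithmetic for the *_max_offset and *_to_offset tests.
# Assumption (not stated in the diff): obj0 sits at frame offset 0 and
# obj1 starts immediately after it.

MAX_CHECKED_OFFSET = 4095  # largest immediate offset appearing in the checks


def byte_offset(base: int, index: int, elem_size: int) -> int:
    """Byte offset of element `index` in an array that starts at `base`."""
    return base + index * elem_size


OBJ0_SIZE = 10 * 4  # %obj0 = alloca [10 x i32] occupies 40 bytes

# store_private_hi_v2i16_max_offset: i16 GEP index 2047 from the byval base
max_offset_i16 = byte_offset(0, 2047, 2)

# store_private_hi_v2i16_to_offset: [4096 x i16] index 2027, after obj0
to_offset_i16 = byte_offset(OBJ0_SIZE, 2027, 2)

# store_private_hi_v2i16_i8_to_offset: [4096 x i8] index 4055, after obj0
to_offset_i8 = byte_offset(OBJ0_SIZE, 4055, 1)

# None of the tests exceeds the largest offset used in the check lines.
assert max(max_offset_i16, to_offset_i16, to_offset_i8) <= MAX_CHECKED_OFFSET

print(max_offset_i16, to_offset_i16, to_offset_i8)
```

Under that layout assumption the results are 4094, 4094 and 4095, matching the `offset:4094` and `offset:4095` operands expected for both the MUBUF and the flat scratch forms.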

Revision Contents

Diff 297963
