This is an archive of the discontinued LLVM Phabricator instance.

Differential D28920

DAG: Allow targets to override stack temp alignment
Needs Review · Public

Authored by arsenm on Jan 19 2017, 2:29 PM.

This revision needs review, but all specified reviewers are disabled or inactive.

Details

Reviewers

• tstellarAMD

Summary

AMDGPU doesn't want/need stack alignment > 4 ever. The
DataLayout's preferred alignment cannot be lower than the ABI
required alignment, which defaults to the type size. For
stack temporaries the ABI alignment constraints do not matter,
so request align 4 in all cases to save space.

This will mitigate regressing the stack usage of every program
in a future commit.
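
To make the space saving concrete, here is a minimal sketch of the frame-layout arithmetic. It is not code from this revision: the helper names are hypothetical, the preferred alignment is modelled as the type size, and the 4-byte object placed ahead of the temporary is an assumption chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round Offset up to the next multiple of Align (Align must be a power of two).
static uint64_t alignTo(uint64_t Offset, uint64_t Align) {
  return (Offset + Align - 1) & ~(Align - 1);
}

// Hypothetical stand-in for the default hook: the preferred type alignment
// (modelled here as the type size in bytes), raised to at least MinAlign.
static unsigned defaultStackTempAlign(unsigned TypeSize, unsigned MinAlign = 1) {
  return std::max(TypeSize, MinAlign);
}

// Hypothetical stand-in for the AMDGPU override: 4 bytes unless the caller
// explicitly asks for more.
static unsigned amdgpuStackTempAlign(unsigned /*TypeSize*/, unsigned MinAlign = 1) {
  return std::max(4u, MinAlign);
}

// Frame size after placing an object of LeadingBytes at offset 0, followed by
// one temporary of TempBytes at the given alignment.
static uint64_t frameSize(uint64_t LeadingBytes, uint64_t TempBytes,
                          unsigned Align) {
  return alignTo(LeadingBytes, Align) + TempBytes;
}
```

Under these assumptions, a 32-byte temporary (such as <4 x double>) behind a 4-byte object needs 64 bytes at align 32 but 36 bytes at align 4, and a 64-byte temporary needs 128 versus 68 bytes. Those figures line up with the ScratchSize changes in insert_vector_elt.ll, although the actual frame contents are not shown in this revision.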

Event Timeline

arsenm created this revision. Jan 19 2017, 2:29 PM

Herald added a reviewer: tstellarAMD. Jan 19 2017, 2:29 PM

Herald added subscribers: nhaehnle, wdng, qcolombet.

arsenm added a child revision: D28936: AMDGPU: Always allocate emergency stack slot at offset 0. Jan 19 2017, 9:21 PM

ping

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

In D28920#656679, @hfinkel wrote:

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

In D28920#656682, @arsenm wrote:

In D28920#656679, @hfinkel wrote:

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

Another issue is this would also vary per address space to some degree. For AMDGPU we would like 8 byte alignment, but only in addrspace(3)

In D28920#656682, @arsenm wrote:

In D28920#656679, @hfinkel wrote:

The DataLayout's preferred alignment cannot be lower than the ABI required alignment, which defaults to the type size

Why not change that?

I thought about that, but it seems like a risky, IR breaking change. I assume most places using this are using this to get a bigger alignment than required and assuming it is a legal ABI alignment. In codegen there is more freedom to decide what the ABI really means. I can try changing this and see what breaks.

Yea; I figured that we'd end up replacing many uses of preferred alignment with some min-of-preferred-or-abi-alignment utility function if we did that.

Another issue is this would also vary per address space to some degree. For AMDGPU we would like 8 byte alignment, but only in addrspace(3)

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

In D28920#656702, @hfinkel wrote:

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so it wouldn't need to be seen in the IR).

I'm not sure it's a frontend issue? The LangRef doesn't really define what the ABI alignment of a type means. It seems the few uses of getPrefTypeAlignment in clang (and all the ones I've looked at so far in llvm) are used to create allocas and loads/stores from them.

The DataLayout has getPreferredAlignment(const GlobalVariable *GV), which appears to return the preferred type alignment only if the global has no explicit alignment set. It is only used in a handful of places.

In D28920#657029, @arsenm wrote:

In D28920#656702, @hfinkel wrote:

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so it wouldn't need to be seen in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

I'm not sure it's a frontend issue? The LangRef doesn't really define what the ABI alignment of a type means. It seems the few uses of getPrefTypeAlignment in clang (and all the ones I've looked at so far in llvm) are used to create allocas and loads/stores from them.

The DataLayout has getPreferredAlignment(const GlobalVariable *GV), which appears to return the preferred type alignment only if the global has no explicit alignment set. It is only used in a handful of places.

In D28920#657526, @hfinkel wrote:

In D28920#657029, @arsenm wrote:

In D28920#656702, @hfinkel wrote:

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so it wouldn't need to be seen in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

Not really. Local memory doesn't behave like a stack and is just a block of memory allocated for the workgroup for the entire program. Accessing private memory is much slower, so in some cases for spills and small stack objects we could potentially optimize by writing them there instead. We don't support calls currently, and OpenCL explicitly forbids recursion so I haven't spent much time worrying about that.

In D28920#659064, @arsenm wrote:

In D28920#657526, @hfinkel wrote:

In D28920#657029, @arsenm wrote:

In D28920#656702, @hfinkel wrote:

Unless you'd like to extend stack temps to be put in other address spaces, isn't that a frontend issue (i.e. applies only to globals)?

This is something I would like to be able to do someday (but I think this would be purely a codegen problem, so it wouldn't need to be seen in the IR).

Does your backend maintain multiple stacks (e.g. some kind of "global stack" and a "local stack")? Does this depend on some non-recursion analysis? [I'm really curious what you're thinking here - not only does it affect this patch, but it also potentially affects discussions I've been having about OpenMP accelerator semantics].

Not really. Local memory doesn't behave like a stack and is just a block of memory allocated for the workgroup for the entire program. Accessing private memory is much slower, so in some cases for spills and small stack objects we could potentially optimize by writing them there instead. We don't support calls currently, and OpenCL explicitly forbids recursion so I haven't spent much time worrying about that.

Makes sense ;)

To come back to your original question, I'd think that you'd want to extend DataLayout to support per-address-space alignments in that case, just like we have per-address-space pointer sizes. You'd want this at the IR-level too.

Makes sense ;)

To come back to your original question, I'd think that you'd want to extend DataLayout to support per-address-space alignments in that case, just like we have per-address-space pointer sizes. You'd want this at the IR-level too.

I've written the patches to allow a lower preferred alignment, which wasn't difficult. I think the per-address-space alignment is overkill, particularly for the amount of benefit. Globals/allocas are assigned the correct ABI alignment, and it would always be possible to write an optimization pass to reduce the alignment when valid to do so. The one advantage I think the hook here has is that it allows handling of perverse cases that won't be explicitly listed in the datalayout (e.g. a 513 element vector).
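
A rough sketch of the 513-element case follows. It assumes that DataLayout falls back to natural alignment (total size rounded up to a power of two) for vector types that have no explicit datalayout entry. That fallback, and every helper name below, is an assumption for illustration rather than code from this revision.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Smallest power of two that is >= N (returns 1 for N == 0).
static uint64_t powerOf2Ceil(uint64_t N) {
  uint64_t P = 1;
  while (P < N)
    P <<= 1;
  return P;
}

// Assumed fallback for a vector type with no datalayout entry: the total
// size in bytes, rounded up to a power of two.
static uint64_t fallbackVectorAlign(uint64_t ElemBytes, uint64_t NumElems) {
  return powerOf2Ceil(ElemBytes * NumElems);
}

// Hook-style answer modelled on the AMDGPU override in this revision:
// 4 bytes unless the caller requires more.
static uint64_t hookAlign(uint64_t MinAlign = 1) {
  return std::max<uint64_t>(4, MinAlign);
}
```

With 4-byte elements, a 513-element vector occupies 2052 bytes, so the assumed fallback would ask for 4096-byte alignment, while the hook still answers 4. No datalayout entry needs to exist for the odd-sized type.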

arsenm mentioned this in D29810: [RFC] Allow datalayout to have preferred alignment < ABI. Feb 9 2017, 7:46 PM

arsenm mentioned this in D80370: [CodeGen] Ensure callers of CreateStackTemporary use sensible alignments. May 21 2020, 6:55 AM

Revision Contents

Path                                                              Size
include/llvm/Target/TargetLowering.h                              11 lines
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp                          11 lines
lib/CodeGen/SelectionDAG/SelectionDAG.cpp                         17 lines
lib/CodeGen/TargetLoweringBase.cpp                                18 lines
lib/Target/AMDGPU/AMDGPUISelLowering.h                            16 lines
test/CodeGen/AMDGPU/insert_vector_elt.ll                          9 lines
test/CodeGen/AMDGPU/local-stack-slot-bug.ll                       6 lines
test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll    4 lines
test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll            2 lines

Diff 85035

include/llvm/Target/TargetLowering.h

[... 187 lines not shown ...] protected:
   /// \brief Initialize all of the actions to default values.
   void initActions();

 public:
   const TargetMachine &getTargetMachine() const { return TM; }

   virtual bool useSoftFloat() const { return false; }

+  /// \returns the alignment that should be used for a temporary stack
+  /// slot.
+  virtual unsigned getStackTemporaryPreferredAlign(const DataLayout &DL,
+                                                   LLVMContext &Context,
+                                                   EVT VT,
+                                                   unsigned MinAlign = 1) const;
+  virtual unsigned getStackTemporaryPreferredAlign(const DataLayout &DL,
+                                                   LLVMContext &Context,
+                                                   EVT VT1,
+                                                   EVT VT2) const;
+
   /// Return the pointer type for the given address space, defaults to
   /// the pointer type from the data layout.
   /// FIXME: The default needs to be removed once all the code is updated.
   MVT getPointerTy(const DataLayout &DL, uint32_t AS = 0) const {
     return MVT::getIntegerVT(DL.getPointerSizeInBits(AS));
   }

   /// EVT is not used in-tree, but is used by out-of-tree target.
[... 2,989 lines not shown ...]

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

[... 11 lines not shown ...]
 //===----------------------------------------------------------------------===//

 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Triple.h"
 #include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineJumpTableInfo.h"
 #include "llvm/CodeGen/SelectionDAG.h"
 #include "llvm/CodeGen/SelectionDAGNodes.h"
 #include "llvm/IR/CallingConv.h"
 #include "llvm/IR/Constants.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/DebugInfo.h"
 #include "llvm/IR/DerivedTypes.h"
[... 1,600 lines not shown ...]

 /// Emit a store/load combination to the stack. This stores
 /// SrcOp to a stack slot of type SlotVT, truncating it if needed. It then does
 /// a load from the stack slot to DestVT, extending it if needed.
 /// The resultant code need not be legal.
 SDValue SelectionDAGLegalize::EmitStackConvert(SDValue SrcOp, EVT SlotVT,
                                                EVT DestVT, const SDLoc &dl) {
   // Create the stack frame object.
-  unsigned SrcAlign = DAG.getDataLayout().getPrefTypeAlignment(
-      SrcOp.getValueType().getTypeForEVT(*DAG.getContext()));
-  SDValue FIPtr = DAG.CreateStackTemporary(SlotVT, SrcAlign);
+  SDValue FIPtr = DAG.CreateStackTemporary(SrcOp.getValueType(), SlotVT);

   FrameIndexSDNode *StackPtrFI = cast<FrameIndexSDNode>(FIPtr);
   int SPFI = StackPtrFI->getIndex();
-  MachinePointerInfo PtrInfo =
-      MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SPFI);
+  MachineFunction &MF = DAG.getMachineFunction();
+  MachinePointerInfo PtrInfo = MachinePointerInfo::getFixedStack(MF, SPFI);
+  unsigned SrcAlign = MF.getFrameInfo().getObjectAlignment(SPFI);

   unsigned SrcSize = SrcOp.getValueSizeInBits();
   unsigned SlotSize = SlotVT.getSizeInBits();
   unsigned DestSize = DestVT.getSizeInBits();
   Type *DestType = DestVT.getTypeForEVT(*DAG.getContext());
   unsigned DestAlign = DAG.getDataLayout().getPrefTypeAlignment(DestType);

   // Emit a store to the stack slot. Use a truncstore if the input value is
[... 2,919 lines not shown ...]

lib/CodeGen/SelectionDAG/SelectionDAG.cpp

[... 1,810 lines not shown ...] SDValue SelectionDAG::expandVACopy(SDNode *Node) {
   const Value *VS = cast<SrcValueSDNode>(Node->getOperand(4))->getValue();
   SDValue Tmp1 =
       getLoad(TLI.getPointerTy(getDataLayout()), dl, Node->getOperand(0),
               Node->getOperand(2), MachinePointerInfo(VS));
   return getStore(Tmp1.getValue(1), dl, Tmp1, Node->getOperand(1),
                   MachinePointerInfo(VD));
 }

-SDValue SelectionDAG::CreateStackTemporary(EVT VT, unsigned minAlign) {
+SDValue SelectionDAG::CreateStackTemporary(EVT VT, unsigned MinAlign) {
   MachineFrameInfo &MFI = getMachineFunction().getFrameInfo();
-  unsigned ByteSize = VT.getStoreSize();
-  Type *Ty = VT.getTypeForEVT(*getContext());
-  unsigned StackAlign =
-      std::max((unsigned)getDataLayout().getPrefTypeAlignment(Ty), minAlign);
+  unsigned StackAlign
+    = TLI->getStackTemporaryPreferredAlign(getDataLayout(), *getContext(),
+                                           VT, MinAlign);

+  unsigned ByteSize = VT.getStoreSize();
   int FrameIdx = MFI.CreateStackObject(ByteSize, StackAlign, false);
   return getFrameIndex(FrameIdx, TLI->getPointerTy(getDataLayout()));
 }

 SDValue SelectionDAG::CreateStackTemporary(EVT VT1, EVT VT2) {
   unsigned Bytes = std::max(VT1.getStoreSize(), VT2.getStoreSize());
-  Type *Ty1 = VT1.getTypeForEVT(*getContext());
-  Type *Ty2 = VT2.getTypeForEVT(*getContext());
-  const DataLayout &DL = getDataLayout();
-  unsigned Align =
-      std::max(DL.getPrefTypeAlignment(Ty1), DL.getPrefTypeAlignment(Ty2));
+  unsigned Align = TLI->getStackTemporaryPreferredAlign(getDataLayout(),
+                                                        *getContext(), VT1, VT2);

   MachineFrameInfo &MFI = getMachineFunction().getFrameInfo();
   int FrameIdx = MFI.CreateStackObject(Bytes, Align, false);
   return getFrameIndex(FrameIdx, TLI->getPointerTy(getDataLayout()));
 }

 SDValue SelectionDAG::FoldSetCC(EVT VT, SDValue N1, SDValue N2,
                                 ISD::CondCode Cond, const SDLoc &dl) {
[... 5,738 lines not shown ...]

lib/CodeGen/TargetLoweringBase.cpp

[... 964 lines not shown ...] void TargetLoweringBase::initActions() {
   setOperationAction(ISD::TRAP, MVT::Other, Expand);

   // On most systems, DEBUGTRAP and TRAP have no difference. The "Expand"
   // here is to inform DAG Legalizer to replace DEBUGTRAP with TRAP.
   //
   setOperationAction(ISD::DEBUGTRAP, MVT::Other, Expand);
 }

+unsigned TargetLoweringBase::getStackTemporaryPreferredAlign(
+  const DataLayout &DL,
+  LLVMContext &Context,
+  EVT VT, unsigned MinAlign) const {
+  Type *Ty = VT.getTypeForEVT(Context);
+  return std::max((unsigned)DL.getPrefTypeAlignment(Ty), MinAlign);
+}
+
+unsigned TargetLoweringBase::getStackTemporaryPreferredAlign(const DataLayout &DL,
+                                                             LLVMContext &Context,
+                                                             EVT VT1,
+                                                             EVT VT2) const {
+  Type *Ty1 = VT1.getTypeForEVT(Context);
+  Type *Ty2 = VT2.getTypeForEVT(Context);
+
+  return std::max(DL.getPrefTypeAlignment(Ty1), DL.getPrefTypeAlignment(Ty2));
+}
+
 MVT TargetLoweringBase::getScalarShiftAmountTy(const DataLayout &DL,
                                                EVT) const {
   return MVT::getIntegerVT(8 * DL.getPointerSize(0));
 }

 EVT TargetLoweringBase::getShiftAmountTy(EVT LHSTy,
                                          const DataLayout &DL) const {
   assert(LHSTy.isInteger() && "Shift amount is not an integer type!");
[... 1,115 lines not shown ...]

lib/Target/AMDGPU/AMDGPUISelLowering.h

[... 113 lines not shown ...] protected:
   void AnalyzeFormalArguments(CCState &State,
                               const SmallVectorImpl<ISD::InputArg> &Ins) const;
   void AnalyzeReturn(CCState &State,
                      const SmallVectorImpl<ISD::OutputArg> &Outs) const;

 public:
   AMDGPUTargetLowering(const TargetMachine &TM, const AMDGPUSubtarget &STI);

+  // Any 4-byte aligned access is always legal, and stack objects are broken
+  // into accesses of 4-byte elements, so anything higher is just wasting space.
+  unsigned getStackTemporaryPreferredAlign(const DataLayout &DL,
+                                           LLVMContext &Context,
+                                           EVT VT,
+                                           unsigned MinAlign = 1) const override {
+    return std::max(4u, MinAlign);
+  }
+
+  unsigned getStackTemporaryPreferredAlign(const DataLayout &DL,
+                                           LLVMContext &Context,
+                                           EVT VT1,
+                                           EVT VT2) const override {
+    return 4;
+  }
+
   bool mayIgnoreSignedZero(SDValue Op) const {
     if (getTargetMachine().Options.UnsafeFPMath) // FIXME: nsz only
       return true;

     if (const auto *BO = dyn_cast<BinaryWithFlagsSDNode>(Op))
       return BO->Flags.hasNoSignedZeros();

     return false;
[... 221 lines not shown ...]

test/CodeGen/AMDGPU/insert_vector_elt.ll

[... 203 lines not shown ...]
 ; GCN-LABEL: {{^}}dynamic_insertelement_v4i16:
 ; GCN: buffer_load_ushort v{{[0-9]+}}, off
 ; GCN: buffer_load_ushort v{{[0-9]+}}, off
 ; GCN: buffer_load_ushort v{{[0-9]+}}, off
 ; GCN: buffer_load_ushort v{{[0-9]+}}, off

 ; GCN-DAG: v_mov_b32_e32 [[BASE_FI:v[0-9]+]], 0{{$}}
 ; GCN-DAG: s_and_b32 [[MASK_IDX:s[0-9]+]], s{{[0-9]+}}, 3{{$}}
-; GCN-DAG: v_or_b32_e32 [[IDX:v[0-9]+]], [[MASK_IDX]], [[BASE_FI]]{{$}}
+; GCN-DAG: v_add_i32_e32 [[IDX:v[0-9]+]], vcc, [[MASK_IDX]], [[BASE_FI]]{{$}}

 ; GCN-DAG: buffer_store_short v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:6
 ; GCN-DAG: buffer_store_short v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:4
 ; GCN-DAG: buffer_store_short v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offset:2
 ; GCN-DAG: buffer_store_short v{{[0-9]+}}, off, s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+$}}
 ; GCN: buffer_store_short v{{[0-9]+}}, [[IDX]], s{{\[[0-9]+:[0-9]+\]}}, s{{[0-9]+}} offen{{$}}

 ; GCN: s_waitcnt
[... 162 lines not shown ...]

 ; GCN-LABEL: {{^}}dynamic_insertelement_v3i64:
 define void @dynamic_insertelement_v3i64(<3 x i64> addrspace(1)* %out, <3 x i64> %a, i32 %b) nounwind {
   %vecins = insertelement <3 x i64> %a, i64 5, i32 %b
   store <3 x i64> %vecins, <3 x i64> addrspace(1)* %out, align 32
   ret void
 }

-; FIXME: Should be able to do without stack access. The used stack
-; space is also 2x what should be required.
+; FIXME: Should be able to do without stack access.

 ; GCN-LABEL: {{^}}dynamic_insertelement_v4f64:
 ; GCN: SCRATCH_RSRC_DWORD

 ; Stack store

 ; GCN-DAG: buffer_store_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}}{{$}}
 ; GCN-DAG: buffer_store_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offset:16{{$}}

 ; Write element
 ; GCN: buffer_store_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offen{{$}}

 ; Stack reload
 ; GCN-DAG: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offset:16{{$}}
 ; GCN-DAG: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}}{{$}}

 ; Store result
 ; GCN: buffer_store_dwordx4
 ; GCN: buffer_store_dwordx4
 ; GCN: s_endpgm
-; GCN: ScratchSize: 64
+; GCN: ScratchSize: 36

 define void @dynamic_insertelement_v4f64(<4 x double> addrspace(1)* %out, <4 x double> %a, i32 %b) nounwind {
   %vecins = insertelement <4 x double> %a, double 8.0, i32 %b
   store <4 x double> %vecins, <4 x double> addrspace(1)* %out, align 16
   ret void
 }

 ; GCN-LABEL: {{^}}dynamic_insertelement_v8f64:
[... 11 lines not shown ...]
 ; GCN-DAG: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offset:16{{$}}
 ; GCN-DAG: buffer_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, off, s{{\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}}{{$}}

 ; GCN: buffer_store_dwordx4
 ; GCN: buffer_store_dwordx4
 ; GCN: buffer_store_dwordx4
 ; GCN: buffer_store_dwordx4
 ; GCN: s_endpgm
-; GCN: ScratchSize: 128
+; GCN: ScratchSize: 68
 define void @dynamic_insertelement_v8f64(<8 x double> addrspace(1)* %out, <8 x double> %a, i32 %b) nounwind {
   %vecins = insertelement <8 x double> %a, double 8.0, i32 %b
   store <8 x double> %vecins, <8 x double> addrspace(1)* %out, align 16
   ret void
 }

 declare <4 x float> @llvm.SI.gather4.lz.v2i32(<2 x i32>, <8 x i32>, <4 x i32>, i32, i32, i32, i32, i32, i32, i32, i32) nounwind readnone

test/CodeGen/AMDGPU/local-stack-slot-bug.ll

 ; RUN: llc -march=amdgcn -mcpu=verde -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck %s
 ; RUN: llc -march=amdgcn -mcpu=tonga -mattr=+vgpr-spilling -verify-machineinstrs < %s | FileCheck %s

 ; This used to fail due to a v_add_i32 instruction with an illegal immediate
 ; operand that was created during Local Stack Slot Allocation. Test case derived
 ; from https://bugs.freedesktop.org/show_bug.cgi?id=96602
 ;
 ; CHECK-LABEL: {{^}}main:

-; CHECK-DAG: v_mov_b32_e32 [[K:v[0-9]+]], 0x200
-; CHECK-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0{{$}}
 ; CHECK-DAG: v_lshlrev_b32_e32 [[BYTES:v[0-9]+]], 2, v0
 ; CHECK-DAG: v_and_b32_e32 [[CLAMP_IDX:v[0-9]+]], 0x1fc, [[BYTES]]

 ; TODO: add 0?
-; CHECK-DAG: v_or_b32_e32 [[LO_OFF:v[0-9]+]], [[CLAMP_IDX]], [[ZERO]]
-; CHECK-DAG: v_or_b32_e32 [[HI_OFF:v[0-9]+]], [[CLAMP_IDX]], [[K]]
+; CHECK-DAG: v_add_i32_e32 [[LO_OFF:v[0-9]+]], vcc, 0, [[CLAMP_IDX]]
+; CHECK-DAG: v_add_i32_e32 [[HI_OFF:v[0-9]+]], vcc, 0x200, [[CLAMP_IDX]]

 ; CHECK: buffer_load_dword {{v[0-9]+}}, [[LO_OFF]], {{s\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offen
 ; CHECK: buffer_load_dword {{v[0-9]+}}, [[HI_OFF]], {{s\[[0-9]+:[0-9]+\]}}, {{s[0-9]+}} offen
 define amdgpu_ps float @main(i32 %idx) {
 main_body:
	%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx			%v1 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0xBFEA477C60000000, float 0xBFEBE5DC60000000, float 0xBFEC71C720000000, float 0xBFEBE5DC60000000, float 0xBFEA477C60000000, float 0xBFE7A693C0000000, float 0xBFE41CFEA0000000, float 0x3FDF9B13E0000000, float 0x3FDF9B1380000000, float 0x3FD5C53B80000000, float 0x3FD5C53B00000000, float 0x3FC6326AC0000000, float 0x3FC63269E0000000, float 0xBEE05CEB00000000, float 0xBEE086A320000000, float 0xBFC63269E0000000, float 0xBFC6326AC0000000, float 0xBFD5C53B80000000, float 0xBFD5C53B80000000, float 0xBFDF9B13E0000000, float 0xBFDF9B1460000000, float 0xBFE41CFE80000000, float 0x3FE7A693C0000000, float 0x3FEA477C20000000, float 0x3FEBE5DC40000000, float 0x3FEC71C6E0000000, float 0x3FEBE5DC40000000, float 0x3FEA477C20000000, float 0x3FE7A693C0000000, float 0xBFE41CFE80000000>, i32 %idx
	%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx			%v2 = extractelement <81 x float> <float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 
undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float undef, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFEA0000000, float 0xBFE7A693C0000000, float 0x3FE7A693C0000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFEBE5DC40000000, float 0x3FEBE5DC40000000, float 0xBFEC71C720000000, float 0x3FEC71C6E0000000, float 0xBFEBE5DC60000000, float 0x3FEBE5DC40000000, float 0xBFEA477C20000000, float 0x3FEA477C20000000, float 0xBFE7A693C0000000, float 0x3FE7A69380000000, float 0xBFE41CFEA0000000, float 0xBFDF9B13E0000000, float 0xBFD5C53B80000000, float 0xBFC6326AC0000000, float 0x3EE0789320000000, float 0x3FC6326AC0000000, float 0x3FD5C53B80000000, float 0x3FDF9B13E0000000, float 0x3FE41CFE80000000>, i32 %idx
	%r = fadd float %v1, %v2			%r = fadd float %v1, %v2
	ret float %r			ret float %r
	}			}

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll

	; intermediate register class copies.			; intermediate register class copies.

	; FIXME: The same register is initialized to 0 for every spill.			; FIXME: The same register is initialized to 0 for every spill.

	; GCN-LABEL: {{^}}spill_vgpr_compute:			; GCN-LABEL: {{^}}spill_vgpr_compute:

	; HSA: enable_sgpr_private_segment_buffer = 1			; HSA: enable_sgpr_private_segment_buffer = 1
	; HSA: enable_sgpr_flat_scratch_init = 0			; HSA: enable_sgpr_flat_scratch_init = 0
	; HSA: workitem_private_segment_byte_size = 1024			; HSA: workitem_private_segment_byte_size = 540

	; GCN-NOT: flat_scr			; GCN-NOT: flat_scr

	; GCNMESA-DAG: s_mov_b32 s16, s3			; GCNMESA-DAG: s_mov_b32 s16, s3
	; GCNMESA-DAG: s_mov_b32 s12, SCRATCH_RSRC_DWORD0			; GCNMESA-DAG: s_mov_b32 s12, SCRATCH_RSRC_DWORD0
	; GCNMESA--DAG: s_mov_b32 s13, SCRATCH_RSRC_DWORD1			; GCNMESA--DAG: s_mov_b32 s13, SCRATCH_RSRC_DWORD1
	; GCNMESA-DAG: s_mov_b32 s14, -1			; GCNMESA-DAG: s_mov_b32 s14, -1
	; SIMESA-DAG: s_mov_b32 s15, 0xe8f000			; SIMESA-DAG: s_mov_b32 s15, 0xe8f000
	; VIMESA-DAG: s_mov_b32 s15, 0xe80000			; VIMESA-DAG: s_mov_b32 s15, 0xe80000


	; GCN: buffer_store_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}} ; 4-byte Folded Spill			; GCN: buffer_store_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}} ; 4-byte Folded Spill

	; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_store_dword {{v[0-9]}}, off, s[12:15], s16 offset:{{[0-9]+}}

	; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}
	; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}			; GCN: buffer_load_dword {{v[0-9]+}}, off, s[12:15], s16 offset:{{[0-9]+}}

	; GCN: NumVgprs: 256			; GCN: NumVgprs: 256
	; GCN: ScratchSize: 1024			; GCN: ScratchSize: 540

	; s[0:3] input user SGPRs. s4,s5,s6 = workgroup IDs. s8 scratch offset.			; s[0:3] input user SGPRs. s4,s5,s6 = workgroup IDs. s8 scratch offset.
	define void @spill_vgpr_compute(<4 x float> %arg6, float addrspace(1)* %arg, i32 %arg1, i32 %arg2, float %arg3, float %arg4, float %arg5) #0 {			define void @spill_vgpr_compute(<4 x float> %arg6, float addrspace(1)* %arg, i32 %arg1, i32 %arg2, float %arg3, float %arg4, float %arg5) #0 {
	bb:			bb:
	%tmp = add i32 %arg1, %arg2			%tmp = add i32 %arg1, %arg2
	%tmp7 = extractelement <4 x float> %arg6, i32 0			%tmp7 = extractelement <4 x float> %arg6, i32 0
	%tmp8 = extractelement <4 x float> %arg6, i32 1			%tmp8 = extractelement <4 x float> %arg6, i32 1
	%tmp9 = extractelement <4 x float> %arg6, i32 2			%tmp9 = extractelement <4 x float> %arg6, i32 2

test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll

	; SI-DAG: s_mov_b32 s15, 0xe8f000			; SI-DAG: s_mov_b32 s15, 0xe8f000
	; VI-DAG: s_mov_b32 s15, 0xe80000			; VI-DAG: s_mov_b32 s15, 0xe80000

	; s11 is offset system SGPR			; s11 is offset system SGPR
	; GCN: buffer_store_dword {{v[0-9]+}}, off, s[12:15], s11 offset:{{[0-9]+}} ; 4-byte Folded Spill			; GCN: buffer_store_dword {{v[0-9]+}}, off, s[12:15], s11 offset:{{[0-9]+}} ; 4-byte Folded Spill
	; GCN: buffer_load_dword v{{[0-9]+}}, off, s[12:15], s11 offset:{{[0-9]+}} ; 4-byte Folded Reload			; GCN: buffer_load_dword v{{[0-9]+}}, off, s[12:15], s11 offset:{{[0-9]+}} ; 4-byte Folded Reload

	; GCN: NumVgprs: 256			; GCN: NumVgprs: 256
	; GCN: ScratchSize: 1024			; GCN: ScratchSize: 536

	define amdgpu_vs void @main([9 x <16 x i8>] addrspace(2)* byval %arg, [17 x <16 x i8>] addrspace(2)* byval %arg1, [17 x <4 x i32>] addrspace(2)* byval %arg2, [34 x <8 x i32>] addrspace(2)* byval %arg3, [16 x <16 x i8>] addrspace(2)* byval %arg4, i32 inreg %arg5, i32 inreg %arg6, i32 %arg7, i32 %arg8, i32 %arg9, i32 %arg10) #0 {			define amdgpu_vs void @main([9 x <16 x i8>] addrspace(2)* byval %arg, [17 x <16 x i8>] addrspace(2)* byval %arg1, [17 x <4 x i32>] addrspace(2)* byval %arg2, [34 x <8 x i32>] addrspace(2)* byval %arg3, [16 x <16 x i8>] addrspace(2)* byval %arg4, i32 inreg %arg5, i32 inreg %arg6, i32 %arg7, i32 %arg8, i32 %arg9, i32 %arg10) #0 {
	bb:			bb:
	%tmp = getelementptr [17 x <16 x i8>], [17 x <16 x i8>] addrspace(2)* %arg1, i64 0, i64 0			%tmp = getelementptr [17 x <16 x i8>], [17 x <16 x i8>] addrspace(2)* %arg1, i64 0, i64 0
	%tmp11 = load <16 x i8>, <16 x i8> addrspace(2)* %tmp, align 16, !tbaa !0			%tmp11 = load <16 x i8>, <16 x i8> addrspace(2)* %tmp, align 16, !tbaa !0
	%tmp12 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 0)			%tmp12 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 0)
	%tmp13 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 16)			%tmp13 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 16)
	%tmp14 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 32)			%tmp14 = call float @llvm.SI.load.const(<16 x i8> %tmp11, i32 32)