This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Add support for llvm.r600.local.size.* intrinsics when targeting HSA
Abandoned · Public

Authored by • tstellarAMD on Aug 28 2015, 3:33 PM.


Details

Reviewers

Summary

For HSA, these values are stored in the AQL structure, which can be accessed
via the dispatch pointer, which is loaded into user SGPRs. For
simplicity, we currently load the dispatch pointer for all shaders,
even when it isn't used.
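
To make the layout concrete, here is a minimal host-side sketch of the leading fields of an AQL kernel dispatch packet. The struct `DispatchPacketPrefix`, its field names, and the helper `localSizeOffset` are illustrative assumptions modeled on the HSA dispatch packet layout; they are not code from this patch. The checks only confirm that the three 16-bit work-group sizes land at byte offsets 4, 6, and 8, which matches the `DispatchPacketOffset` constants in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the first 12 bytes of an AQL kernel dispatch packet.
// Field names are assumptions for illustration only.
struct DispatchPacketPrefix {
  std::uint16_t header;
  std::uint16_t setup;
  std::uint16_t workgroup_size_x; // local size x
  std::uint16_t workgroup_size_y; // local size y
  std::uint16_t workgroup_size_z; // local size z
  std::uint16_t reserved0;        // assumed to be zero
};

// The offsets the patch hard-codes as LOCAL_SIZE_X/Y/Z.
static_assert(offsetof(DispatchPacketPrefix, workgroup_size_x) == 4, "x at 4");
static_assert(offsetof(DispatchPacketPrefix, workgroup_size_y) == 6, "y at 6");
static_assert(offsetof(DispatchPacketPrefix, workgroup_size_z) == 8, "z at 8");

// Byte offset of the local size for dimension dim (0 = x, 1 = y, 2 = z).
inline std::size_t localSizeOffset(unsigned dim) {
  assert(dim < 3 && "dim must be 0 (x), 1 (y), or 2 (z)");
  return offsetof(DispatchPacketPrefix, workgroup_size_x) +
         dim * sizeof(std::uint16_t);
}
```

Because every field before the sizes is 16 bits wide, the struct has no padding and the offsets follow directly from the field order.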

Diff Detail

Event Timeline

• tstellarAMD updated this revision to Diff 33482. Aug 28 2015, 3:33 PM

• tstellarAMD retitled this revision to AMDGPU/SI: Add support for llvm.r600.local.size.* intrinsics when targeting HSA.

• tstellarAMD updated this object.

• tstellarAMD added a reviewer: arsenm.

• tstellarAMD added a subscriber: llvm-commits.

Herald added a subscriber: arsenm. Aug 28 2015, 3:33 PM

arsenm added inline comments. Aug 28 2015, 3:44 PM

lib/Target/AMDGPU/SIISelLowering.cpp
1044	Why is this an i16? We really don't want to have to do an argument extload, although from the tests it looks like that doesn't happen.

• tstellarAMD added inline comments. Aug 28 2015, 5:15 PM

lib/Target/AMDGPU/SIISelLowering.cpp
1044	The extload isn't happening because LowerParameter only does extloads for floating-point types; I can fix that. The local size values are stored in memory as i16 values. We could use a 32-bit non-ext load for the z value, since the 16 bits after the z value will always be 0. For x and y, we always load both and then mask or shift to get the value we need. I'm not sure whether a 32-bit load plus a mask or shift is faster than a 16-bit ext load.

LGTM

lib/Target/AMDGPU/SIISelLowering.cpp
1044	The 32-bit load and mask will definitely be better because there are no scalar ext loads.
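
The trade-off discussed in this thread can be sketched on the host. Assume x and y share one 32-bit word (x in the low half, y in the high half) and z occupies the low half of the next word, whose upper half is zero, as stated above. Each 16-bit value then falls out of a plain 32-bit load with a mask or a shift. The names `extractLocalSize` and `packXY` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// word0: x in bits [15:0], y in bits [31:16].
// word1: z in bits [15:0]; bits [31:16] are assumed to be zero.
inline std::uint32_t extractLocalSize(std::uint32_t word0, std::uint32_t word1,
                                      unsigned dim) {
  assert(dim < 3 && "dim must be 0 (x), 1 (y), or 2 (z)");
  switch (dim) {
  case 0:
    return word0 & 0xffffu; // clear the high half that holds y
  case 1:
    return word0 >> 16;     // logical shift brings y down
  default:
    return word1;           // upper half already zero, nothing to do
  }
}

// Helper for building test input: packs x and y the way they sit in memory
// on a little-endian target.
inline std::uint32_t packXY(std::uint16_t x, std::uint16_t y) {
  return static_cast<std::uint32_t>(x) | (static_cast<std::uint32_t>(y) << 16);
}
```

For example, `extractLocalSize(packXY(64, 4), 2, 1)` yields the y size of 4 without any sub-dword load.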

This revision is now accepted and ready to land. Aug 28 2015, 5:23 PM

arsenm added inline comments. Oct 2 2015, 3:18 PM

lib/Target/AMDGPU/SIISelLowering.cpp
1043	This should be Dim * 2 since this is an i16.
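
A small sketch of the offset arithmetic behind this comment: consecutive i16 fields are 2 bytes apart, so dimension Dim lives at base + Dim * 2. If the value is instead fetched with an aligned 32-bit load, as the later diff on this page does with `4 * (Dim >> 1)`, the dword offset and the bit position inside that dword follow from the same numbers. The base of 4 is taken from `LOCAL_SIZE_X` in the diff; the helper names are hypothetical.

```cpp
// Byte offset of local size x within the dispatch packet (LOCAL_SIZE_X = 4
// in the diff on this page).
constexpr unsigned kLocalSizeXOffset = 4;

// Offset of the 16-bit field itself: i16 elements are 2 bytes apart.
inline unsigned fieldByteOffset(unsigned dim) {
  return kLocalSizeXOffset + dim * 2;
}

// Offset of the aligned 32-bit word that contains the field.
inline unsigned dwordByteOffset(unsigned dim) {
  return kLocalSizeXOffset + 4 * (dim >> 1);
}

// Bit position of the field inside that 32-bit word (little-endian).
inline unsigned shiftInDword(unsigned dim) {
  return 16 * (dim & 1);
}
```

The two views agree: for every dimension, `dwordByteOffset(dim) + shiftInDword(dim) / 8` equals `fieldByteOffset(dim)`.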

Added a fix for the offset bug, and added mask/shift to avoid using extloads.
Unfortunately, the DAGCombine still emits an extload for the local.size.x case.
We really need address space arguments for isExtLoadLegal().

arsenm added inline comments. Oct 5 2015, 11:37 AM

lib/Target/AMDGPU/SIISelLowering.cpp
476–477	I think Signed should stay the last parameter
1052–1053	The cases aren't supposed to be indented
1059–1060	Why do you want to use SRA? I would expect SRL to be preferable
test/CodeGen/AMDGPU/work-item-intrinsics.ll
102–104	For this case specifically, it already should not happen. I added shouldReduceLoadWidth to stop producing < 32-bit type extloads. Is there somewhere missing this?
113	There should be tests that use all the combinations of local.size.x/y/z
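
On the SRA versus SRL question raised above, a host-side sketch may help. The two shifts agree whenever bit 31 of the input is clear and differ only when it is set, which is why SRA happens to work for small local sizes while SRL states the intent directly. The functions `srl` and `sra` below are illustrative models of the two DAG opcodes on 32-bit values, written with unsigned arithmetic so the behavior does not depend on how the compiler shifts negative signed integers.

```cpp
#include <cassert>
#include <cstdint>

// Logical shift right: vacated high bits are filled with zeros.
inline std::uint32_t srl(std::uint32_t v, unsigned n) {
  assert(n > 0 && n < 32);
  return v >> n;
}

// Arithmetic shift right: vacated high bits are filled with copies of bit 31.
inline std::uint32_t sra(std::uint32_t v, unsigned n) {
  assert(n > 0 && n < 32);
  std::uint32_t shifted = v >> n;
  // Mask covering the top n bits, applied only when the sign bit is set.
  std::uint32_t signFill = (v & 0x80000000u) ? ~(0xffffffffu >> n) : 0u;
  return shifted | signFill;
}
```

With a packed word such as `0x01000040` (y = 256, x = 64), both shifts by 16 give 256. With bit 31 set, as in `0x80000000`, `srl` gives `0x00008000` while `sra` gives `0xffff8000`.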

Added more test cases and fixed some coding style issues.

• tstellarAMD added a parent revision: D13805: DAGCombiner: Check shouldReduceLoadWidth before combining (and (load), x) -> extload. Oct 16 2015, 6:14 AM

LGTM

lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp
521	We should be able to detect if this will be needed from the IR, but I'm not sure where to put that
lib/Target/AMDGPU/SIISelLowering.cpp
632–635	I'm not sure why we have getPhysRegSubReg when getSubReg already exists. I'm already working on fixing this in other patches though

I was thinking about this, and I think it would be better if we added an intrinsic for the dispatch pointer and put the complexity of deciding what offset to read in the library. It will make it easier to detect when the dispatch pointer is a necessary input.

This has been implemented using the llvm.amdgcn.dispatch.ptr intrinsic instead.
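
As a rough illustration of the library-side approach described above, the sketch below takes a raw dispatch pointer and lets the library pick the offset. It runs on the host against an ordinary byte buffer. `readLocalSize` is a hypothetical helper, the offsets 4, 6, and 8 come from the `DispatchPacketOffset` constants in the diff, and none of this is the actual library code or the intrinsic itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads the 16-bit local size for dimension dim (0 = x, 1 = y, 2 = z) from a
// dispatch packet and zero-extends it to 32 bits. dispatchPtr stands in for
// the value an intrinsic such as llvm.amdgcn.dispatch.ptr would return.
inline std::uint32_t readLocalSize(const void *dispatchPtr, unsigned dim) {
  assert(dispatchPtr != nullptr && dim < 3);
  const unsigned char *bytes = static_cast<const unsigned char *>(dispatchPtr);
  std::uint16_t value;
  // memcpy avoids alignment and aliasing problems on the host.
  std::memcpy(&value, bytes + 4 + 2 * dim, sizeof(value));
  return value;
}
```

With the offset logic in a helper like this, the compiler only needs to know whether the dispatch pointer is requested at all, which is the detection problem raised in the review.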

Revision Contents

Path	Size
lib/Target/AMDGPU/ (six files)	1 line, 5 lines, 78 lines, 11 lines, 1 line, 3 lines
test/CodeGen/AMDGPU/work-item-intrinsics.ll	83 lines

Diff 36525

lib/Target/AMDGPU/AMDGPUAsmPrinter.cpp

[512 lines elided]	void AMDGPUAsmPrinter::EmitAmdKernelCodeT(const MachineFunction &MF,
amd_kernel_code_t header;		amd_kernel_code_t header;

AMDGPU::initDefaultAMDKernelCodeT(header, STM.getFeatureBits());		AMDGPU::initDefaultAMDKernelCodeT(header, STM.getFeatureBits());

header.compute_pgm_resource_registers =		header.compute_pgm_resource_registers =
KernelInfo.ComputePGMRSrc1 |		KernelInfo.ComputePGMRSrc1 |
(KernelInfo.ComputePGMRSrc2 << 32);		(KernelInfo.ComputePGMRSrc2 << 32);
header.code_properties =		header.code_properties =
		AMD_CODE_PROPERTY_ENABLE_SGPR_DISPATCH_PTR |
		arsenm (inline comment): We should be able to detect if this will be needed from the IR, but I'm not sure where to put that
AMD_CODE_PROPERTY_ENABLE_SGPR_KERNARG_SEGMENT_PTR |		AMD_CODE_PROPERTY_ENABLE_SGPR_KERNARG_SEGMENT_PTR |
AMD_CODE_PROPERTY_IS_PTR64;		AMD_CODE_PROPERTY_IS_PTR64;

header.kernarg_segment_byte_size = MFI->ABIArgOffset;		header.kernarg_segment_byte_size = MFI->ABIArgOffset;
header.wavefront_sgpr_count = KernelInfo.NumSGPR;		header.wavefront_sgpr_count = KernelInfo.NumSGPR;
header.workitem_vgpr_count = KernelInfo.NumVGPR;		header.workitem_vgpr_count = KernelInfo.NumVGPR;


[25 lines elided]

lib/Target/AMDGPU/SIISelLowering.h

	[16 lines elided]

	#include "AMDGPUISelLowering.h"			#include "AMDGPUISelLowering.h"
	#include "SIInstrInfo.h"			#include "SIInstrInfo.h"

	namespace llvm {			namespace llvm {

	class SITargetLowering : public AMDGPUTargetLowering {			class SITargetLowering : public AMDGPUTargetLowering {
	SDValue LowerParameter(SelectionDAG &DAG, EVT VT, EVT MemVT, SDLoc DL,			SDValue LowerParameter(SelectionDAG &DAG, EVT VT, EVT MemVT, SDLoc DL,
	SDValue Chain, unsigned Offset, bool Signed) const;			SDValue Chain, unsigned Offset, bool Signed,
				unsigned BasePtrReg = 0) const;
	SDValue LowerSampleIntrinsic(unsigned Opcode, const SDValue &Op,			SDValue LowerSampleIntrinsic(unsigned Opcode, const SDValue &Op,
	SelectionDAG &DAG) const;			SelectionDAG &DAG) const;
	SDValue LowerGlobalAddress(AMDGPUMachineFunction *MFI, SDValue Op,			SDValue LowerGlobalAddress(AMDGPUMachineFunction *MFI, SDValue Op,
	SelectionDAG &DAG) const override;			SelectionDAG &DAG) const override;

				SDValue LowerLocalSizeIntrinsic(SelectionDAG &DAG, SDLoc DL,
				unsigned Dim) const;
	SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerINTRINSIC_VOID(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerINTRINSIC_VOID(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFrameIndex(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerFrameIndex(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerSELECT(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerSELECT(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFastFDIV(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerFastFDIV(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFDIV32(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerFDIV32(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFDIV64(SDValue Op, SelectionDAG &DAG) const;			SDValue LowerFDIV64(SDValue Op, SelectionDAG &DAG) const;
	[90 lines elided]

lib/Target/AMDGPU/SIISelLowering.cpp

[13 lines elided]

#ifdef _MSC_VER		#ifdef _MSC_VER
// Provide M_PI.		// Provide M_PI.
#define _USE_MATH_DEFINES		#define _USE_MATH_DEFINES
#include <cmath>		#include <cmath>
#endif		#endif

#include "SIISelLowering.h"		#include "SIISelLowering.h"
		#include "SIInstrInfo.h"
#include "AMDGPU.h"		#include "AMDGPU.h"
#include "AMDGPUIntrinsicInfo.h"		#include "AMDGPUIntrinsicInfo.h"
#include "AMDGPUSubtarget.h"		#include "AMDGPUSubtarget.h"
#include "SIInstrInfo.h"		#include "SIInstrInfo.h"
#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
#include "llvm/ADT/BitVector.h"		#include "llvm/ADT/BitVector.h"
#include "llvm/CodeGen/CallingConvLower.h"		#include "llvm/CodeGen/CallingConvLower.h"
[437 lines elided]
static EVT toIntegerVT(EVT VT) {		static EVT toIntegerVT(EVT VT) {
if (VT.isVector())		if (VT.isVector())
return VT.changeVectorElementTypeToInteger();		return VT.changeVectorElementTypeToInteger();
return MVT::getIntegerVT(VT.getSizeInBits());		return MVT::getIntegerVT(VT.getSizeInBits());
}		}

SDValue SITargetLowering::LowerParameter(SelectionDAG &DAG, EVT VT, EVT MemVT,		SDValue SITargetLowering::LowerParameter(SelectionDAG &DAG, EVT VT, EVT MemVT,
SDLoc SL, SDValue Chain,		SDLoc SL, SDValue Chain,
unsigned Offset, bool Signed) const {		unsigned Offset, bool Signed,
		unsigned BasePtrReg) const {
		arsenm (inline comment): I think Signed should stay the last parameter
const DataLayout &DL = DAG.getDataLayout();		const DataLayout &DL = DAG.getDataLayout();
MachineFunction &MF = DAG.getMachineFunction();		MachineFunction &MF = DAG.getMachineFunction();
const SIRegisterInfo *TRI =		const SIRegisterInfo *TRI =
static_cast<const SIRegisterInfo*>(Subtarget->getRegisterInfo());		static_cast<const SIRegisterInfo*>(Subtarget->getRegisterInfo());
unsigned InputPtrReg = TRI->getPreloadedValue(MF, SIRegisterInfo::INPUT_PTR);
		if (!BasePtrReg)
		BasePtrReg = TRI->getPreloadedValue(MF, SIRegisterInfo::INPUT_PTR);

Type Ty = VT.getTypeForEVT(DAG.getContext());		Type Ty = VT.getTypeForEVT(DAG.getContext());

MachineRegisterInfo &MRI = DAG.getMachineFunction().getRegInfo();		MachineRegisterInfo &MRI = DAG.getMachineFunction().getRegInfo();
MVT PtrVT = getPointerTy(DL, AMDGPUAS::CONSTANT_ADDRESS);		MVT PtrVT = getPointerTy(DL, AMDGPUAS::CONSTANT_ADDRESS);
PointerType *PtrTy = PointerType::get(Ty, AMDGPUAS::CONSTANT_ADDRESS);		PointerType *PtrTy = PointerType::get(Ty, AMDGPUAS::CONSTANT_ADDRESS);
SDValue BasePtr = DAG.getCopyFromReg(Chain, SL,		SDValue BasePtr = DAG.getCopyFromReg(Chain, SL,
MRI.getLiveInVirtReg(InputPtrReg), PtrVT);		MRI.getLiveInVirtReg(BasePtrReg), PtrVT);
SDValue Ptr = DAG.getNode(ISD::ADD, SL, PtrVT, BasePtr,		SDValue Ptr = DAG.getNode(ISD::ADD, SL, PtrVT, BasePtr,
DAG.getConstant(Offset, SL, PtrVT));		DAG.getConstant(Offset, SL, PtrVT));
SDValue PtrOffset = DAG.getUNDEF(PtrVT);		SDValue PtrOffset = DAG.getUNDEF(PtrVT);
MachinePointerInfo PtrInfo(UndefValue::get(PtrTy));		MachinePointerInfo PtrInfo(UndefValue::get(PtrTy));

unsigned Align = DL.getABITypeAlignment(Ty);		unsigned Align = DL.getABITypeAlignment(Ty);

if (VT != MemVT && VT.isFloatingPoint()) {		if (VT != MemVT && VT.isFloatingPoint()) {
[96 lines elided]	if (Info->getShaderType() == ShaderType::PIXEL &&
CCInfo.AllocateReg(AMDGPU::VGPR0);		CCInfo.AllocateReg(AMDGPU::VGPR0);
CCInfo.AllocateReg(AMDGPU::VGPR1);		CCInfo.AllocateReg(AMDGPU::VGPR1);
}		}

// The pointer to the list of arguments is stored in SGPR0, SGPR1		// The pointer to the list of arguments is stored in SGPR0, SGPR1
// The pointer to the scratch buffer is stored in SGPR2, SGPR3		// The pointer to the scratch buffer is stored in SGPR2, SGPR3
if (Info->getShaderType() == ShaderType::COMPUTE) {		if (Info->getShaderType() == ShaderType::COMPUTE) {
if (Subtarget->isAmdHsaOS())		if (Subtarget->isAmdHsaOS())
Info->NumUserSGPRs = 2; // FIXME: Need to support scratch buffers.		Info->NumUserSGPRs = 4; // FIXME: Need to support scratch buffers.
else		else
Info->NumUserSGPRs = 4;		Info->NumUserSGPRs = 4;

unsigned InputPtrReg =		unsigned InputPtrReg =
TRI->getPreloadedValue(MF, SIRegisterInfo::INPUT_PTR);		TRI->getPreloadedValue(MF, SIRegisterInfo::INPUT_PTR);
unsigned InputPtrRegLo =		unsigned InputPtrRegLo =
TRI->getPhysRegSubReg(InputPtrReg, &AMDGPU::SReg_32RegClass, 0);		TRI->getPhysRegSubReg(InputPtrReg, &AMDGPU::SReg_32RegClass, 0);
unsigned InputPtrRegHi =		unsigned InputPtrRegHi =
TRI->getPhysRegSubReg(InputPtrReg, &AMDGPU::SReg_32RegClass, 1);		TRI->getPhysRegSubReg(InputPtrReg, &AMDGPU::SReg_32RegClass, 1);

unsigned ScratchPtrReg =		unsigned ScratchPtrReg =
TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_PTR);		TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_PTR);
unsigned ScratchPtrRegLo =		unsigned ScratchPtrRegLo =
TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 0);		TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 0);
unsigned ScratchPtrRegHi =		unsigned ScratchPtrRegHi =
TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 1);		TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 1);

CCInfo.AllocateReg(InputPtrRegLo);		CCInfo.AllocateReg(InputPtrRegLo);
CCInfo.AllocateReg(InputPtrRegHi);		CCInfo.AllocateReg(InputPtrRegHi);
CCInfo.AllocateReg(ScratchPtrRegLo);		CCInfo.AllocateReg(ScratchPtrRegLo);
CCInfo.AllocateReg(ScratchPtrRegHi);		CCInfo.AllocateReg(ScratchPtrRegHi);
MF.addLiveIn(InputPtrReg, &AMDGPU::SReg_64RegClass);		MF.addLiveIn(InputPtrReg, &AMDGPU::SReg_64RegClass);
MF.addLiveIn(ScratchPtrReg, &AMDGPU::SReg_64RegClass);		MF.addLiveIn(ScratchPtrReg, &AMDGPU::SReg_64RegClass);
		if (Subtarget->isAmdHsaOS()) {
		unsigned DispatchPtrReg =
		TRI->getPreloadedValue(MF, SIRegisterInfo::DISPATCH_PTR);
		unsigned DispatchPtrRegLo =
		TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 0);
		unsigned DispatchPtrRegHi =
		TRI->getPhysRegSubReg(ScratchPtrReg, &AMDGPU::SReg_32RegClass, 1);
		arsenm (inline comment): I'm not sure why we have getPhysRegSubReg when getSubReg already exists. I'm already working on fixing this in other patches though
		MF.addLiveIn(DispatchPtrReg, &AMDGPU::SReg_64RegClass);
		}
}		}

if (Info->getShaderType() == ShaderType::COMPUTE) {		if (Info->getShaderType() == ShaderType::COMPUTE) {
getOriginalFunctionArgs(DAG, DAG.getMachineFunction().getFunction(), Ins,		getOriginalFunctionArgs(DAG, DAG.getMachineFunction().getFunction(), Ins,
Splits);		Splits);
}		}

AnalyzeFormalArguments(CCInfo, Splits);		AnalyzeFormalArguments(CCInfo, Splits);
[377 lines elided]	SDValue SITargetLowering::copyToM0(SelectionDAG &DAG, SDValue Chain, SDLoc DL,
// instructions and the register coalescer eliminate the extra copies.		// instructions and the register coalescer eliminate the extra copies.
SDNode *M0 = DAG.getMachineNode(AMDGPU::S_MOV_B32, DL, V.getValueType(), V);		SDNode *M0 = DAG.getMachineNode(AMDGPU::S_MOV_B32, DL, V.getValueType(), V);
return DAG.getCopyToReg(Chain, DL, DAG.getRegister(AMDGPU::M0, MVT::i32),		return DAG.getCopyToReg(Chain, DL, DAG.getRegister(AMDGPU::M0, MVT::i32),
SDValue(M0, 0), SDValue()); // Glue		SDValue(M0, 0), SDValue()); // Glue
// A Null SDValue creates		// A Null SDValue creates
// a glue result.		// a glue result.
}		}

		SDValue SITargetLowering::LowerLocalSizeIntrinsic(SelectionDAG &DAG,
		SDLoc DL,
		unsigned Dim) const {
		MachineFunction &MF = DAG.getMachineFunction();
		const SIRegisterInfo *TRI =
		static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());

		unsigned Offset;
		unsigned BasePtr;
		EVT MemVT;
		if (Subtarget->isAmdHsaOS()) {
		BasePtr = TRI->getPreloadedValue(MF, SIRegisterInfo::DISPATCH_PTR);

		arsenm (inline comment): This should be Dim * 2 since this is an i16.
		// Local size value are 16-bits, but we always load 32-bit values and
		arsenm (inline comment): Why is this an i16? We really don't want to have to do an argument extload, although from the tests it looks like that doesn't happen.
		tstellarAMD (inline comment): The extload isn't happening because LowerParameter only does extloads for floating-point types; I can fix that. The local size values are stored in memory as i16 values. We could use a 32-bit non-ext load for the z value, since the 16 bits after the z value will always be 0. For x and y, we always load both and then mask or shift to get the value we need. I'm not sure whether a 32-bit load plus a mask or shift is faster than a 16-bit ext load.
		arsenm (inline comment): The 32-bit load and mask will definitely be better because there are no scalar ext loads.
		// then mask or shift to get the correct value. This allows use to
		// load the data with SMRD instructions which is faster than using
		// MUBUF instructions.
		Offset = SI::DispatchPacketOffset::LOCAL_SIZE_X + (4 * (Dim >> 1));
		SDValue Param = LowerParameter(DAG, MVT::i32, MVT::i32, DL, DAG.getEntryNode(),
		Offset, false, BasePtr);

		switch (Dim) {
		case 0:
		arsenm (inline comment): The cases aren't supposed to be indented
		// Clear the high bits.
		Param = DAG.getNode(ISD::AND, DL, MVT::i32, Param,
		DAG.getConstant(0xffff, DL, MVT::i32));
		break;
		case 1:
		// Get local size y from the high bits. We can use SRA here, because
		// the max value range is 0-256, so the sign bit will always be zero.
		arsenm (inline comment): Why do you want to use SRA? I would expect SRL to be preferable
		Param = DAG.getNode(ISD::SRA, DL, MVT::i32, Param,
		DAG.getConstant(16, DL, MVT::i32));
		break;
		case 2:
		// Do nothing, the 16-bits after the z dimension size are always
		// zero, so we don't need to clear them.
		break;
		}
		return Param;
		}

		BasePtr = TRI->getPreloadedValue(MF, SIRegisterInfo::INPUT_PTR);
		Offset = SI::KernelInputOffsets::LOCAL_SIZE_X + (Dim * 4);

		return LowerParameter(DAG, MVT::i32, MVT::i32, DL, DAG.getEntryNode(),
		Offset, false, BasePtr);
		}

SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,		SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
MachineFunction &MF = DAG.getMachineFunction();		MachineFunction &MF = DAG.getMachineFunction();
auto MFI = MF.getInfo<SIMachineFunctionInfo>();		auto MFI = MF.getInfo<SIMachineFunctionInfo>();
const SIRegisterInfo *TRI =		const SIRegisterInfo *TRI =
static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());		static_cast<const SIRegisterInfo *>(Subtarget->getRegisterInfo());

EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
[15 lines elided]	return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),
SI::KernelInputOffsets::GLOBAL_SIZE_X, false);		SI::KernelInputOffsets::GLOBAL_SIZE_X, false);
case Intrinsic::r600_read_global_size_y:		case Intrinsic::r600_read_global_size_y:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),
SI::KernelInputOffsets::GLOBAL_SIZE_Y, false);		SI::KernelInputOffsets::GLOBAL_SIZE_Y, false);
case Intrinsic::r600_read_global_size_z:		case Intrinsic::r600_read_global_size_z:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),
SI::KernelInputOffsets::GLOBAL_SIZE_Z, false);		SI::KernelInputOffsets::GLOBAL_SIZE_Z, false);
case Intrinsic::r600_read_local_size_x:		case Intrinsic::r600_read_local_size_x:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerLocalSizeIntrinsic(DAG, DL, 0);
SI::KernelInputOffsets::LOCAL_SIZE_X, false);
case Intrinsic::r600_read_local_size_y:		case Intrinsic::r600_read_local_size_y:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerLocalSizeIntrinsic(DAG, DL, 1);
SI::KernelInputOffsets::LOCAL_SIZE_Y, false);
case Intrinsic::r600_read_local_size_z:		case Intrinsic::r600_read_local_size_z:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerLocalSizeIntrinsic(DAG, DL, 2);
SI::KernelInputOffsets::LOCAL_SIZE_Z, false);

case Intrinsic::AMDGPU_read_workdim:		case Intrinsic::AMDGPU_read_workdim:
return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),		return LowerParameter(DAG, VT, VT, DL, DAG.getEntryNode(),
getImplicitParameterOffset(MFI, GRID_DIM), false);		getImplicitParameterOffset(MFI, GRID_DIM), false);

case Intrinsic::r600_read_tgid_x:		case Intrinsic::r600_read_tgid_x:
return CreateLiveInRegister(DAG, &AMDGPU::SReg_32RegClass,		return CreateLiveInRegister(DAG, &AMDGPU::SReg_32RegClass,
TRI->getPreloadedValue(MF, SIRegisterInfo::TGID_X), VT);		TRI->getPreloadedValue(MF, SIRegisterInfo::TGID_X), VT);
[1,279 lines elided]

lib/Target/AMDGPU/SIInstrInfo.h

[357 lines elided]	namespace AMDGPU {
int getAtomicNoRetOp(uint16_t Opcode);		int getAtomicNoRetOp(uint16_t Opcode);

const uint64_t RSRC_DATA_FORMAT = 0xf00000000000LL;		const uint64_t RSRC_DATA_FORMAT = 0xf00000000000LL;
const uint64_t RSRC_TID_ENABLE = 1LL << 55;		const uint64_t RSRC_TID_ENABLE = 1LL << 55;

} // End namespace AMDGPU		} // End namespace AMDGPU

namespace SI {		namespace SI {

		namespace DispatchPacketOffset {

		enum {
		LOCAL_SIZE_X = 4,
		LOCAL_SIZE_Y = 6,
		LOCAL_SIZE_Z = 8,
		};

		}

namespace KernelInputOffsets {		namespace KernelInputOffsets {

/// Offsets in bytes from the start of the input buffer		/// Offsets in bytes from the start of the input buffer
enum Offsets {		enum Offsets {
NGROUPS_X = 0,		NGROUPS_X = 0,
NGROUPS_Y = 4,		NGROUPS_Y = 4,
NGROUPS_Z = 8,		NGROUPS_Z = 8,
GLOBAL_SIZE_X = 12,		GLOBAL_SIZE_X = 12,
[13 lines elided]

lib/Target/AMDGPU/SIRegisterInfo.h

[90 lines elided]	public:

/// \returns True if operands defined with this operand type can accept		/// \returns True if operands defined with this operand type can accept
/// an inline constant. i.e. An integer value in the range (-16, 64) or		/// an inline constant. i.e. An integer value in the range (-16, 64) or
/// -4.0f, -2.0f, -1.0f, -0.5f, 0.0f, 0.5f, 1.0f, 2.0f, 4.0f.		/// -4.0f, -2.0f, -1.0f, -0.5f, 0.0f, 0.5f, 1.0f, 2.0f, 4.0f.
bool opCanUseInlineConstant(unsigned OpType) const;		bool opCanUseInlineConstant(unsigned OpType) const;

enum PreloadedValue {		enum PreloadedValue {
SCRATCH_PTR = 0,		SCRATCH_PTR = 0,
		DISPATCH_PTR = 1,
INPUT_PTR = 3,		INPUT_PTR = 3,
TGID_X = 10,		TGID_X = 10,
TGID_Y = 11,		TGID_Y = 11,
TGID_Z = 12,		TGID_Z = 12,
SCRATCH_WAVE_OFFSET = 14,		SCRATCH_WAVE_OFFSET = 14,
FIRST_VGPR_VALUE = 15,		FIRST_VGPR_VALUE = 15,
TIDIG_X = FIRST_VGPR_VALUE,		TIDIG_X = FIRST_VGPR_VALUE,
TIDIG_Y = 16,		TIDIG_Y = 16,
[29 lines elided]

lib/Target/AMDGPU/SIRegisterInfo.cpp

[458 lines elided]	if (opCanUseLiteralConstant(OpType))
return true;		return true;

return OpType == AMDGPU::OPERAND_REG_INLINE_C;		return OpType == AMDGPU::OPERAND_REG_INLINE_C;
}		}

unsigned SIRegisterInfo::getPreloadedValue(const MachineFunction &MF,		unsigned SIRegisterInfo::getPreloadedValue(const MachineFunction &MF,
enum PreloadedValue Value) const {		enum PreloadedValue Value) const {

		const AMDGPUSubtarget &STI = MF.getSubtarget<AMDGPUSubtarget>();
const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();		const SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
switch (Value) {		switch (Value) {
case SIRegisterInfo::TGID_X:		case SIRegisterInfo::TGID_X:
return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 0);		return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 0);
case SIRegisterInfo::TGID_Y:		case SIRegisterInfo::TGID_Y:
return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 1);		return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 1);
case SIRegisterInfo::TGID_Z:		case SIRegisterInfo::TGID_Z:
return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 2);		return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 2);
case SIRegisterInfo::SCRATCH_WAVE_OFFSET:		case SIRegisterInfo::SCRATCH_WAVE_OFFSET:
if (MFI->getShaderType() != ShaderType::COMPUTE)		if (MFI->getShaderType() != ShaderType::COMPUTE)
return MFI->ScratchOffsetReg;		return MFI->ScratchOffsetReg;
return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 4);		return AMDGPU::SReg_32RegClass.getRegister(MFI->NumUserSGPRs + 4);
case SIRegisterInfo::SCRATCH_PTR:		case SIRegisterInfo::SCRATCH_PTR:
return AMDGPU::SGPR2_SGPR3;		return AMDGPU::SGPR2_SGPR3;
case SIRegisterInfo::INPUT_PTR:		case SIRegisterInfo::INPUT_PTR:
		return STI.isAmdHsaOS() ? AMDGPU::SGPR2_SGPR3 : AMDGPU::SGPR0_SGPR1;
		case SIRegisterInfo::DISPATCH_PTR:
return AMDGPU::SGPR0_SGPR1;		return AMDGPU::SGPR0_SGPR1;
case SIRegisterInfo::TIDIG_X:		case SIRegisterInfo::TIDIG_X:
return AMDGPU::VGPR0;		return AMDGPU::VGPR0;
case SIRegisterInfo::TIDIG_Y:		case SIRegisterInfo::TIDIG_Y:
return AMDGPU::VGPR1;		return AMDGPU::VGPR1;
case SIRegisterInfo::TIDIG_Z:		case SIRegisterInfo::TIDIG_Z:
return AMDGPU::VGPR2;		return AMDGPU::VGPR2;
}		}
[50 lines elided]

test/CodeGen/AMDGPU/work-item-intrinsics.ll

	; RUN: llc -march=amdgcn -mcpu=SI -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=SI -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=SI-NOHSA -check-prefix=GCN-NOHSA -check-prefix=FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=VI -check-prefix=GCN -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=VI -check-prefix=VI-NOHSA -check-prefix=GCN -check-prefix=GCN-NOHSA -check-prefix=FUNC %s
				; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -verify-machineinstrs < %s | FileCheck -check-prefix=SI -check-prefix=GCN -check-prefix=HSA -check-prefix=FUNC %s
	; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s


	; FUNC-LABEL: {{^}}ngroups_x:			; FUNC-LABEL: {{^}}ngroups_x:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[0].X			; EG: MOV [[VAL]], KC0[0].X

	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0			; GCN-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @ngroups_x (i32 addrspace(1)* %out) {			define void @ngroups_x (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.ngroups.x() #0			%0 = call i32 @llvm.r600.read.ngroups.x() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}ngroups_y:			; FUNC-LABEL: {{^}}ngroups_y:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[0].Y			; EG: MOV [[VAL]], KC0[0].Y

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x1			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x1
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x4			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x4
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @ngroups_y (i32 addrspace(1)* %out) {			define void @ngroups_y (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.ngroups.y() #0			%0 = call i32 @llvm.r600.read.ngroups.y() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}ngroups_z:			; FUNC-LABEL: {{^}}ngroups_z:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[0].Z			; EG: MOV [[VAL]], KC0[0].Z

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x2			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x2
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x8			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x8
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @ngroups_z (i32 addrspace(1)* %out) {			define void @ngroups_z (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.ngroups.z() #0			%0 = call i32 @llvm.r600.read.ngroups.z() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}global_size_x:			; FUNC-LABEL: {{^}}global_size_x:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[0].W			; EG: MOV [[VAL]], KC0[0].W

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x3			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x3
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0xc			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0xc
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @global_size_x (i32 addrspace(1)* %out) {			define void @global_size_x (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.global.size.x() #0			%0 = call i32 @llvm.r600.read.global.size.x() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}global_size_y:			; FUNC-LABEL: {{^}}global_size_y:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[1].X			; EG: MOV [[VAL]], KC0[1].X

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x4			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x4
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x10			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x10
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @global_size_y (i32 addrspace(1)* %out) {			define void @global_size_y (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.global.size.y() #0			%0 = call i32 @llvm.r600.read.global.size.y() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}global_size_z:			; FUNC-LABEL: {{^}}global_size_z:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[1].Y			; EG: MOV [[VAL]], KC0[1].Y

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x5			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x5
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x14			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x14
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @global_size_z (i32 addrspace(1)* %out) {			define void @global_size_z (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.global.size.z() #0			%0 = call i32 @llvm.r600.read.global.size.z() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}local_size_x:			; FUNC-LABEL: {{^}}local_size_x:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[1].Z			; EG: MOV [[VAL]], KC0[1].Z

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x6			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x6
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x18			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x18
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; FIXME: Need to teach SelectionDAG about per-address space legalization
				; of ext loads. We should be using SMRD.
				; HSA: buffer_load_ushort [[VVAL:v[0-9]+]], s[0:3], 0 offset:4
				arsenm: For this case specifically, it already should not happen. I added shouldReduceLoadWidth to stop producing < 32-bit type extloads. Is there somewhere missing this?
				; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN: buffer_store_dword [[VVAL]]
	define void @local_size_x (i32 addrspace(1)* %out) {			define void @local_size_x (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.local.size.x() #0			%0 = call i32 @llvm.r600.read.local.size.x() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

				arsenm: There should be tests that use all the combinations of local.size.x/y/z.
	; FUNC-LABEL: {{^}}local_size_y:			; FUNC-LABEL: {{^}}local_size_y:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[1].W			; EG: MOV [[VAL]], KC0[1].W

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x7			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x7
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x1c			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x1c
				; HSA: s_load_dword [[XY_VAL:s[0-9]+]], s[0:1], 0x1
				; HSA: s_ashr_i32 [[VAL:s[0-9]+]], [[XY_VAL]], 16
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN: buffer_store_dword [[VVAL]]
	define void @local_size_y (i32 addrspace(1)* %out) {			define void @local_size_y (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.local.size.y() #0			%0 = call i32 @llvm.r600.read.local.size.y() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}local_size_z:			; FUNC-LABEL: {{^}}local_size_z:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[2].X			; EG: MOV [[VAL]], KC0[2].X

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x8			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x8
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x20			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x20
				; HSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x2
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN: buffer_store_dword [[VVAL]]
	define void @local_size_z (i32 addrspace(1)* %out) {			define void @local_size_z (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.local.size.z() #0			%0 = call i32 @llvm.r600.read.local.size.z() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}
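
The HSA check lines in the three local_size_* tests above imply a particular packing: per the review discussion, the local sizes are stored as consecutive i16 values, and the offsets in the checks (a ushort load at byte offset 4 for x, a dword load at 0x1 shifted by 16 for y, a dword load at 0x2 for z) suggest x, y, and z sit at byte offsets 4, 6, and 8 from the dispatch pointer, assuming the SMRD offsets are read as dword offsets. The host-side C++ sketch below models that "32-bit load plus mask/shift" extraction. The helper names `loadDword` and `localSize` and the base offset of 4 are assumptions for illustration, not code from the patch, and the sketch uses a logical shift with a mask rather than the arithmetic shift in the check line.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: read a little-endian 32-bit word at a dword offset,
// mirroring what a scalar s_load_dword from the dispatch pointer returns.
static uint32_t loadDword(const uint8_t *Packet, unsigned DwordOffset) {
  const uint8_t *P = Packet + DwordOffset * 4;
  return uint32_t(P[0]) | (uint32_t(P[1]) << 8) |
         (uint32_t(P[2]) << 16) | (uint32_t(P[3]) << 24);
}

// Hypothetical helper: extract the 16-bit local size for Dim (0 = x, 1 = y,
// 2 = z). Assumes the values start at byte offset 4 and are 2 bytes apart,
// which matches the "Dim * 2" stride raised in review.
static uint32_t localSize(const uint8_t *Packet, unsigned Dim) {
  unsigned ByteOffset = 4 + Dim * 2;
  uint32_t Word = loadDword(Packet, ByteOffset / 4); // aligned 32-bit load
  unsigned Shift = (ByteOffset % 4) * 8;             // 0 for x and z, 16 for y
  return (Word >> Shift) & 0xffffu;                  // zero-extend the i16
}
```

With this layout, x and y share one dword (x in the low half, y in the high half), and z is the low half of the next dword, so no sub-dword load is needed for any of the three.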

	; FUNC-LABEL: {{^}}get_work_dim:			; FUNC-LABEL: {{^}}get_work_dim:
	; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]			; EG: MEM_RAT_CACHELESS STORE_RAW [[VAL:T[0-9]+\.X]]
	; EG: MOV [[VAL]], KC0[2].Z			; EG: MOV [[VAL]], KC0[2].Z

	; SI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0xb			; SI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0xb
	; VI: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x2c			; VI-NOHSA: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x2c
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], [[VAL]]
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @get_work_dim (i32 addrspace(1)* %out) {			define void @get_work_dim (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.AMDGPU.read.workdim() #0			%0 = call i32 @llvm.AMDGPU.read.workdim() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; The tgid values are stored in sgprs offset by the number of user sgprs.			; The tgid values are stored in sgprs offset by the number of user sgprs.
	; Currently we always use exactly 2 user sgprs for the pointer to the			; Currently we always use exactly 2 user sgprs for the pointer to the
	; kernel arguments, but this may change in the future.			; kernel arguments, but this may change in the future.

	; FUNC-LABEL: {{^}}tgid_x:			; FUNC-LABEL: {{^}}tgid_x:
	; GCN: v_mov_b32_e32 [[VVAL:v[0-9]+]], s4			; GCN-NOHSA: v_mov_b32_e32 [[VVAL:v[0-9]+]], s4
	; GCN: buffer_store_dword [[VVAL]]			; GCN-NOHSA: buffer_store_dword [[VVAL]]
	define void @tgid_x (i32 addrspace(1)* %out) {			define void @tgid_x (i32 addrspace(1)* %out) {
	entry:			entry:
	%0 = call i32 @llvm.r600.read.tgid.x() #0			%0 = call i32 @llvm.r600.read.tgid.x() #0
	store i32 %0, i32 addrspace(1)* %out			store i32 %0, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}tgid_y:			; FUNC-LABEL: {{^}}tgid_y: