This is an archive of the discontinued LLVM Phabricator instance.

Differential D80032

[AMDGPU] Always expand ext/insertelement with divergent idx
ClosedPublic

Authored by rampitec on May 15 2020, 1:38 PM.


Details

Reviewers

arsenm

Commits

rG4eecf171645e: [AMDGPU] Always expand ext/insertelement with divergent idx

Summary

Even though a series of cmp/cndmask instructions can produce quite a lot of
code, that is still better than a loop. In the case of doubles we
would even produce two loops.
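
As a rough illustration of the cmp/cndmask series the summary refers to, the following C++ sketch emulates, for a single lane, a chain of compares and selects that picks an element by a variable index without any indexed register access. `extractViaSelectChain` and `selfCheckExtract` are hypothetical names used only for this illustration; they are not LLVM APIs, and the real lowering operates on SelectionDAG nodes rather than arrays.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not LLVM API): emulates, for one lane, the chain of
// compare/select operations that replaces a variable-index extract.
template <typename T, std::size_t N>
T extractViaSelectChain(const std::array<T, N> &Vec, std::uint32_t Idx) {
  static_assert(N > 0, "vector must have at least one element");
  T V = Vec[0];                   // element 0 is the starting value
  for (std::uint32_t I = 1; I < N; ++I)
    V = (Idx == I) ? Vec[I] : V;  // one compare and one select per element
  return V;
}

// Small self-check: the chain agrees with plain indexing for in-range indices.
inline void selfCheckExtract() {
  const std::array<int, 4> Vec{7, 9, 11, 13};
  for (std::uint32_t I = 0; I < Vec.size(); ++I)
    assert(extractViaSelectChain(Vec, I) == Vec[I]);
}
```

Every lane evaluates the same straight-line sequence, so the cost of this form grows with the number of elements rather than with how many distinct index values the lanes hold. In this sketch an out-of-range index simply yields element 0.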

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rampitec created this revision. May 15 2020, 1:38 PM

Herald added a project: Restricted Project. May 15 2020, 1:38 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others.

arsenm added inline comments. May 18 2020, 7:53 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
102–106	I would invert this and rename it. How about -amdgpu-use-divergent-register-indexing, default false?
9542–9545	GlobalISel needs the compare and select path implemented

Inverted the option as suggested.

rampitec added inline comments. May 20 2020, 12:01 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9542–9545	Yes, although that is a separate issue. GlobalISel also needs to work with non-power of two vectors for movrel. Yet another piece of work is to tune the limits, they seem to be suboptimal at least for doubles.

rampitec added a child revision: D80322: [AMDGPU] Tune threshold for cmp/select vector lowering. May 20 2020, 2:08 PM

arsenm accepted this revision. May 20 2020, 3:28 PM

This revision is now accepted and ready to land. May 20 2020, 3:28 PM

Closed by commit rG4eecf171645e: [AMDGPU] Always expand ext/insertelement with divergent idx (authored by rampitec). May 20 2020, 4:01 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

lib/

Target/

AMDGPU/

SIISelLowering.cpp

17 lines

test/

CodeGen/

AMDGPU/

extract_vector_dynelt.ll

28 lines

indirect-addressing-si-gfx9.ll

42 lines

indirect-addressing-si-pregfx9.ll

40 lines

indirect-addressing-si.ll

148 lines

insert_vector_dynelt.ll

34 lines

scratch-simple.ll

12 lines

Diff 265376

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


[...]	static cl::opt<bool> DisableLoopAlignment(
"amdgpu-disable-loop-alignment",		"amdgpu-disable-loop-alignment",
cl::desc("Do not align and prefetch loops"),		cl::desc("Do not align and prefetch loops"),
cl::init(false));		cl::init(false));

static cl::opt<bool> VGPRReserveforSGPRSpill(		static cl::opt<bool> VGPRReserveforSGPRSpill(
"amdgpu-reserve-vgpr-for-sgpr-spill",		"amdgpu-reserve-vgpr-for-sgpr-spill",
cl::desc("Allocates one VGPR for future SGPR Spill"), cl::init(true));		cl::desc("Allocates one VGPR for future SGPR Spill"), cl::init(true));

		static cl::opt<bool> UseDivergentRegisterIndexing(
		"amdgpu-use-divergent-register-indexing",
		cl::Hidden,
		cl::desc("Use indirect register addressing for divergent indexes"),
		cl::init(false));

static bool hasFP32Denormals(const MachineFunction &MF) {		static bool hasFP32Denormals(const MachineFunction &MF) {
const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();		const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
return Info->getMode().allFP32Denormals();		return Info->getMode().allFP32Denormals();
}		}

static bool hasFP64FP16Denormals(const MachineFunction &MF) {		static bool hasFP64FP16Denormals(const MachineFunction &MF) {
const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();		const SIMachineFunctionInfo *Info = MF.getInfo<SIMachineFunctionInfo>();
return Info->getMode().allFP64FP16Denormals();		return Info->getMode().allFP64FP16Denormals();
[...]	SDValue SITargetLowering::performExtractVectorEltCombine(
unsigned VecSize = VecVT.getSizeInBits();		unsigned VecSize = VecVT.getSizeInBits();
unsigned EltSize = EltVT.getSizeInBits();		unsigned EltSize = EltVT.getSizeInBits();

// EXTRACT_VECTOR_ELT (<n x e>, var-idx) => n x select (e, const-idx)		// EXTRACT_VECTOR_ELT (<n x e>, var-idx) => n x select (e, const-idx)
// This elminates non-constant index and subsequent movrel or scratch access.		// This elminates non-constant index and subsequent movrel or scratch access.
// Sub-dword vectors of size 2 dword or less have better implementation.		// Sub-dword vectors of size 2 dword or less have better implementation.
// Vectors of size bigger than 8 dwords would yield too many v_cndmask_b32		// Vectors of size bigger than 8 dwords would yield too many v_cndmask_b32
// instructions.		// instructions.
if (VecSize <= 256 && (VecSize > 64 || EltSize >= 32) &&		// Always do this if var-idx is divergent, otherwise it will become a loop.
		if (!UseDivergentRegisterIndexing &&
		(VecSize <= 256 || N->getOperand(1)->isDivergent()) &&
		(VecSize > 64 || EltSize >= 32) &&
!isa<ConstantSDNode>(N->getOperand(1))) {		!isa<ConstantSDNode>(N->getOperand(1))) {
SDLoc SL(N);		SDLoc SL(N);
SDValue Idx = N->getOperand(1);		SDValue Idx = N->getOperand(1);
SDValue V;		SDValue V;
for (unsigned I = 0, E = VecVT.getVectorNumElements(); I < E; ++I) {		for (unsigned I = 0, E = VecVT.getVectorNumElements(); I < E; ++I) {
SDValue IC = DAG.getVectorIdxConstant(I, SL);		SDValue IC = DAG.getVectorIdxConstant(I, SL);
SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT, Vec, IC);		SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT, Vec, IC);
if (I == 0)		if (I == 0)
[...]	SITargetLowering::performInsertVectorEltCombine(SDNode *N,
unsigned EltSize = EltVT.getSizeInBits();		unsigned EltSize = EltVT.getSizeInBits();

// INSERT_VECTOR_ELT (<n x e>, var-idx)		// INSERT_VECTOR_ELT (<n x e>, var-idx)
// => BUILD_VECTOR n x select (e, const-idx)		// => BUILD_VECTOR n x select (e, const-idx)
// This elminates non-constant index and subsequent movrel or scratch access.		// This elminates non-constant index and subsequent movrel or scratch access.
// Sub-dword vectors of size 2 dword or less have better implementation.		// Sub-dword vectors of size 2 dword or less have better implementation.
// Vectors of size bigger than 8 dwords would yield too many v_cndmask_b32		// Vectors of size bigger than 8 dwords would yield too many v_cndmask_b32
// instructions.		// instructions.
if (isa<ConstantSDNode>(Idx) ||		// Always do this if var-idx is divergent, otherwise it will become a loop.
VecSize > 256 || (VecSize <= 64 && EltSize < 32))		if (UseDivergentRegisterIndexing || isa<ConstantSDNode>(Idx) ||
		(VecSize > 256 && !Idx->isDivergent()) \|\|
		(VecSize <= 64 && EltSize < 32))
return SDValue();		return SDValue();

SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
SDLoc SL(N);		SDLoc SL(N);
SDValue Ins = N->getOperand(1);		SDValue Ins = N->getOperand(1);
EVT IdxVT = Idx.getValueType();		EVT IdxVT = Idx.getValueType();

SmallVector<SDValue, 16> Ops;		SmallVector<SDValue, 16> Ops;
[...]
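
The two guards changed in this file can be restated as standalone predicates, which makes it easier to see that they are logical complements of each other. The sketch below is an illustration only: `DynIndexAccess`, `expandsExtract`, `skipsInsert`, and `checkGuardsAgree` are hypothetical names, and the struct fields stand in for `VecSize`, `EltSize`, `isa<ConstantSDNode>(Idx)`, and `Idx->isDivergent()` as they appear in the diff. Any conditions outside the lines shown on this page are not modelled.

```cpp
#include <cassert>

// Hypothetical stand-in for the properties the two combines inspect.
struct DynIndexAccess {
  unsigned VecSizeBits;
  unsigned EltSizeBits;
  bool IdxIsConstant;
  bool IdxIsDivergent;
};

// Restates the new guard in performExtractVectorEltCombine:
// true means the extract is expanded into compares and selects.
bool expandsExtract(const DynIndexAccess &A, bool UseDivergentRegisterIndexing) {
  return !UseDivergentRegisterIndexing &&
         (A.VecSizeBits <= 256 || A.IdxIsDivergent) &&
         (A.VecSizeBits > 64 || A.EltSizeBits >= 32) &&
         !A.IdxIsConstant;
}

// Restates the new early return in performInsertVectorEltCombine:
// true means the combine bails out and the insert is left alone.
bool skipsInsert(const DynIndexAccess &A, bool UseDivergentRegisterIndexing) {
  return UseDivergentRegisterIndexing || A.IdxIsConstant ||
         (A.VecSizeBits > 256 && !A.IdxIsDivergent) ||
         (A.VecSizeBits <= 64 && A.EltSizeBits < 32);
}

// Checks over a small grid that one guard is the negation of the other.
void checkGuardsAgree() {
  const unsigned VecSizes[] = {32, 64, 128, 256, 512, 1024};
  const unsigned EltSizes[] = {1, 16, 32, 64};
  for (unsigned V : VecSizes)
    for (unsigned E : EltSizes)
      for (int Bits = 0; Bits < 8; ++Bits) {
        const DynIndexAccess A{V, E, (Bits & 1) != 0, (Bits & 2) != 0};
        const bool Flag = (Bits & 4) != 0;
        assert(expandsExtract(A, Flag) != skipsInsert(A, Flag));
      }
}
```

With the option left at its default of false, a 1024-bit vector such as `<16 x double>` is expanded when its index is divergent and left alone when the index is uniform; setting the option restores the previous register-indexing path in both cases.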

llvm/test/CodeGen/AMDGPU/extract_vector_dynelt.ll

	[...]
	; GCN: store_dword v[{{[0-9:]+}}], [[RES]]			; GCN: store_dword v[{{[0-9:]+}}], [[RES]]
	define amdgpu_kernel void @bit128_extelt(i32 addrspace(1)* %out, i32 %sel) {			define amdgpu_kernel void @bit128_extelt(i32 addrspace(1)* %out, i32 %sel) {
	entry:			entry:
	%ext = extractelement <128 x i1> <i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0>, i32 %sel			%ext = extractelement <128 x i1> <i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0, i1 1, i1 0>, i32 %sel
	%zext = zext i1 %ext to i32			%zext = zext i1 %ext to i32
	store i32 %zext, i32 addrspace(1)* %out			store i32 %zext, i32 addrspace(1)* %out
	ret void			ret void
	}			}

				; GCN-LABEL: {{^}}float32_extelt_vec:
				; GCN-NOT: buffer_
				; GCN-DAG: v_cmp_eq_u32_e{{32|64}} [[CC1:[^,]+]], 1, v0
				; GCN-DAG: v_cndmask_b32_e{{32|64}} [[V1:v[0-9]+]], 1.0, 2.0, [[CC1]]
				; GCN-DAG: v_mov_b32_e32 [[LASTVAL:v[0-9]+]], 0x42000000
				; GCN-DAG: v_cmp_ne_u32_e32 [[LASTCC:[^,]+]], 31, v0
				; GCN-DAG: v_cndmask_b32_e{{32|64}} v0, [[LASTVAL]], v{{[0-9]+}}, [[LASTCC]]
				define float @float32_extelt_vec(i32 %sel) {
				entry:
				%ext = extractelement <32 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0, float 17.0, float 18.0, float 19.0, float 20.0, float 21.0, float 22.0, float 23.0, float 24.0, float 25.0, float 26.0, float 27.0, float 28.0, float 29.0, float 30.0, float 31.0, float 32.0>, i32 %sel
				ret float %ext
				}

				; GCN-LABEL: {{^}}double16_extelt_vec:
				; GCN-NOT: buffer_
				; GCN-DAG: v_mov_b32_e32 [[V1HI:v[0-9]+]], 0x3ff19999
				; GCN-DAG: v_mov_b32_e32 [[V1LO:v[0-9]+]], 0x9999999a
				; GCN-DAG: v_mov_b32_e32 [[V2HI:v[0-9]+]], 0x4000cccc
				; GCN-DAG: v_mov_b32_e32 [[V2LO:v[0-9]+]], 0xcccccccd
				; GCN-DAG: v_cmp_eq_u32_e{{32|64}} [[CC1:[^,]+]], 1, v0
				; GCN-DAG: v_cndmask_b32_e{{32|64}} [[R1HI:v[0-9]+]], [[V1HI]], [[V2HI]], [[CC1]]
				; GCN-DAG: v_cndmask_b32_e{{32|64}} [[R1LO:v[0-9]+]], [[V1LO]], [[V2LO]], [[CC1]]
				define double @double16_extelt_vec(i32 %sel) {
				entry:
				%ext = extractelement <16 x double> <double 1.1, double 2.1, double 3.1, double 4.1, double 5.1, double 6.1, double 7.1, double 8.1, double 9.1, double 10.1, double 11.1, double 12.1, double 13.1, double 14.1, double 15.1, double 16.1>, i32 %sel
				ret double %ext
				}

llvm/test/CodeGen/AMDGPU/indirect-addressing-si-gfx9.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,GFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,GFX9 %s

	; indexing of vectors.			; indexing of vectors.

	; Subtest below moved from file test/CodeGen/AMDGPU/indirect-addressing-si.ll			; Subtest below moved from file test/CodeGen/AMDGPU/indirect-addressing-si.ll
	; to avoid gfx9 scheduling induced issues.			; to avoid gfx9 scheduling induced issues.


	; GCN-LABEL: {{^}}insert_vgpr_offset_multiple_in_block:			; GCN-LABEL: {{^}}insert_vgpr_offset_multiple_in_block:
	; GCN-DAG: s_load_dwordx16 s{{\[}}[[S_ELT0:[0-9]+]]:[[S_ELT15:[0-9]+]]{{\]}}			; GCN-DAG: s_load_dwordx16 s{{\[}}[[S_ELT0:[0-9]+]]:[[S_ELT15:[0-9]+]]{{\]}}
	; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]			; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]
	; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62			; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62

	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT15:[0-9]+]], s[[S_ELT15]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT15:[0-9]+]], s[[S_ELT15]]
	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]

	; GCN-DAG: v_add_u32_e32 [[IDX1:v[0-9]+]], 1, [[IDX0]]			; GCN: v_cmp_eq_u32_e32
				; GCN-COUNT-32: v_cndmask_b32

	; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:			; GCN-COUNT-4: buffer_store_dwordx4
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
	; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]
	; MOVREL-NEXT: v_movreld_b32_e32 v[[VEC_ELT0]], [[INS0]]

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(DST)
	; IDXMODE-NEXT: v_mov_b32_e32 v[[VEC_ELT0]], [[INS0]]
	; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP0]]

	; FIXME: Redundant copy
	; GCN: s_mov_b64 exec, [[MASK:s\[[0-9]+:[0-9]+\]]]

	; GCN: s_mov_b64 [[MASK]], exec

	; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX1]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX1]]
	; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]
	; MOVREL-NEXT: v_movreld_b32_e32 v{{[0-9]+}}, 63

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(DST)
	; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, 63
	; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP1]]

	; GCN: buffer_store_dwordx4 v{{\[}}[[VEC_ELT0]]:

	; GCN: buffer_store_dword [[INS0]]
	define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(<16 x i32> addrspace(1)* %out0, <16 x i32> addrspace(1)* %out1, i32 addrspace(1)* %in, <16 x i32> %vec0) #0 {			define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(<16 x i32> addrspace(1)* %out0, <16 x i32> addrspace(1)* %out1, i32 addrspace(1)* %in, <16 x i32> %vec0) #0 {
	entry:			entry:
	%id = call i32 @llvm.amdgcn.workitem.id.x() #1			%id = call i32 @llvm.amdgcn.workitem.id.x() #1
	%id.ext = zext i32 %id to i64			%id.ext = zext i32 %id to i64
	%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext			%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext
	%idx0 = load volatile i32, i32 addrspace(1)* %gep			%idx0 = load volatile i32, i32 addrspace(1)* %gep
	%idx1 = add i32 %idx0, 1			%idx1 = add i32 %idx0, 1
	%live.out.val = call i32 asm sideeffect "v_mov_b32 $0, 62", "=v"()			%live.out.val = call i32 asm sideeffect "v_mov_b32 $0, 62", "=v"()
	[...]

llvm/test/CodeGen/AMDGPU/indirect-addressing-si-pregfx9.ll

	[...]
	; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]			; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]
	; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62			; GCN-DAG: v_mov_b32 [[INS0:v[0-9]+]], 62

	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT15:[0-9]+]], s[[S_ELT15]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT15:[0-9]+]], s[[S_ELT15]]
	; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]			; GCN-DAG: v_mov_b32_e32 v[[VEC_ELT0:[0-9]+]], s[[S_ELT0]]

	; GCN-DAG: v_add_{{i32|u32}}_e32 [[IDX1:v[0-9]+]], vcc, 1, [[IDX0]]			; GCN-DAG: v_add_{{i32|u32}}_e32 [[IDX1:v[0-9]+]], vcc, 1, [[IDX0]]

	; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:			; GCN: v_cmp_eq_u32_e32
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-COUNT-32: v_cndmask_b32
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
	; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; GCN-COUNT-4: buffer_store_dwordx4
	; MOVREL-NEXT: v_movreld_b32_e32 v[[VEC_ELT0]], [[INS0]]

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(DST)
	; IDXMODE-NEXT: v_mov_b32_e32 v[[VEC_ELT0]], [[INS0]]
	; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP0]]

	; FIXME: Redundant copy
	; GCN: s_mov_b64 exec, [[MASK:s\[[0-9]+:[0-9]+\]]]

	; GCN: s_mov_b64 [[MASK]], exec

	; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX1]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX1]]
	; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]
	; MOVREL-NEXT: v_movreld_b32_e32 v{{[0-9]+}}, 63

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(DST)
	; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, 63
	; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP1]]

	; GCN: buffer_store_dwordx4 v{{\[}}[[VEC_ELT0]]:

	; GCN: buffer_store_dword [[INS0]]
	define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(<16 x i32> addrspace(1)* %out0, <16 x i32> addrspace(1)* %out1, i32 addrspace(1)* %in, <16 x i32> %vec0) #0 {			define amdgpu_kernel void @insert_vgpr_offset_multiple_in_block(<16 x i32> addrspace(1)* %out0, <16 x i32> addrspace(1)* %out1, i32 addrspace(1)* %in, <16 x i32> %vec0) #0 {
	entry:			entry:
	%id = call i32 @llvm.amdgcn.workitem.id.x() #1			%id = call i32 @llvm.amdgcn.workitem.id.x() #1
	%id.ext = zext i32 %id to i64			%id.ext = zext i32 %id to i64
	%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext			%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext
	%idx0 = load volatile i32, i32 addrspace(1)* %gep			%idx0 = load volatile i32, i32 addrspace(1)* %gep
	%idx1 = add i32 %idx0, 1			%idx1 = add i32 %idx0, 1
	%live.out.val = call i32 asm sideeffect "v_mov_b32 $0, 62", "=v"()			%live.out.val = call i32 asm sideeffect "v_mov_b32 $0, 62", "=v"()
	[...]

llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll

[...]	entry:
%value = extractelement <16 x i32> %or, i32 %index		%value = extractelement <16 x i32> %or, i32 %index
store i32 %value, i32 addrspace(1)* %out		store i32 %value, i32 addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}extract_neg_offset_vgpr:		; GCN-LABEL: {{^}}extract_neg_offset_vgpr:
; The offset depends on the register that holds the first element of the vector.		; The offset depends on the register that holds the first element of the vector.

; FIXME: The waitcnt for the argument load can go after the loop		; GCN: v_cmp_eq_u32_e32
; GCN: s_mov_b64 s{{\[[0-9]+:[0-9]+\]}}, exec		; GCN-COUNT-14: v_cndmask_b32
; GCN: [[LOOPBB:BB[0-9]+_[0-9]+]]:		; GCN: v_cndmask_b32_e32 [[RESULT:v[0-9]+]], 16
; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]], v{{[0-9]+}}
; GCN: s_and_saveexec_b64 vcc, vcc

; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe0
; MOVREL: v_movrels_b32_e32 [[RESULT:v[0-9]+]], v1

; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00
; IDXMODE: s_set_gpr_idx_on [[ADD_IDX]], gpr_idx(SRC0)
; IDXMODE: v_mov_b32_e32 [[RESULT:v[0-9]+]], v1
; IDXMODE: s_set_gpr_idx_off

; GCN: s_cbranch_execnz

; GCN: buffer_store_dword [[RESULT]]		; GCN: buffer_store_dword [[RESULT]]
define amdgpu_kernel void @extract_neg_offset_vgpr(i32 addrspace(1)* %out) {		define amdgpu_kernel void @extract_neg_offset_vgpr(i32 addrspace(1)* %out) {
entry:		entry:
%id = call i32 @llvm.amdgcn.workitem.id.x() #1		%id = call i32 @llvm.amdgcn.workitem.id.x() #1
%index = add i32 %id, -512		%index = add i32 %id, -512
%value = extractelement <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 %index		%value = extractelement <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 %index
store i32 %value, i32 addrspace(1)* %out		store i32 %value, i32 addrspace(1)* %out
ret void		ret void
[...]	entry:
%value = insertelement <16 x i32> %vec, i32 5, i32 %index		%value = insertelement <16 x i32> %vec, i32 5, i32 %index
store <16 x i32> %value, <16 x i32> addrspace(1)* %out		store <16 x i32> %value, <16 x i32> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}insert_neg_offset_vgpr:		; GCN-LABEL: {{^}}insert_neg_offset_vgpr:
; The offset depends on the register that holds the first element of the vector.		; The offset depends on the register that holds the first element of the vector.

; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], 1{{$}}		; GCN: v_cmp_eq_u32_e32
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}		; GCN-COUNT-16: v_cndmask_b32
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}		; GCN-COUNT-4: buffer_store_dwordx4
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 5{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 6{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 7{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 8{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 9{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 10{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 11{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 12{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 13{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 14{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 15{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 16{{$}}

; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec
; GCN: [[LOOPBB:BB[0-9]+_[0-9]+]]:
; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]
; GCN: s_and_saveexec_b64 vcc, vcc

; MOVREL: s_add_i32 m0, [[READLANE]], 0xfffffe00
; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], 33

; IDXMODE: s_addk_i32 [[ADD_IDX:s[0-9]+]], 0xfe00{{$}}
; IDXMODE: s_set_gpr_idx_on [[ADD_IDX]], gpr_idx(DST)
; IDXMODE: v_mov_b32_e32 v{{[0-9]+}}, 33
; IDXMODE: s_set_gpr_idx_off

; GCN: s_cbranch_execnz [[LOOPBB]]
; GCN: s_mov_b64 exec, [[SAVEEXEC]]

; GCN: buffer_store_dword
define amdgpu_kernel void @insert_neg_offset_vgpr(i32 addrspace(1)* %in, <16 x i32> addrspace(1)* %out) {		define amdgpu_kernel void @insert_neg_offset_vgpr(i32 addrspace(1)* %in, <16 x i32> addrspace(1)* %out) {
entry:		entry:
%id = call i32 @llvm.amdgcn.workitem.id.x() #1		%id = call i32 @llvm.amdgcn.workitem.id.x() #1
%index = add i32 %id, -512		%index = add i32 %id, -512
%value = insertelement <16 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 33, i32 %index		%value = insertelement <16 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 33, i32 %index
store <16 x i32> %value, <16 x i32> addrspace(1)* %out		store <16 x i32> %value, <16 x i32> addrspace(1)* %out
ret void		ret void
}		}

; GCN-LABEL: {{^}}insert_neg_inline_offset_vgpr:		; GCN-LABEL: {{^}}insert_neg_inline_offset_vgpr:

; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], 1{{$}}		; GCN: v_cmp_eq_u32_e32
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], 2{{$}}		; GCN-COUNT-16: v_cndmask_b32
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT2:v[0-9]+]], 3{{$}}		; GCN-COUNT-4: buffer_store_dwordx4
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 4{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 5{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 6{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 7{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 8{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 9{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 10{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 11{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 12{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 13{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 14{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 15{{$}}
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT3:v[0-9]+]], 16{{$}}
; GCN-DAG: v_mov_b32_e32 [[VAL:v[0-9]+]], 0x1f4{{$}}

; GCN: s_mov_b64 [[SAVEEXEC:s\[[0-9]+:[0-9]+\]]], exec

; The offset depends on the register that holds the first element of the vector.
; GCN: v_readfirstlane_b32 [[READLANE:s[0-9]+]]

; MOVREL: s_add_i32 m0, [[READLANE]], -16
; MOVREL: v_movreld_b32_e32 [[VEC_ELT0]], [[VAL]]

; IDXMODE: s_add_i32 [[ADD_IDX:s[0-9]+]], [[READLANE]], -16
; IDXMODE: s_set_gpr_idx_on [[ADD_IDX]], gpr_idx(DST)
; IDXMODE: v_mov_b32_e32 [[VEC_ELT0]], [[VAL]]
; IDXMODE: s_set_gpr_idx_off

; GCN: s_cbranch_execnz
define amdgpu_kernel void @insert_neg_inline_offset_vgpr(i32 addrspace(1)* %in, <16 x i32> addrspace(1)* %out) {		define amdgpu_kernel void @insert_neg_inline_offset_vgpr(i32 addrspace(1)* %in, <16 x i32> addrspace(1)* %out) {
entry:		entry:
%id = call i32 @llvm.amdgcn.workitem.id.x() #1		%id = call i32 @llvm.amdgcn.workitem.id.x() #1
%index = add i32 %id, -16		%index = add i32 %id, -16
%value = insertelement <16 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 500, i32 %index		%value = insertelement <16 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 500, i32 %index
store <16 x i32> %value, <16 x i32> addrspace(1)* %out		store <16 x i32> %value, <16 x i32> addrspace(1)* %out
ret void		ret void
}		}

; When the block is split to insert the loop, make sure any other		; When the block is split to insert the loop, make sure any other
; places that need to be expanded in the same block are also handled.		; places that need to be expanded in the same block are also handled.

; GCN-LABEL: {{^}}extract_vgpr_offset_multiple_in_block:		; GCN-LABEL: {{^}}extract_vgpr_offset_multiple_in_block:

; FIXME: Why is vector copied in between?

; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]		; GCN-DAG: {{buffer|flat|global}}_load_dword [[IDX0:v[0-9]+]]
; GCN-DAG: s_mov_b32 [[S_ELT1:s[0-9]+]], 9		; GCN: v_cmp_eq_u32
; GCN-DAG: s_mov_b32 [[S_ELT0:s[0-9]+]], 7		; GCN: v_cndmask_b32_e64 [[RESULT0:v[0-9]+]], 16,
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT0:v[0-9]+]], [[S_ELT0]]		; GCN: v_cndmask_b32_e64 [[RESULT1:v[0-9]+]], 16,
; GCN-DAG: v_mov_b32_e32 [[VEC_ELT1:v[0-9]+]], [[S_ELT1]]

; GCN: s_mov_b64 [[MASK:s\[[0-9]+:[0-9]+\]]], exec

; GCN: s_waitcnt vmcnt(0)
; PREGFX9: v_add_{{i32|u32}}_e32 [[IDX1:v[0-9]+]], vcc, 1, [[IDX0]]
; GFX9: v_add_{{i32|u32}}_e32 [[IDX1:v[0-9]+]], 1, [[IDX0]]


; GCN: [[LOOP0:BB[0-9]+_[0-9]+]]:
; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
; GCN: s_and_saveexec_b64 vcc, vcc

; MOVREL: s_mov_b32 m0, [[READLANE]]
; MOVREL: v_movrels_b32_e32 [[MOVREL0:v[0-9]+]], [[VEC_ELT0]]

; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(SRC0)
; IDXMODE: v_mov_b32_e32 [[MOVREL0:v[0-9]+]], [[VEC_ELT0]]
; IDXMODE: s_set_gpr_idx_off

; GCN-NEXT: s_xor_b64 exec, exec, vcc
; GCN-NEXT: s_cbranch_execnz [[LOOP0]]

; FIXME: Redundant copy
; GCN: s_mov_b64 exec, [[MASK]]

; GCN: v_mov_b32_e32 [[VEC_ELT0_2:v[0-9]+]], [[S_ELT0]]

; GCN: s_mov_b64 [[MASK2:s\[[0-9]+:[0-9]+\]]], exec

; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:
; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX1]]
; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX1]]
; GCN: s_and_saveexec_b64 vcc, vcc

; MOVREL: s_mov_b32 m0, [[READLANE]]
; MOVREL-NEXT: v_movrels_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT0_2]]

; IDXMODE: s_set_gpr_idx_on [[READLANE]], gpr_idx(SRC0)
; IDXMODE-NEXT: v_mov_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT0_2]]
; IDXMODE: s_set_gpr_idx_off

; GCN-NEXT: s_xor_b64 exec, exec, vcc
; GCN: s_cbranch_execnz [[LOOP1]]

; GCN: buffer_store_dword [[MOVREL0]]		; GCN: buffer_store_dword [[RESULT0]]
; GCN: buffer_store_dword [[MOVREL1]]		; GCN: buffer_store_dword [[RESULT1]]
define amdgpu_kernel void @extract_vgpr_offset_multiple_in_block(i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, i32 addrspace(1)* %in) #0 {		define amdgpu_kernel void @extract_vgpr_offset_multiple_in_block(i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, i32 addrspace(1)* %in) #0 {
entry:		entry:
%id = call i32 @llvm.amdgcn.workitem.id.x() #1		%id = call i32 @llvm.amdgcn.workitem.id.x() #1
%id.ext = zext i32 %id to i64		%id.ext = zext i32 %id to i64
%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext		%gep = getelementptr inbounds i32, i32 addrspace(1)* %in, i64 %id.ext
%idx0 = load volatile i32, i32 addrspace(1)* %gep		%idx0 = load volatile i32, i32 addrspace(1)* %gep
%idx1 = add i32 %idx0, 1		%idx1 = add i32 %idx0, 1
%val0 = extractelement <16 x i32> <i32 7, i32 9, i32 11, i32 13, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 %idx0		%val0 = extractelement <16 x i32> <i32 7, i32 9, i32 11, i32 13, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16>, i32 %idx0
[...]
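
For comparison, the loop that the removed CHECK lines describe (`v_readfirstlane_b32`, `v_cmp_eq_u32`, `s_and_saveexec_b64`, an indexed move, `s_xor_b64 exec`, `s_cbranch_execnz`) can be emulated on the host. The sketch below is a simplified model, not compiler code: `waterfallExtract`, `WaterfallResult`, the 64-lane wavefront, and the 16-element vector are assumptions made for illustration, and all indices are assumed to be in range.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 64;  // assumed wavefront size
constexpr std::size_t kElts = 16;   // e.g. a <16 x i32> vector

struct WaterfallResult {
  std::array<std::int32_t, kLanes> Values{};
  unsigned Iterations = 0;
};

// Hypothetical emulation of the per-wavefront loop used for a divergent index.
WaterfallResult waterfallExtract(const std::array<std::int32_t, kElts> &Vec,
                                 const std::array<std::uint32_t, kLanes> &Idx) {
  WaterfallResult R;
  std::uint64_t Exec = ~std::uint64_t{0};  // all lanes start active
  while (Exec != 0) {
    // v_readfirstlane_b32: index held by the first still-active lane.
    std::size_t First = 0;
    while (((Exec >> First) & 1) == 0)
      ++First;
    const std::uint32_t Uniform = Idx[First];
    assert(Uniform < kElts && "sketch assumes in-range indices");

    // v_cmp_eq_u32: active lanes that share this index.
    std::uint64_t Match = 0;
    for (std::size_t L = 0; L < kLanes; ++L)
      if (((Exec >> L) & 1) && Idx[L] == Uniform)
        Match |= std::uint64_t{1} << L;

    // Indexed access (movrel or gpr_idx mode) with a now-uniform index.
    for (std::size_t L = 0; L < kLanes; ++L)
      if ((Match >> L) & 1)
        R.Values[L] = Vec[Uniform];

    // Net effect of s_and_saveexec_b64 followed by s_xor_b64: retire the
    // lanes just handled; s_cbranch_execnz repeats while any remain.
    Exec ^= Match;
    ++R.Iterations;
  }
  return R;
}
```

In this model the loop runs once per distinct index value among the active lanes, so a fully divergent index over 16 elements takes 16 iterations, whereas a uniform index takes one.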

llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll

	[...]
	; GCN-DAG: v_cmp_ne_u32_e32 [[CCL:[^,]+]], s{{[0-9]+}}, [[LASTIDX]]			; GCN-DAG: v_cmp_ne_u32_e32 [[CCL:[^,]+]], s{{[0-9]+}}, [[LASTIDX]]
	; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1, v{{[0-9]+}}, [[CCL]]			; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1, v{{[0-9]+}}, [[CCL]]
	define amdgpu_kernel void @bit128_inselt(<128 x i1> addrspace(1)* %out, <128 x i1> %vec, i32 %sel) {			define amdgpu_kernel void @bit128_inselt(<128 x i1> addrspace(1)* %out, <128 x i1> %vec, i32 %sel) {
	entry:			entry:
	%v = insertelement <128 x i1> %vec, i1 1, i32 %sel			%v = insertelement <128 x i1> %vec, i1 1, i32 %sel
	store <128 x i1> %v, <128 x i1> addrspace(1)* %out			store <128 x i1> %v, <128 x i1> addrspace(1)* %out
	ret void			ret void
	}			}

				; GCN-LABEL: {{^}}float32_inselt_vec:
				; GCN-NOT: buffer_
				; GCN-COUNT-32: v_cmp_ne_u32
				; GCN-COUNT-32: v_cndmask_b32_e{{32|64}} v{{[0-9]+}}, 1.0,
				define amdgpu_ps <32 x float> @float32_inselt_vec(<32 x float> %vec, i32 %sel) {
				entry:
				%v = insertelement <32 x float> %vec, float 1.000000e+00, i32 %sel
				ret <32 x float> %v
				}

				; GCN-LABEL: {{^}}double8_inselt_vec:
				; GCN-NOT: buffer_
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				; GCN: v_cmp_eq_u32
				; GCN-COUNT-2: v_cndmask_b32
				define <8 x double> @double8_inselt_vec(<8 x double> %vec, i32 %sel) {
				entry:
				%v = insertelement <8 x double> %vec, double 1.000000e+00, i32 %sel
				ret <8 x double> %v
				}
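
The `double8_inselt_vec` checks above expect eight compares, each followed by two `v_cndmask_b32`. One way to picture that shape is a per-element select applied separately to each 32-bit half of a 64-bit element. The following is a host-side sketch under that assumption; `insertViaSelects` and `selfCheckInsert` are hypothetical names, and raw 64-bit integers stand in for the bit patterns of doubles.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not LLVM API): rebuilds a vector of 64-bit elements
// using one compare per element and one 32-bit select per dword.
template <std::size_t N>
std::array<std::uint64_t, N> insertViaSelects(
    const std::array<std::uint64_t, N> &Vec, std::uint64_t Ins,
    std::uint32_t Idx) {
  const std::uint32_t InsLo = static_cast<std::uint32_t>(Ins);
  const std::uint32_t InsHi = static_cast<std::uint32_t>(Ins >> 32);
  std::array<std::uint64_t, N> Out{};
  for (std::uint32_t I = 0; I < N; ++I) {
    const bool Hit = (Idx == I);  // one compare per element
    const std::uint32_t Lo = static_cast<std::uint32_t>(Vec[I]);
    const std::uint32_t Hi = static_cast<std::uint32_t>(Vec[I] >> 32);
    const std::uint32_t NewLo = Hit ? InsLo : Lo;  // select for the low dword
    const std::uint32_t NewHi = Hit ? InsHi : Hi;  // select for the high dword
    Out[I] = (static_cast<std::uint64_t>(NewHi) << 32) | NewLo;
  }
  return Out;
}

// Small self-check: only the indexed element changes.
inline void selfCheckInsert() {
  const std::array<std::uint64_t, 8> Vec{10, 11, 12, 13, 14, 15, 16, 17};
  const auto Out = insertViaSelects(Vec, 99, 3);
  for (std::uint32_t I = 0; I < Vec.size(); ++I)
    assert(Out[I] == (I == 3 ? 99u : Vec[I]));
}
```

In this sketch an out-of-range index leaves every element unchanged, since no compare matches.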

llvm/test/CodeGen/AMDGPU/scratch-simple.ll

	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=verde -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,SI,SIVI %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=verde -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,SI,SIVI %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx803 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,VI,SIVI %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx803 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,VI,SIVI %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX9,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX9,GFX9_10 %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -filetype=obj < %s | llvm-readobj -r | FileCheck --check-prefix=RELS %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx900 -filetype=obj -amdgpu-use-divergent-register-indexing < %s | llvm-readobj -r | FileCheck --check-prefix=RELS %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W32,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W32,GFX9_10 %s
	; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global,+wavefrontsize64 -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W64,GFX9_10 %s			; RUN: llc -march=amdgcn -mtriple=amdgcn-- -mcpu=gfx1010 -mattr=-flat-for-global,+wavefrontsize64 -amdgpu-use-divergent-register-indexing -verify-machineinstrs < %s | FileCheck --check-prefixes=GCN,GFX10_W64,GFX9_10 %s

	; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD0 0x0			; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD0 0x0
	; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD1 0x0			; RELS: R_AMDGPU_ABS32_LO SCRATCH_RSRC_DWORD1 0x0

	; This used to fail due to a v_add_i32 instruction with an illegal immediate			; This used to fail due to a v_add_i32 instruction with an illegal immediate
	; operand that was created during Local Stack Slot Allocation. Test case derived			; operand that was created during Local Stack Slot Allocation. Test case derived
	; from https://bugs.freedesktop.org/show_bug.cgi?id=96602			; from https://bugs.freedesktop.org/show_bug.cgi?id=96602
	;			;
	[...]