This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Should always start from the first register in VGPR indexing.
AbandonedPublic

Authored by cfang on Dec 3 2018, 3:47 PM.

Details

Reviewers

arsenm
msearles
rampitec

Summary

SubReg should always be AMDGPU::sub0. The 8-bit m0 field that holds the index is unsigned, so
we can guarantee that the index is non-negative (assuming the program itself is correct) only when
indexing starts from the very first register in the vector.

The original optimization shifts the base to AMDGPU::sub0 + Offset. To address the registers to the
left of the new base (Offset), the index would then have to be negative, which the unsigned field
cannot express. The optimization is therefore invalid.
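
The argument in the summary can be illustrated with a small toy model. This is only a sketch, assuming (as the summary states) that the hardware reads the index as an unsigned 8-bit field. The names `addressedElement`, `unfolded` and `folded`, and the 16-element vector size, are hypothetical and are not part of the patch.

```cpp
#include <cstdint>
#include <optional>

// Toy model, not LLVM code: a 16-element vector held in consecutive registers.
constexpr int NumElts = 16;

// Element reached when the index is read as an unsigned 8-bit value and added
// to the base element. Returns nullopt when the access leaves the vector.
std::optional<int> addressedElement(int BaseElt, int32_t Index) {
  uint8_t HwIndex = static_cast<uint8_t>(Index); // assumed unsigned 8-bit field
  int Elt = BaseElt + HwIndex;
  if (Elt >= NumElts)
    return std::nullopt;
  return Elt;
}

// Base stays at sub0; the constant offset is added to the index.
std::optional<int> unfolded(int32_t Idx, int Offset) {
  return addressedElement(0, Idx + Offset);
}

// Base is moved to sub0 + Offset; the variable index is used unchanged.
std::optional<int> folded(int32_t Idx, int Offset) {
  return addressedElement(Offset, Idx);
}
```

With Idx = -1 and Offset = 1 the program legitimately wants element 0. In this model `unfolded` reaches it, whereas `folded` wraps the index to 255 and lands outside the vector.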

Diff Detail

Event Timeline

cfang created this revision. Dec 3 2018, 3:47 PM

Herald added subscribers: arphaman, t-tye, tpr and 6 others. Dec 3 2018, 3:47 PM

We should try to use some known bits information to keep this. I have a patch to add a machine version, but there might be a better way

In D55241#1317630, @arsenm wrote:

We should try to use some known bits information to keep this. I have a patch to add a machine version, but there might be a better way

Would you please explain how your known-bits approach would resolve the negative-index issue while keeping the optimization for gfx9+?
Or just post your patch. Thanks.

msearles added inline comments. Dec 3 2018, 4:28 PM

lib/Target/AMDGPU/SIISelLowering.cpp
3013	Typo: SunReg (should be SubReg). Typo is repeated in the second comment block as well.

In D55241#1317631, @cfang wrote:

Would you please explain how your known-bits approach would resolve the negative-index issue while keeping the optimization for gfx9+?
Or just post your patch. Thanks.

If you know the base index isn't negative, you don't need to disable this

D30466 is the primitive computeKnownBits
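
The known-bits idea suggested here can be sketched in isolation. The code below is a simplified stand-in and not the patch in D30466: `ToyKnownBits`, `knownAfterAnd` and `canFoldOffsetIntoBase` are hypothetical names, and the guard only illustrates the condition "the base index is known not to be negative".

```cpp
#include <cstdint>

// Hypothetical, simplified known-bits record for a 32-bit value:
// a set bit in Zero (One) means that bit is known to be 0 (1).
struct ToyKnownBits {
  uint32_t Zero = 0;
  uint32_t One = 0;

  // The value is non-negative when its sign bit is known to be zero.
  bool isNonNegative() const { return (Zero & 0x80000000u) != 0; }
};

// Known bits of (X & Mask) for a constant Mask: every bit cleared in Mask
// becomes known zero, and known ones survive only where Mask is set.
ToyKnownBits knownAfterAnd(ToyKnownBits X, uint32_t Mask) {
  return {X.Zero | ~Mask, X.One & Mask};
}

// Hypothetical guard: fold the constant offset into the base subregister
// only if the variable index is provably non-negative and the offset is
// inside the vector.
bool canFoldOffsetIntoBase(ToyKnownBits Index, int Offset, int NumElts) {
  return Index.isNonNegative() && Offset >= 0 && Offset < NumElts;
}
```

In this sketch a completely unknown index blocks the fold, while an index that has been masked with 0xF (so its sign bit is known zero) allows it.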

In D55241#1317688, @arsenm wrote:

If you know the base index isn't negative, you don't need to disable this

Theoretically that is correct. But in real-world applications the index is usually unknown to the compiler, and is most likely a variable.
Also, since the offset is a positive number and we choose to start the indirect VGPR indexing from the offset, the index is certain to be negative whenever we address
the registers to the left of "offset" in the vector.

I think the compiler can prove that the base index is non-negative only in very rare cases, and I doubt it is
worthwhile to keep the optimization just to save one (ADD) instruction in those cases.

cfang marked an inline comment as done. Dec 5 2018, 10:28 AM

cfang added inline comments.

lib/Target/AMDGPU/SIISelLowering.cpp
3013	Thanks. Will correct the typo (if we keep the code this way).

In D55241#1317703, @arsenm wrote:

D30466 is the primitive computeKnownBits

Has this patch been committed to trunk?

Fix typos.

No, it's not committed. Variable + constant is a common case in general.

It would probably be better to do this fold in the DAG for now though

In D55241#1323977, @arsenm wrote:

No, it's not committed. Variable + constant is a common case in general.

It would probably be better to do this fold in the DAG for now though

How can you guarantee, at the machine-instruction level, that the base is non-negative even if you do this fold in the DAG?
Something may happen in between.

Can this be closed after r349951?

Herald added a subscriber: jdoerfert. Feb 21 2019, 5:38 PM

cfang abandoned this revision. Feb 26 2019, 10:39 AM

Revision Contents

Path	Size
lib/Target/AMDGPU/SIISelLowering.cpp	37 lines
test/CodeGen/AMDGPU/indirect-addressing-si-pregfx9.ll	5 lines
test/CodeGen/AMDGPU/indirect-addressing-si.ll	51 lines

Diff 176498

lib/Target/AMDGPU/SIISelLowering.cpp

	Show All 12 Lines

	MachineBasicBlock::iterator First = RemainderBB->begin();			MachineBasicBlock::iterator First = RemainderBB->begin();
	BuildMI(*RemainderBB, First, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)			BuildMI(*RemainderBB, First, DL, TII->get(AMDGPU::S_MOV_B64), AMDGPU::EXEC)
	.addReg(SaveExec);			.addReg(SaveExec);

	return InsPt;			return InsPt;
	}			}

	// Returns subreg index, offset
	static std::pair<unsigned, int>
	computeIndirectRegAndOffset(const SIRegisterInfo &TRI,
	const TargetRegisterClass *SuperRC,
	unsigned VecReg,
	int Offset) {
	int NumElts = TRI.getRegSizeInBits(*SuperRC) / 32;

	// Skip out of bounds offsets, or else we would end up using an undefined
	// register.
	if (Offset >= NumElts || Offset < 0)
	return std::make_pair(AMDGPU::sub0, Offset);

	return std::make_pair(AMDGPU::sub0 + Offset, 0);
	}

	// Return true if the index is an SGPR and was set.			// Return true if the index is an SGPR and was set.
	static bool setM0ToIndexFromSGPR(const SIInstrInfo *TII,			static bool setM0ToIndexFromSGPR(const SIInstrInfo *TII,
	MachineRegisterInfo &MRI,			MachineRegisterInfo &MRI,
	MachineInstr &MI,			MachineInstr &MI,
	int Offset,			int Offset,
	bool UseGPRIdxMode,			bool UseGPRIdxMode,
	bool IsIndirectSrc) {			bool IsIndirectSrc) {
	MachineBasicBlock *MBB = MI.getParent();			MachineBasicBlock *MBB = MI.getParent();
	Show All 24 Lines
	return true;			return true;
	}			}

	// Control flow needs to be inserted if indexing with a VGPR.			// Control flow needs to be inserted if indexing with a VGPR.
	static MachineBasicBlock *emitIndirectSrc(MachineInstr &MI,			static MachineBasicBlock *emitIndirectSrc(MachineInstr &MI,
	MachineBasicBlock &MBB,			MachineBasicBlock &MBB,
	const GCNSubtarget &ST) {			const GCNSubtarget &ST) {
	const SIInstrInfo *TII = ST.getInstrInfo();			const SIInstrInfo *TII = ST.getInstrInfo();
	const SIRegisterInfo &TRI = TII->getRegisterInfo();
	MachineFunction *MF = MBB.getParent();			MachineFunction *MF = MBB.getParent();
	MachineRegisterInfo &MRI = MF->getRegInfo();			MachineRegisterInfo &MRI = MF->getRegInfo();

	unsigned Dst = MI.getOperand(0).getReg();			unsigned Dst = MI.getOperand(0).getReg();
	unsigned SrcReg = TII->getNamedOperand(MI, AMDGPU::OpName::src)->getReg();			unsigned SrcReg = TII->getNamedOperand(MI, AMDGPU::OpName::src)->getReg();
	int Offset = TII->getNamedOperand(MI, AMDGPU::OpName::offset)->getImm();			int Offset = TII->getNamedOperand(MI, AMDGPU::OpName::offset)->getImm();

	const TargetRegisterClass *VecRC = MRI.getRegClass(SrcReg);			// SunReg should always be AMDGPU::sub0. The 8-bit m0 field for the index
				// is unsigned. We can guarantee the index non-negative (if the program
	unsigned SubReg;			// itself is correct) only when we start from the very first register in
	std::tie(SubReg, Offset)			// the vector.
	= computeIndirectRegAndOffset(TRI, VecRC, SrcReg, Offset);			unsigned SubReg = AMDGPU::sub0;

	bool UseGPRIdxMode = ST.useVGPRIndexMode(EnableVGPRIndexMode);			bool UseGPRIdxMode = ST.useVGPRIndexMode(EnableVGPRIndexMode);

	if (setM0ToIndexFromSGPR(TII, MRI, MI, Offset, UseGPRIdxMode, true)) {			if (setM0ToIndexFromSGPR(TII, MRI, MI, Offset, UseGPRIdxMode, true)) {
	MachineBasicBlock::iterator I(&MI);			MachineBasicBlock::iterator I(&MI);
	const DebugLoc &DL = MI.getDebugLoc();			const DebugLoc &DL = MI.getDebugLoc();

	if (UseGPRIdxMode) {			if (UseGPRIdxMode) {
	// TODO: Look at the uses to avoid the copy. This may require rescheduling			// TODO: Look at the uses to avoid the copy. This may require rescheduling
	Show All 24 Lines
	const MachineOperand *Idx = TII->getNamedOperand(MI, AMDGPU::OpName::idx);			const MachineOperand *Idx = TII->getNamedOperand(MI, AMDGPU::OpName::idx);
	const MachineOperand *Val = TII->getNamedOperand(MI, AMDGPU::OpName::val);			const MachineOperand *Val = TII->getNamedOperand(MI, AMDGPU::OpName::val);
	int Offset = TII->getNamedOperand(MI, AMDGPU::OpName::offset)->getImm();			int Offset = TII->getNamedOperand(MI, AMDGPU::OpName::offset)->getImm();
	const TargetRegisterClass *VecRC = MRI.getRegClass(SrcVec->getReg());			const TargetRegisterClass *VecRC = MRI.getRegClass(SrcVec->getReg());

	// This can be an immediate, but will be folded later.			// This can be an immediate, but will be folded later.
	assert(Val->getReg());			assert(Val->getReg());

	unsigned SubReg;			// SunReg should always be AMDGPU::sub0. The 8-bit m0 field for the index
	std::tie(SubReg, Offset) = computeIndirectRegAndOffset(TRI, VecRC,			// is unsigned. We can guarantee the index non-negative (if the program
	SrcVec->getReg(),			// itself is correct) only when we start from the very first register in
	Offset);			// the vector.
				unsigned SubReg = AMDGPU::sub0;
	bool UseGPRIdxMode = ST.useVGPRIndexMode(EnableVGPRIndexMode);			bool UseGPRIdxMode = ST.useVGPRIndexMode(EnableVGPRIndexMode);

	if (Idx->getReg() == AMDGPU::NoRegister) {			if (Idx->getReg() == AMDGPU::NoRegister) {
	MachineBasicBlock::iterator I(&MI);			MachineBasicBlock::iterator I(&MI);
	const DebugLoc &DL = MI.getDebugLoc();			const DebugLoc &DL = MI.getDebugLoc();

	assert(Offset == 0);			assert(Offset == 0);

	Show All 12 Lines
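
For reference, the behaviour of the removed helper computeIndirectRegAndOffset can be paraphrased in standalone form. This is a sketch only: subregister indices are modelled as plain integers with sub0 == 0, the vector size is passed in directly instead of being derived from a register class, and `splitOffset` is a hypothetical name.

```cpp
#include <utility>

// Returns {subregister index, remaining offset}, with sub0 modelled as 0.
// In-bounds constant offsets are folded into the subregister index;
// out-of-bounds offsets are left in the dynamic index so that no undefined
// register is named as the base.
std::pair<int, int> splitOffset(int NumElts, int Offset) {
  if (Offset >= NumElts || Offset < 0)
    return {0, Offset};
  return {Offset, 0};
}
```

For a 16-element vector, an offset of 1 yields {1, 0} (base moved, nothing left to add), while offsets of 16 or -1 yield {0, 16} and {0, -1}. The patch replaces this split with an unconditional base of sub0, which in this model corresponds to always returning {0, Offset}.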

test/CodeGen/AMDGPU/indirect-addressing-si-pregfx9.ll

	Show All 12 Lines

	; GCN: s_mov_b64 [[MASK]], exec			; GCN: s_mov_b64 [[MASK]], exec

	; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]			; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
	; GCN: s_and_saveexec_b64 vcc, vcc			; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; MOVREL: s_add_i32 m0, [[READLANE]], 1
	; MOVREL-NEXT: v_movreld_b32_e32 v{{[0-9]+}}, 63			; MOVREL-NEXT: v_movreld_b32_e32 v{{[0-9]+}}, 63

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], dst			; IDXMODE: s_add_i32 [[IDX:s[0-9]+]], [[READLANE]], 1
				; IDXMODE: s_set_gpr_idx_on [[IDX]], dst
	; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, 63			; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, 63
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc			; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP1]]			; GCN: s_cbranch_execnz [[LOOP1]]

	; GCN: buffer_store_dwordx4 v{{\[}}[[VEC_ELT0]]:			; GCN: buffer_store_dwordx4 v{{\[}}[[VEC_ELT0]]:

	Show All 12 Lines
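
The loop that the CHECK lines above expect for a VGPR index can be simulated on the host. This is a rough sketch under assumptions: the wave is shrunk to 8 lanes, `waterfallExtract` is a hypothetical name, and the instruction names in the comments are only an approximate reading of the CHECK lines, not a cycle-accurate model.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr int NumLanes = 8; // toy wave size, chosen for illustration

// Computes Vec[Idx[lane] + Offset] for every lane by repeatedly picking the
// index of the first active lane, serving all lanes that share it, and then
// retiring those lanes. Trips receives the number of loop iterations.
std::array<int, NumLanes>
waterfallExtract(const std::vector<int> &Vec,
                 const std::array<uint32_t, NumLanes> &Idx, uint32_t Offset,
                 int &Trips) {
  std::array<int, NumLanes> Result{};
  uint32_t Exec = (1u << NumLanes) - 1; // all lanes start active
  Trips = 0;
  while (Exec != 0) { // roughly s_cbranch_execnz
    int First = 0;
    while (((Exec >> First) & 1u) == 0)
      ++First;
    uint32_t Uniform = Idx[First]; // roughly v_readfirstlane_b32

    uint32_t Match = 0; // roughly v_cmp_eq_u32 + s_and_saveexec_b64
    for (int L = 0; L < NumLanes; ++L)
      if (((Exec >> L) & 1u) != 0 && Idx[L] == Uniform)
        Match |= 1u << L;

    uint32_t M0 = Uniform + Offset; // roughly s_add_i32 m0, readlane, Offset
    for (int L = 0; L < NumLanes; ++L)
      if (((Match >> L) & 1u) != 0)
        Result[L] = Vec.at(M0); // indexed read relative to the first element

    Exec ^= Match; // roughly s_xor_b64 exec, exec, vcc
    ++Trips;
  }
  return Result;
}
```

The number of iterations equals the number of distinct index values among the active lanes, so a uniform index finishes in a single trip.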

test/CodeGen/AMDGPU/indirect-addressing-si.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MOVREL,PREGFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MOVREL,PREGFX9 %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MOVREL,PREGFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,MOVREL,PREGFX9 %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-vgpr-index-mode -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,PREGFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-vgpr-index-mode -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,PREGFX9 %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,GFX9 %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,IDXMODE,GFX9 %s

	; Tests for indirect addressing on SI, which is implemented using dynamic			; Tests for indirect addressing on SI, which is implemented using dynamic
	; indexing of vectors.			; indexing of vectors.

	; GCN-LABEL: {{^}}extract_w_offset:			; GCN-LABEL: {{^}}extract_w_offset:
	; GCN-DAG: s_load_dword [[IN:s[0-9]+]]			; GCN-DAG: s_load_dword [[IN:s[0-9]+]]
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 4.0			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 4.0
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 0x40400000			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 0x40400000
	; GCN-DAG: v_mov_b32_e32 [[BASEREG:v[0-9]+]], 2.0			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 2.0
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, 1.0			; GCN-DAG: v_mov_b32_e32 [[BASEREG:v[0-9]+]], 1.0

	; MOVREL-DAG: s_mov_b32 m0, [[IN]]			; MOVREL-DAG: s_add_i32 m0, [[IN]], 1
	; MOVREL: v_movrels_b32_e32 v{{[0-9]+}}, [[BASEREG]]			; MOVREL: v_movrels_b32_e32 v{{[0-9]+}}, [[BASEREG]]

	; IDXMODE: s_set_gpr_idx_on [[IN]], src0{{$}}			; IDXMODE-DAG: s_add_i32 [[IN1:s[0-9]+]], [[IN]], 1
				; IDXMODE: s_set_gpr_idx_on [[IN1]], src0{{$}}
	; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, [[BASEREG]]			; IDXMODE-NEXT: v_mov_b32_e32 v{{[0-9]+}}, [[BASEREG]]
	; IDXMODE-NEXT: s_set_gpr_idx_off			; IDXMODE-NEXT: s_set_gpr_idx_off
	define amdgpu_kernel void @extract_w_offset(float addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @extract_w_offset(float addrspace(1)* %out, i32 %in) {
	entry:			entry:
	%idx = add i32 %in, 1			%idx = add i32 %in, 1
	%elt = extractelement <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, i32 %idx			%elt = extractelement <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, i32 %idx
	store float %elt, float addrspace(1)* %out			store float %elt, float addrspace(1)* %out
	ret void			ret void
	}			}

	; XXX: Could do v_or_b32 directly			; XXX: Could do v_or_b32 directly
	; GCN-LABEL: {{^}}extract_w_offset_salu_use_vector:			; GCN-LABEL: {{^}}extract_w_offset_salu_use_vector:
	; MOVREL: s_mov_b32 m0			; MOVREL: s_load_dword [[IDX:s[0-9]+]]
				; MOVREL: s_add_i32 m0, [[IDX]], 1
	; GCN-DAG: s_or_b32			; GCN-DAG: s_or_b32
	; GCN-DAG: s_or_b32			; GCN-DAG: s_or_b32
	; GCN-DAG: s_or_b32			; GCN-DAG: s_or_b32
	; GCN-DAG: s_or_b32			; GCN-DAG: s_or_b32
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}			; GCN-DAG: v_mov_b32_e32 v{{[0-9]+}}, s{{[0-9]+}}
	Show All 24 Lines
	%ld = load <4 x i32>, <4 x i32> addrspace(1)* %in			%ld = load <4 x i32>, <4 x i32> addrspace(1)* %in
	%value = insertelement <4 x i32> %ld, i32 5, i32 undef			%value = insertelement <4 x i32> %ld, i32 5, i32 undef
	store <4 x i32> %value, <4 x i32> addrspace(1)* %out			store <4 x i32> %value, <4 x i32> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}insert_w_offset:			; GCN-LABEL: {{^}}insert_w_offset:
	; GCN-DAG: s_load_dword [[IN:s[0-9]+]]			; GCN-DAG: s_load_dword [[IN:s[0-9]+]]
	; MOVREL-DAG: s_mov_b32 m0, [[IN]]			; MOVREL-DAG: s_add_i32 m0, [[IN]], 1
	; GCN-DAG: v_mov_b32_e32 v[[ELT0:[0-9]+]], 1.0			; GCN-DAG: v_mov_b32_e32 v[[ELT0:[0-9]+]], 1.0
	; GCN-DAG: v_mov_b32_e32 v[[ELT1:[0-9]+]], 2.0			; GCN-DAG: v_mov_b32_e32 v[[ELT1:[0-9]+]], 2.0
	; GCN-DAG: v_mov_b32_e32 v[[ELT2:[0-9]+]], 0x40400000			; GCN-DAG: v_mov_b32_e32 v[[ELT2:[0-9]+]], 0x40400000
	; GCN-DAG: v_mov_b32_e32 v[[ELT3:[0-9]+]], 4.0			; GCN-DAG: v_mov_b32_e32 v[[ELT3:[0-9]+]], 4.0
	; GCN-DAG: v_mov_b32_e32 v[[ELT15:[0-9]+]], 0x41800000			; GCN-DAG: v_mov_b32_e32 v[[ELT15:[0-9]+]], 0x41800000
	; GCN-DAG: v_mov_b32_e32 v[[INS:[0-9]+]], 0x41880000			; GCN-DAG: v_mov_b32_e32 v[[INS:[0-9]+]], 0x41880000

	; MOVREL: v_movreld_b32_e32 v[[ELT1]], v[[INS]]			; MOVREL: v_movreld_b32_e32 v[[ELT0]], v[[INS]]
	; MOVREL: buffer_store_dwordx4 v{{\[}}[[ELT0]]:[[ELT3]]{{\]}}			; MOVREL: buffer_store_dwordx4 v{{\[}}[[ELT0]]:[[ELT3]]{{\]}}
	define amdgpu_kernel void @insert_w_offset(<16 x float> addrspace(1)* %out, i32 %in) {			define amdgpu_kernel void @insert_w_offset(<16 x float> addrspace(1)* %out, i32 %in) {
	entry:			entry:
	%add = add i32 %in, 1			%add = add i32 %in, 1
	%ins = insertelement <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, float 17.0, i32 %add			%ins = insertelement <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, float 17.0, i32 %add
	store <16 x float> %ins, <16 x float> addrspace(1)* %out			store <16 x float> %ins, <16 x float> addrspace(1)* %out
	ret void			ret void
	}			}
	Show All 24 Lines
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc			; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN-NEXT: s_cbranch_execnz [[LOOP0]]			; GCN-NEXT: s_cbranch_execnz [[LOOP0]]

	; FIXME: Redundant copy			; FIXME: Redundant copy
	; GCN: s_mov_b64 exec, [[MASK]]			; GCN: s_mov_b64 exec, [[MASK]]

	; GCN: v_mov_b32_e32 [[VEC_ELT1_2:v[0-9]+]], [[S_ELT1]]			; GCN: v_mov_b32_e32 [[VEC_ELT0_2:v[0-9]+]], [[S_ELT0]]

	; GCN: s_mov_b64 [[MASK2:s\[[0-9]+:[0-9]+\]]], exec			; GCN: s_mov_b64 [[MASK2:s\[[0-9]+:[0-9]+\]]], exec

	; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:			; GCN: [[LOOP1:BB[0-9]+_[0-9]+]]:
	; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]			; GCN-NEXT: v_readfirstlane_b32 [[READLANE:s[0-9]+]], [[IDX0]]
	; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]			; GCN: v_cmp_eq_u32_e32 vcc, [[READLANE]], [[IDX0]]
	; GCN: s_and_saveexec_b64 vcc, vcc			; GCN: s_and_saveexec_b64 vcc, vcc

	; MOVREL: s_mov_b32 m0, [[READLANE]]			; MOVREL: s_add_i32 m0, [[READLANE]], 1
	; MOVREL-NEXT: v_movrels_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT1_2]]			; MOVREL-NEXT: v_movrels_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT0_2]]

	; IDXMODE: s_set_gpr_idx_on [[READLANE]], src0			; IDXMODE: s_add_i32 [[INDEX:s[0-9]+]], [[READLANE]], 1
	; IDXMODE-NEXT: v_mov_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT1_2]]			; IDXMODE: s_set_gpr_idx_on [[INDEX]], src0
				; IDXMODE-NEXT: v_mov_b32_e32 [[MOVREL1:v[0-9]+]], [[VEC_ELT0_2]]
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off

	; GCN-NEXT: s_xor_b64 exec, exec, vcc			; GCN-NEXT: s_xor_b64 exec, exec, vcc
	; GCN: s_cbranch_execnz [[LOOP1]]			; GCN: s_cbranch_execnz [[LOOP1]]

	; GCN: buffer_store_dword [[MOVREL0]]			; GCN: buffer_store_dword [[MOVREL0]]
	; GCN: buffer_store_dword [[MOVREL1]]			; GCN: buffer_store_dword [[MOVREL1]]
	define amdgpu_kernel void @extract_vgpr_offset_multiple_in_block(i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, i32 addrspace(1)* %in) #0 {			define amdgpu_kernel void @extract_vgpr_offset_multiple_in_block(i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, i32 addrspace(1)* %in) #0 {
	Show All 24 Lines
	%tmp8 = extractelement <9 x i32> %tmp7, i32 5			%tmp8 = extractelement <9 x i32> %tmp7, i32 5
	store volatile i32 %tmp6, i32 addrspace(3)* undef, align 4			store volatile i32 %tmp6, i32 addrspace(3)* undef, align 4
	store volatile i32 %tmp8, i32 addrspace(3)* undef, align 4			store volatile i32 %tmp8, i32 addrspace(3)* undef, align 4
	ret void			ret void
	}			}

	; offset puts outside of superegister bounaries, so clamp to 1st element.			; offset puts outside of superegister bounaries, so clamp to 1st element.
	; GCN-LABEL: {{^}}extract_largest_inbounds_offset:			; GCN-LABEL: {{^}}extract_largest_inbounds_offset:
	; GCN-DAG: buffer_load_dwordx4 v{{\[}}[[LO_ELT:[0-9]+]]:[[HI_ELT:[0-9]+]]{{\].* offset:48}}			; GCN-DAG: buffer_load_dwordx4 v{{\[}}[[LO_ELT:[0-9]+]]:[[HI_ELT:[0-9]+]]
	; GCN-DAG: s_load_dword [[IDX:s[0-9]+]]			; GCN-DAG: s_load_dword [[IDX0:s[0-9]+]]
	; MOVREL: s_mov_b32 m0, [[IDX]]
	; MOVREL: v_movrels_b32_e32 [[EXTRACT:v[0-9]+]], v[[HI_ELT]]

				; MOVREL: s_add_i32 m0, [[IDX0]], 15
				; MOVREL: v_movrels_b32_e32 [[EXTRACT:v[0-9]+]], v[[LO_ELT]]

				; IDXMODE: s_add_i32 [[IDX:s[0-9]+]], [[IDX0]], 15
	; IDXMODE: s_set_gpr_idx_on [[IDX]], src0			; IDXMODE: s_set_gpr_idx_on [[IDX]], src0
	; IDXMODE: v_mov_b32_e32 [[EXTRACT:v[0-9]+]], v[[HI_ELT]]			; IDXMODE: v_mov_b32_e32 [[EXTRACT:v[0-9]+]], v[[LO_ELT]]
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off

	; GCN: buffer_store_dword [[EXTRACT]]			; GCN: buffer_store_dword [[EXTRACT]]
	define amdgpu_kernel void @extract_largest_inbounds_offset(i32 addrspace(1)* %out, <16 x i32> addrspace(1)* %in, i32 %idx) {			define amdgpu_kernel void @extract_largest_inbounds_offset(i32 addrspace(1)* %out, <16 x i32> addrspace(1)* %in, i32 %idx) {
	entry:			entry:
	%ld = load volatile <16 x i32>, <16 x i32> addrspace(1)* %in			%ld = load volatile <16 x i32>, <16 x i32> addrspace(1)* %in
	%offset = add i32 %idx, 15			%offset = add i32 %idx, 15
	%value = extractelement <16 x i32> %ld, i32 %offset			%value = extractelement <16 x i32> %ld, i32 %offset
	Show All 23 Lines
	}			}

	; Test that the or is folded into the base address register instead of			; Test that the or is folded into the base address register instead of
	; added to m0			; added to m0

	; GCN-LABEL: {{^}}extractelement_v16i32_or_index:			; GCN-LABEL: {{^}}extractelement_v16i32_or_index:
	; GCN: s_load_dword [[IDX_IN:s[0-9]+]]			; GCN: s_load_dword [[IDX_IN:s[0-9]+]]
	; GCN: s_lshl_b32 [[IDX_SHL:s[0-9]+]], [[IDX_IN]]			; GCN: s_lshl_b32 [[IDX_SHL:s[0-9]+]], [[IDX_IN]]
	; GCN-NOT: [[IDX_SHL]]

	; MOVREL: s_mov_b32 m0, [[IDX_SHL]]			; MOVREL: s_add_i32 m0, [[IDX_SHL]], 1
	; MOVREL: v_movrels_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}			; MOVREL: v_movrels_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}

	; IDXMODE: s_set_gpr_idx_on [[IDX_SHL]], src0			; IDXMODE: s_add_i32 [[IDX_FIN:s[0-9]+]], [[IDX_SHL]], 1
				; IDXMODE: s_set_gpr_idx_on [[IDX_FIN]], src0
	; IDXMODE: v_mov_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}			; IDXMODE: v_mov_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off
	define amdgpu_kernel void @extractelement_v16i32_or_index(i32 addrspace(1)* %out, <16 x i32> addrspace(1)* %in, i32 %idx.in) {			define amdgpu_kernel void @extractelement_v16i32_or_index(i32 addrspace(1)* %out, <16 x i32> addrspace(1)* %in, i32 %idx.in) {
	entry:			entry:
	%ld = load volatile <16 x i32>, <16 x i32> addrspace(1)* %in			%ld = load volatile <16 x i32>, <16 x i32> addrspace(1)* %in
	%idx.shl = shl i32 %idx.in, 2			%idx.shl = shl i32 %idx.in, 2
	%idx = or i32 %idx.shl, 1			%idx = or i32 %idx.shl, 1
	%value = extractelement <16 x i32> %ld, i32 %idx			%value = extractelement <16 x i32> %ld, i32 %idx
	store i32 %value, i32 addrspace(1)* %out			store i32 %value, i32 addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}insertelement_v16f32_or_index:			; GCN-LABEL: {{^}}insertelement_v16f32_or_index:
	; GCN: s_load_dword [[IDX_IN:s[0-9]+]]			; GCN: s_load_dword [[IDX_IN:s[0-9]+]]
	; GCN: s_lshl_b32 [[IDX_SHL:s[0-9]+]], [[IDX_IN]]			; GCN: s_lshl_b32 [[IDX_SHL:s[0-9]+]], [[IDX_IN]]
	; GCN-NOT: [[IDX_SHL]]

	; MOVREL: s_mov_b32 m0, [[IDX_SHL]]			; MOVREL: s_add_i32 m0, [[IDX_SHL]], 1
	; MOVREL: v_movreld_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}			; MOVREL: v_movreld_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}

	; IDXMODE: s_set_gpr_idx_on [[IDX_SHL]], dst			; IDXMODE: s_add_i32 [[IDX_FIN:s[0-9]+]], [[IDX_SHL]], 1
				; IDXMODE: s_set_gpr_idx_on [[IDX_FIN]], dst
	; IDXMODE: v_mov_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}			; IDXMODE: v_mov_b32_e32 v{{[0-9]+}}, v{{[0-9]+}}
	; IDXMODE: s_set_gpr_idx_off			; IDXMODE: s_set_gpr_idx_off
	define amdgpu_kernel void @insertelement_v16f32_or_index(<16 x float> addrspace(1)* %out, <16 x float> %a, i32 %idx.in) nounwind {			define amdgpu_kernel void @insertelement_v16f32_or_index(<16 x float> addrspace(1)* %out, <16 x float> %a, i32 %idx.in) nounwind {
	%idx.shl = shl i32 %idx.in, 2			%idx.shl = shl i32 %idx.in, 2
	%idx = or i32 %idx.shl, 1			%idx = or i32 %idx.shl, 1
	%vecins = insertelement <16 x float> %a, float 5.000000e+00, i32 %idx			%vecins = insertelement <16 x float> %a, float 5.000000e+00, i32 %idx
	store <16 x float> %vecins, <16 x float> addrspace(1)* %out, align 64			store <16 x float> %vecins, <16 x float> addrspace(1)* %out, align 64
	ret void			ret void
	Show All 12 Lines