This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Do not combine loads/store across physreg defs
ClosedPublic

Authored by nhaehnle on Nov 22 2017, 4:20 AM.

Details

Reviewers

arsenm
mareko
rampitec

Commits

rG770397f4cdcf: AMDGPU: Do not combine loads/store across physreg defs
rL325677: AMDGPU: Do not combine loads/store across physreg defs

Summary

Since this pass operates on machine SSA form, this should only really
affect M0 in practice.

Fixes various piglit variable-indexing/vs-varying-array-mat4-index-*

Change-Id: Ib2a1dc3a8d7b08225a8da49a86f533faa0986aa8
Fixes: r317751 ("AMDGPU: Merge S_BUFFER_LOAD_DWORD_IMM into x2, x4")

Diff Detail

Repository

rL LLVM

Build Status

Buildable 14335
Build 14335: arc lint + arc unit

Event Timeline

nhaehnle created this revision. Nov 22 2017, 4:20 AM

Harbormaster completed remote builds in B12395: Diff 123905. Nov 22 2017, 4:20 AM

Herald added subscribers: t-tye, tpr, dstuttard and 3 others. Nov 22 2017, 4:20 AM

nhaehnle added a parent revision: D40342: AMDGPU: Consistently check for immediates in SIInstrInfo::FoldImmediate. Nov 22 2017, 4:21 AM

As far as I understand, that is only a concern if the defined register is M0, since it is read by the ds_* instructions. I.e., it would be better to check for M0, not just any physreg.
Also, this should not be a concern on GFX9, since there we have LDS instructions which do not read M0 (check Subtarget->ldsRequiresM0Init()).

arsenm added inline comments. Nov 27 2017, 10:14 AM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
365	There could also be a call instruction, and the function may modify m0 or anything else
test/CodeGen/AMDGPU/llvm.amdgcn.interp.ll
185	Can you add a test with a no inline call to make sure that works correctly?

In D40343#933121, @rampitec wrote:

As far as I understand, that is only a concern if the defined register is M0, since it is read by the ds_* instructions. I.e., it would be better to check for M0, not just any physreg.
Also, this should not be a concern on GFX9, since there we have LDS instructions which do not read M0 (check Subtarget->ldsRequiresM0Init()).

The LDS use here is a bit of a red herring; it's really not about that. The original case where I found the bug had no "proper" LDS instructions at all, only v_interp, and the merged memory instructions were buffer instructions. You could probably construct a similar case also with e.g. only relative indexing with different indices, or with relative indexing and s_sendmsg (which uses M0).

The real problem is that the pass assumes machine-SSA form, and this assumption is broken with physreg-defs. Here's what happens:

%vreg0 = mem-instruction-1
M0 = def-1
use(%vreg0, M0)
M0 = def-2
...
mem-instruction-2

Without this fix, this gets changed to:

M0 = def-1
M0 = def-2
...
%vreg0, ... = merged-mem-instruction
use(%vreg0, M0)

So the %vreg0-use gets moved (because it depends on mem-instruction-1), without regard for the fact that the instruction reads a register that is later overwritten by a different value.

This possible write-after-read hazard on registers is something that the pass simply hasn't tracked before, because for virtual registers in machine-SSA it's unnecessary. It only becomes necessary with physical registers; in practice we don't have many of those at this stage in the flow, but M0 is not a special case.
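To make the hazard concrete, here is a minimal C++ sketch that replays both instruction orders from the example above and records which M0 value the use observes. It is a toy model, not LLVM code: M0Inst, Kind, observedM0, beforeMerge, afterBuggyMerge and checkM0Replay are hypothetical names, and the values 1 and 2 simply stand in for def-1 and def-2.

```cpp
#include <cassert>
#include <vector>

// Toy instruction kinds: a memory op, a write to M0, and a user that reads M0.
enum class Kind { Mem, SetM0, UseM0 };

struct M0Inst {
  Kind K;
  int Value; // value written to M0; only meaningful for Kind::SetM0
};

// Replays a sequence and returns the M0 value each UseM0 instruction observes.
std::vector<int> observedM0(const std::vector<M0Inst> &Seq) {
  int M0 = 0;
  std::vector<int> Seen;
  for (const M0Inst &I : Seq) {
    if (I.K == Kind::SetM0)
      M0 = I.Value;
    else if (I.K == Kind::UseM0)
      Seen.push_back(M0);
  }
  return Seen;
}

// Original order: mem-1, M0 = def-1, use, M0 = def-2, mem-2.
std::vector<M0Inst> beforeMerge() {
  return {{Kind::Mem, 0}, {Kind::SetM0, 1}, {Kind::UseM0, 0},
          {Kind::SetM0, 2}, {Kind::Mem, 0}};
}

// Order after the unfixed merge: both defs, merged mem op, then the use.
std::vector<M0Inst> afterBuggyMerge() {
  return {{Kind::SetM0, 1}, {Kind::SetM0, 2}, {Kind::Mem, 0},
          {Kind::UseM0, 0}};
}

void checkM0Replay() {
  // In the original order the use sees the value written by def-1.
  assert(observedM0(beforeMerge()) == std::vector<int>{1});
  // After the unfixed merge it sees the value written by def-2 instead.
  assert(observedM0(afterBuggyMerge()) == std::vector<int>{2});
}
```

Calling checkM0Replay() exercises both orders. The use is the same instruction in both sequences; only its position relative to the second M0 write changes, which is enough to change the value it reads.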

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
365	To clarify, this means that we should bail out entirely when we encounter a call instruction, right?

I see. But my point is we can easily get the situation:

m0 = def-1
ds_read
m0 = def-2
ds_read

It is worth noting that the defs are equal in our case. Unless we can prove that the defs are equal, we are unable to combine those two ds_reads. With your patch we will likewise be unable to combine them.
That is fine unless you consider the GFX9 versions of ds_read, which do not implicitly use M0. We should still be able to combine those, but we will not after this patch.

So what I mean is: do not just bail out on any physreg def, but check whether the register is actually used by any of the instructions you are going to move across that def.
Does that make sense to you?
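The difference between the two rules can be sketched in a few lines of self-contained C++. This is only an illustration under stated assumptions, not the pass's code and not necessarily what a follow-up change would implement: ToyInst, isPhysReg, definesAnyPhysReg, physRegUses, clobbersMovedUse and checkRefinedRule are hypothetical names, and registers are modeled as strings where a leading '%' marks a virtual register.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy instruction: the register names it writes and reads. In this sketch a
// name starting with '%' is a virtual register; anything else is physical.
struct ToyInst {
  std::string Name;
  std::vector<std::string> Defs;
  std::vector<std::string> Uses;
};

static bool isPhysReg(const std::string &Reg) {
  return !Reg.empty() && Reg[0] != '%';
}

// Conservative rule: any physreg def stops the scan.
bool definesAnyPhysReg(const ToyInst &I) {
  for (const std::string &Def : I.Defs)
    if (isPhysReg(Def))
      return true;
  return false;
}

// Collects the physical registers read by the instructions that would be
// moved across the def.
std::set<std::string> physRegUses(const std::vector<ToyInst> &ToMove) {
  std::set<std::string> Regs;
  for (const ToyInst &I : ToMove)
    for (const std::string &Use : I.Uses)
      if (isPhysReg(Use))
        Regs.insert(Use);
  return Regs;
}

// Refined rule: a physreg def only blocks the merge if one of the
// instructions to be moved actually reads that register.
bool clobbersMovedUse(const ToyInst &I,
                      const std::set<std::string> &MovedUses) {
  for (const std::string &Def : I.Defs)
    if (isPhysReg(Def) && MovedUses.count(Def))
      return true;
  return false;
}

void checkRefinedRule() {
  ToyInst SetM0{"m0 = def-2", {"M0"}, {}};
  ToyInst ReadNoM0{"ds_read without M0 use", {"%v0"}, {"%addr"}};
  ToyInst ReadM0{"ds_read with M0 use", {"%v0"}, {"%addr", "M0"}};

  assert(definesAnyPhysReg(SetM0));                          // conservative: bail
  assert(!clobbersMovedUse(SetM0, physRegUses({ReadNoM0}))); // refined: may merge
  assert(clobbersMovedUse(SetM0, physRegUses({ReadM0})));    // refined: bail
}
```

The conservative rule needs no bookkeeping, which fits a minimal bug fix. The refined rule has to collect the physreg uses of everything that moves, including the first memory instruction itself, before it can let the scan continue past a def.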

This pass should probably ignore Subtarget->ldsRequiresM0Init(). The instructions are selected to a set without m0, so this should be able to just check the register uses.

In D40343#938031, @arsenm wrote:

This pass should probably ignore Subtarget->ldsRequiresM0Init(). The instructions are selected to a set without m0, so this should be able to just check the register uses.

Right, I do not propose to check for a feature. I propose to check for actual register uses.

arsenm added inline comments. Nov 28 2017, 12:14 PM

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
365	I think this would just fall out naturally from physreg defs (although I guess you could have a void call that doesn't modify any visible registers), but you might need a call check

We need this fix in LLVM 6.0, right?

Add a non-inline call test.

I would like to commit this as-is, since it's a bug fix. I have a change which tracks def-use as well, I'll upload it in a moment as a related revision, and would like to commit it separately.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
365	I've added a test case to ds_read2.ll. It passes without modifications to the code.

nhaehnle added a child revision: D42647: AMDGPU: Track physreg uses in SILoadStoreOptimizer. Jan 29 2018, 8:30 AM

mareko accepted this revision. Jan 29 2018, 2:26 PM

This revision is now accepted and ready to land. Jan 29 2018, 2:26 PM

Closed by commit rL325677: AMDGPU: Do not combine loads/store across physreg defs (authored by nha). Feb 21 2018, 5:34 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                          Size
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp    20 lines
test/CodeGen/AMDGPU/ds_read2.ll               19 lines
test/CodeGen/AMDGPU/llvm.amdgcn.interp.ll     34 lines
test/CodeGen/AMDGPU/smrd.ll                   45 lines

Diff 131807

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp

@@ 221 lines not shown @@ for (MachineInstr *InstToMove : InstsToMove) {
     if (!InstToMove->mayLoadOrStore())
       continue;
     if (!memAccessesCanBeReordered(MemOp, *InstToMove, TII, AA))
       return false;
   }
   return true;
 }

+static bool
+hasPhysRegDef(MachineInstr &MI) {
+  for (const MachineOperand &Def : MI.defs()) {
+    if (Def.isReg() &&
+        TargetRegisterInfo::isPhysicalRegister(Def.getReg()))
+      return true;
+  }
+  return false;
+}
+
 bool SILoadStoreOptimizer::offsetsCanBeCombined(CombineInfo &CI) {
   // XXX - Would the same offset be OK? Is there any reason this would happen or
   // be useful?
   if (CI.Offset0 == CI.Offset1)
     return false;

   // This won't be valid if the offset isn't aligned.
   if ((CI.Offset0 % CI.EltSize != 0) || (CI.Offset1 % CI.EltSize != 0))
@@ 106 lines not shown @@ if (MBBI->getOpcode() != CI.I->getOpcode()) {
       // be merged into.

       if (MBBI->hasUnmodeledSideEffects()) {
         // We can't re-order this instruction with respect to other memory
         // operations, so we fail both conditions mentioned above.
         return false;
       }

+      if (hasPhysRegDef(*MBBI)) {
+        // We could re-order this instruction in theory, but it would require
+        // tracking physreg defs and uses. This should only affect M0 in
+        // practice.
+        return false;
+      }
+
       if (MBBI->mayLoadOrStore() &&
           (!memAccessesCanBeReordered(CI.I, MBBI, TII, AA) ||
            !canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))) {
         // We fail condition #1, but we may still be able to satisfy condition
         // #2. Add this instruction to the move list and then we will check
         // if condition #2 holds once we have selected the matching instruction.
         CI.InstsToMove.push_back(&*MBBI);
         addDefsToList(*MBBI, DefsToMove);
@@ 71 lines not shown @@ for ( ; MBBI != E; ++MBBI) {
     }

     // We've found a load/store that we couldn't merge for some reason.
     // We could potentially keep looking, but we'd need to make sure that
     // it was safe to move I and also all the instruction in InstsToMove
     // down past this instruction.
     // check if we can move I across MBBI and if we can move all I's users
     if (!memAccessesCanBeReordered(CI.I, MBBI, TII, AA) ||
-        !canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA))
+        !canMoveInstsAcrossMemOp(*MBBI, CI.InstsToMove, TII, AA) ||
+        hasPhysRegDef(*MBBI))
       break;
   }
   return false;
 }

 unsigned SILoadStoreOptimizer::read2Opcode(unsigned EltSize) const {
   if (STM->ldsRequiresM0Init())
     return (EltSize == 4) ? AMDGPU::DS_READ2_B32 : AMDGPU::DS_READ2_B64;
@@ 502 lines not shown @@

test/CodeGen/AMDGPU/ds_read2.ll

@@ 607 lines not shown @@ bb:
   %tmp27 = load float, float addrspace(3)* %tmp13
   %tmp28 = load float, float addrspace(3)* %tmp14
   %tmp29 = fmul float %tmp27, %tmp28
   %tmp30 = fsub float %tmp26, %tmp29
   store float %tmp30, float addrspace(1)* %tmp
   ret void
 }

+; GCN-LABEL: ds_read_call_read:
+; GCN: ds_read_b32
+; GCN: s_swappc_b64
+; GCN: ds_read_b32
+define amdgpu_kernel void @ds_read_call_read(i32 addrspace(1)* %out, i32 addrspace(3)* %arg) {
+  %x = call i32 @llvm.amdgcn.workitem.id.x()
+  %arrayidx0 = getelementptr i32, i32 addrspace(3)* %arg, i32 %x
+  %arrayidx1 = getelementptr i32, i32 addrspace(3)* %arrayidx0, i32 1
+  %v0 = load i32, i32 addrspace(3)* %arrayidx0, align 4
+  call void @void_func_void()
+  %v1 = load i32, i32 addrspace(3)* %arrayidx1, align 4
+  %r = add i32 %v0, %v1
+  store i32 %r, i32 addrspace(1)* %out, align 4
+  ret void
+}
+
+declare void @void_func_void() #3
+
 declare i32 @llvm.amdgcn.workgroup.id.x() #1
 declare i32 @llvm.amdgcn.workgroup.id.y() #1
 declare i32 @llvm.amdgcn.workitem.id.x() #1
 declare i32 @llvm.amdgcn.workitem.id.y() #1

 declare void @llvm.amdgcn.s.barrier() #2

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone speculatable }
 attributes #2 = { convergent nounwind }
+attributes #3 = { nounwind noinline }

test/CodeGen/AMDGPU/llvm.amdgcn.interp.ll

@@ 154 lines not shown @@ bb:
   store volatile float %mov_10, float addrspace(1)* undef
   store volatile float %mov_11, float addrspace(1)* undef
   store volatile float %mov_12, float addrspace(1)* undef
   ret void
 }

 ; SI won't merge ds memory operations, because of the signed offset bug, so
 ; we only have check lines for VI.
-; VI-LABEL: v_interp_readnone:
-; VI: s_mov_b32 m0, 0
-; VI-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
-; VI-DAG: v_interp_mov_f32 v{{[0-9]+}}, p0, attr0.x{{$}}
-; VI: s_mov_b32 m0, -1{{$}}
-; VI: ds_write2_b32 v{{[0-9]+}}, [[ZERO]], [[ZERO]] offset1:4
-define amdgpu_ps void @v_interp_readnone(float addrspace(3)* %lds) #0 {
-bb:
-  store float 0.000000e+00, float addrspace(3)* %lds
-  %tmp1 = call float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 0)
-  %tmp2 = getelementptr float, float addrspace(3)* %lds, i32 4
-  store float 0.000000e+00, float addrspace(3)* %tmp2
-  call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp1, float %tmp1, float %tmp1, float %tmp1, i1 true, i1 true) #0
-  ret void
-}
+;
+; TODO: VI won't merge them either, because we are conservative about moving
+; instructions past changes to physregs.
+;
+; TODO-VI-LABEL: v_interp_readnone:
+; TODO-VI: s_mov_b32 m0, 0
+; TODO-VI-DAG: v_mov_b32_e32 [[ZERO:v[0-9]+]], 0
+; TODO-VI-DAG: v_interp_mov_f32 v{{[0-9]+}}, p0, attr0.x{{$}}
+; TODO-VI: s_mov_b32 m0, -1{{$}}
+; TODO-VI: ds_write2_b32 v{{[0-9]+}}, [[ZERO]], [[ZERO]] offset1:4
+;define amdgpu_ps void @v_interp_readnone(float addrspace(3)* %lds) #0 {
+;bb:
+; store float 0.000000e+00, float addrspace(3)* %lds
+; %tmp1 = call float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 0)
+; %tmp2 = getelementptr float, float addrspace(3)* %lds, i32 4
+; store float 0.000000e+00, float addrspace(3)* %tmp2
+; call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %tmp1, float %tmp1, float %tmp1, float %tmp1, i1 true, i1 true) #0
+; ret void
+;}

 ; Thest that v_interp_p1 uses different source and destination registers
 ; on 16 bank LDS chips.

 ; GCN-LABEL: {{^}}v_interp_p1_bank16_bug:
 ; 16BANK-NOT: v_interp_p1_f32 [[DST:v[0-9]+]], [[DST]]
 define amdgpu_ps void @v_interp_p1_bank16_bug([6 x <16 x i8>] addrspace(2)* byval %arg, [17 x <16 x i8>] addrspace(2)* byval %arg13, [17 x <4 x i32>] addrspace(2)* byval %arg14, [34 x <8 x i32>] addrspace(2)* byval %arg15, float inreg %arg16, i32 inreg %arg17, <2 x i32> %arg18, <2 x i32> %arg19, <2 x i32> %arg20, <3 x i32> %arg21, <2 x i32> %arg22, <2 x i32> %arg23, <2 x i32> %arg24, float %arg25, float %arg26, float %arg27, float %arg28, float %arg29, float %arg30, i32 %arg31, float %arg32, float %arg33) #0 {
 main_body:
   %i.i = extractelement <2 x i32> %arg19, i32 0
   %j.i = extractelement <2 x i32> %arg19, i32 1
   %i.f.i = bitcast i32 %i.i to float
   %j.f.i = bitcast i32 %j.i to float
@@ 33 lines not shown @@

test/CodeGen/AMDGPU/smrd.ll

@@ 236 lines not shown @@ main_body:
   %r4 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 16)
   %r5 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 28)
   %r6 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 32)
   call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true) #0
   call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r5, float %r6, float undef, float undef, i1 true, i1 true) #0
   ret void
 }

+; GCN-LABEL: {{^}}smrd_imm_nomerge_m0:
+;
+; In principle we could merge the loads here as well, but it would require
+; careful tracking of physical registers since both v_interp* and v_movrel*
+; instructions (or gpr idx mode) use M0.
+;
+; GCN: s_buffer_load_dword
+; GCN: s_buffer_load_dword
+define amdgpu_ps float @smrd_imm_nomerge_m0(<4 x i32> inreg %desc, i32 inreg %prim, float %u, float %v) #0 {
+main_body:
+  %idx1.f = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 0)
+  %idx1 = bitcast float %idx1.f to i32
+
+  %v0.x1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 0, i32 0, i32 %prim)
+  %v0.x = call nsz float @llvm.amdgcn.interp.p2(float %v0.x1, float %v, i32 0, i32 0, i32 %prim)
+  %v0.y1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 0, i32 1, i32 %prim)
+  %v0.y = call nsz float @llvm.amdgcn.interp.p2(float %v0.y1, float %v, i32 0, i32 1, i32 %prim)
+  %v0.z1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 0, i32 2, i32 %prim)
+  %v0.z = call nsz float @llvm.amdgcn.interp.p2(float %v0.z1, float %v, i32 0, i32 2, i32 %prim)
+  %v0.tmp0 = insertelement <3 x float> undef, float %v0.x, i32 0
+  %v0.tmp1 = insertelement <3 x float> %v0.tmp0, float %v0.y, i32 1
+  %v0 = insertelement <3 x float> %v0.tmp1, float %v0.z, i32 2
+  %a = extractelement <3 x float> %v0, i32 %idx1
+
+  %v1.x1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 1, i32 0, i32 %prim)
+  %v1.x = call nsz float @llvm.amdgcn.interp.p2(float %v1.x1, float %v, i32 1, i32 0, i32 %prim)
+  %v1.y1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 1, i32 1, i32 %prim)
+  %v1.y = call nsz float @llvm.amdgcn.interp.p2(float %v1.y1, float %v, i32 1, i32 1, i32 %prim)
+  %v1.z1 = call nsz float @llvm.amdgcn.interp.p1(float %u, i32 1, i32 2, i32 %prim)
+  %v1.z = call nsz float @llvm.amdgcn.interp.p2(float %v1.z1, float %v, i32 1, i32 2, i32 %prim)
+  %v1.tmp0 = insertelement <3 x float> undef, float %v0.x, i32 0
+  %v1.tmp1 = insertelement <3 x float> %v0.tmp0, float %v0.y, i32 1
+  %v1 = insertelement <3 x float> %v0.tmp1, float %v0.z, i32 2
+
+  %b = extractelement <3 x float> %v1, i32 %idx1
+  %c = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 4)
+
+  %res.tmp = fadd float %a, %b
+  %res = fadd float %res.tmp, %c
+  ret float %res
+}
+
 ; GCN-LABEL: {{^}}smrd_vgpr_merged:
 ; GCN-NEXT: %bb.

 ; SICIVI-NEXT: buffer_load_dwordx4 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:4
 ; SICIVI-NEXT: buffer_load_dwordx2 v[{{[0-9]}}:{{[0-9]}}], v0, s[0:3], 0 offen offset:28

 ; GFX9: buffer_load_dword
 ; GFX9: buffer_load_dword
@@ 17 lines not shown @@ main_body:
   %r6 = call float @llvm.SI.load.const.v4i32(<4 x i32> %desc, i32 %a6)
   call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r1, float %r2, float %r3, float %r4, i1 true, i1 true) #0
   call void @llvm.amdgcn.exp.f32(i32 0, i32 15, float %r5, float %r6, float undef, float undef, i1 true, i1 true) #0
   ret void
 }

 declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #0
 declare float @llvm.SI.load.const.v4i32(<4 x i32>, i32) #1
+declare float @llvm.amdgcn.interp.p1(float, i32, i32, i32) #2
+declare float @llvm.amdgcn.interp.p2(float, float, i32, i32, i32) #2

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone }
+attributes #2 = { nounwind readnone speculatable }