This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][ISEL] Directly custom lower INSERT_SUBVECTOR instead of via INSERT_VECTOR_ELT
Abandoned · Public

Authored by hsmhsm on Apr 28 2022, 10:45 PM.


Details

Reviewers

arsenm
foad
rampitec
tpr

Summary

When the target vector size is <= 64 bits, we can custom lower INSERT_SUBVECTOR directly
instead of routing it through INSERT_VECTOR_ELT.

Custom lowering of INSERT_SUBVECTOR for target vectors larger than 64 bits does not
actually happen at the moment. We need to handle it in a generic way for all allowed
vector sizes. This patch is a preparatory step toward that.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,030 ms	x64 debian > libFuzzer.libFuzzer::fuzzer-leak.test
	60,140 ms	x64 debian > libFuzzer.libFuzzer::large.test
	60,030 ms	x64 debian > libFuzzer.libFuzzer::out-of-process-fuzz.test
	60,020 ms	x64 debian > libFuzzer.libFuzzer::value-profile-load.test

Event Timeline

hsmhsm created this revision. Apr 28 2022, 10:45 PM

Herald added a project: Restricted Project. Apr 28 2022, 10:45 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 6 others.

hsmhsm requested review of this revision. Apr 28 2022, 10:45 PM

Herald added a project: Restricted Project. Apr 28 2022, 10:45 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B161927: Diff 425974. Apr 28 2022, 11:51 PM

hsmhsm edited the summary of this revision. Apr 29 2022, 1:08 AM

hsmhsm added a reviewer: tpr.

foad added inline comments. Apr 29 2022, 1:33 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801	I don't understand this. CurIdx is a ConstantSDNode here, so InsertVecElt will not do anything useful - it just returns SDValue().
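
The problem raised in this comment can be illustrated with a small stand-alone model. This is only a sketch, not LLVM code: `Index`, `insertVecEltModel`, and `insertSubvectorModel` are hypothetical names, a `std::optional<uint32_t>` stands in for an `SDValue` that may be null, and a `<2 x i16>` vector is modeled as a packed 32-bit integer. Because every index built in the subvector loop is a constant, the helper takes its early exit each time and the caller ends up holding an empty result.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for an index operand: either a compile-time constant
// or a value that is only known at run time.
struct Index {
  bool isConstant;
  uint32_t value;
};

// Toy analogue of the patch's helper: it bails out with an empty result for
// constant indices and only does real work for dynamic ones.
std::optional<uint32_t> insertVecEltModel(uint32_t vec, uint16_t val,
                                          Index idx) {
  if (idx.isConstant)
    return std::nullopt; // mirrors `return SDValue();`
  assert(idx.value < 2 && "this model only covers two 16-bit lanes");
  uint32_t shift = idx.value * 16;
  uint32_t mask = 0xffffu << shift;
  return (vec & ~mask) | (uint32_t(val) << shift);
}

// Toy analogue of the subvector loop: the indices it passes are always
// constants, so the helper never produces a value.
std::optional<uint32_t> insertSubvectorModel(uint32_t vec, uint16_t lo,
                                             uint16_t hi) {
  std::optional<uint32_t> cur = vec;
  const uint16_t elts[2] = {lo, hi};
  for (uint32_t i = 0; i != 2; ++i) {
    // An empty `cur` carries no payload; 0 is just a placeholder argument.
    cur = insertVecEltModel(cur.value_or(0), elts[i], Index{true, i});
  }
  return cur;
}
```

In this model the dynamic-index call still yields a value, while the subvector loop always returns an empty optional, which is the behavior the comment describes.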

hsmhsm added inline comments. Apr 29 2022, 1:41 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801	Hmm, I did not realize that. But I was told that returning SDValue() means the node is treated as legal, so that no default expansion takes place, which would otherwise introduce stack access? I do not fully understand it myself; I need to explore it in more detail.

hsmhsm marked an inline comment as not done. Apr 29 2022, 3:30 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

5801

Here is what happens. Let's consider INSERT_VECTOR_ELT lowering using the example below.

define amdgpu_kernel void @insertelement_v2f16_0(<2 x i16> addrspace(1)* %out, <2 x i16> %a) {
  %vecins = insertelement <2 x i16> %a, i16 100, i32 0
  store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
  ret void
}

Here, since the index is constant, we skip custom lowering (via bit manipulation), which triggers the default expansion: INSERT_VECTOR_ELT expands to a VECTOR_SHUFFLE, which builds a new vector with the required element inserted. In this case we end up efficiently selecting the S_PACK_LH_B32_B16 instruction, so the final assembly for gfx90a looks like this:

s_load_dword s2, s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
v_mov_b32_e32 v0, 0
s_waitcnt lgkmcnt(0)
s_pack_lh_b32_b16 s2, 0x64, s2
v_mov_b32_e32 v1, s2
global_store_dword v0, v1, s[0:1]
s_endpgm

When the index is not constant, we cannot take the VECTOR_SHUFFLE path, since we cannot do a scalar_to_vector of a dynamic index. Hence we take the custom lowering via bit manipulation, and we end up with the assembly below for gfx90a, which looks a bit inefficient compared to the earlier one.

s_load_dword s2, s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
v_mov_b32_e32 v0, 0
s_waitcnt lgkmcnt(0)
s_and_b32 s2, s2, 0xffff0000
s_or_b32 s2, s2, 0x64
v_mov_b32_e32 v1, s2
global_store_dword v0, v1, s[0:1]
s_endpgm

Now, coming to this change to custom lower INSERT_SUBVECTOR directly: I think I completely missed the above reasoning. I need to look into it again.
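
The two paths discussed in this comment can be compared on plain integers. The following is a minimal sketch, assuming a `<2 x i16>` vector packed into a `uint32_t` with element 0 in the low half; `insertDynamic` and `packLH` are hypothetical names. `insertDynamic` follows the splat, shift, mask, and OR sequence used by the custom lowering, and `packLH` models the low-half/high-half selection that the `s_pack_lh_b32_b16` line in the listing appears to perform.

```cpp
#include <cassert>
#include <cstdint>

// Dynamic-index path: splat the value into both lanes, build a lane mask from
// the index, then combine the masked splat with the masked original vector.
// Assumes idx is 0 or 1 (two 16-bit lanes in a 32-bit word).
constexpr uint32_t insertDynamic(uint32_t vec, uint16_t val, uint32_t idx) {
  assert(idx < 2);
  uint32_t splat = (uint32_t(val) << 16) | val; // value replicated in both lanes
  uint32_t scaledIdx = idx << 4;                // element index -> bit index
  uint32_t bfm = 0xffffu << scaledIdx;          // mask selecting the target lane
  return (bfm & splat) | (~bfm & vec);
}

// Constant-index-0 path: low half from the first operand, high half from the
// second operand.
constexpr uint32_t packLH(uint32_t src0, uint32_t src1) {
  return (src0 & 0x0000ffffu) | (src1 & 0xffff0000u);
}

static_assert(insertDynamic(0xAAAABBBBu, 0x64, 0) == 0xAAAA0064u, "lane 0");
static_assert(insertDynamic(0xAAAABBBBu, 0x64, 1) == 0x0064BBBBu, "lane 1");
static_assert(packLH(0x64, 0xAAAABBBBu) == 0xAAAA0064u, "pack equals lane-0 insert");
```

For index 0 both functions give the same packed value; the pack form simply needs one operation where the mask-and-OR form needs several.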

hsmhsm marked an inline comment as not done. Apr 29 2022, 3:44 AM

hsmhsm marked an inline comment as not done. Apr 29 2022, 3:49 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

5801

Nevertheless, for a dynamic index we need the above custom lowering; otherwise there will be stack access, as shown below.

define amdgpu_kernel void @foo(<2 x i16> addrspace(1)* %out, <2 x i16> %a, i32 %idx) {
  %vecins = insertelement <2 x i16> %a, i16 100, i32 %idx
  store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
  ret void
}

With the default expansion of INSERT_VECTOR_ELT, we get the assembly below for gfx90a.

s_add_u32 s0, s0, s7
s_load_dword s6, s[4:5], 0xc
s_load_dword s7, s[4:5], 0x8
s_addc_u32 s1, s1, 0
v_mov_b32_e32 v0, 4
s_load_dwordx2 s[4:5], s[4:5], 0x0
s_waitcnt lgkmcnt(0)
s_and_b32 s6, s6, 1
v_mov_b32_e32 v1, s7
s_lshl_b32 s6, s6, 1
buffer_store_dword v1, off, s[0:3], 0 offset:4
v_or_b32_e32 v0, s6, v0
v_mov_b32_e32 v1, 0x64
buffer_store_short v1, v0, s[0:3], 0 offen
buffer_load_dword v0, off, s[0:3], 0 offset:4
v_mov_b32_e32 v1, 0
s_waitcnt vmcnt(0)
global_store_dword v1, v0, s[4:5]
s_endpgm
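
What the listing above does through scratch memory can be modeled on the host. This is a rough sketch under stated assumptions: `insertViaStackSlot` is a hypothetical name, a 4-byte array stands in for the stack slot, and the host is assumed to be little-endian so that element 0 sits at byte offset 0. The index is clamped with `& 1` and scaled to a byte offset, as the `s_and_b32` and `s_lshl_b32` lines suggest.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Model of the default expansion: spill the whole vector, overwrite one
// 16-bit element in memory at a clamped byte offset, then reload the vector.
// Assumes a little-endian host.
uint32_t insertViaStackSlot(uint32_t vec, uint16_t val, uint32_t idx) {
  unsigned char slot[sizeof(uint32_t)];               // stand-in for the stack slot
  std::memcpy(slot, &vec, sizeof(vec));               // store the full vector
  uint32_t byteOffset = (idx & 1u) << 1;              // clamp index, scale to bytes
  std::memcpy(slot + byteOffset, &val, sizeof(val));  // store the new element
  uint32_t result;
  std::memcpy(&result, slot, sizeof(result));         // reload the full vector
  return result;
}
```

The result matches the mask-and-OR form for in-range indices; the difference is the round trip through memory, which is the cost the custom lowering avoids.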

This patch helped me understand what is going on with the custom lowering of a few vector operations, but otherwise the change to INSERT_SUBVECTOR in this patch does not make any sense. Hence I am abandoning it. We will be coming up with updated patch(es).

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	86 lines
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll	16 lines

Diff 425974

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


(5,727 lines not shown; the hunk begins inside SDValue SITargetLowering::lowerADDRSPACECAST(SDValue Op,)

   const MachineFunction &MF = DAG.getMachineFunction();
   DiagnosticInfoUnsupported InvalidAddrSpaceCast(
       MF.getFunction(), "invalid addrspacecast", SL.getDebugLoc());
   DAG.getContext()->diagnose(InvalidAddrSpaceCast);

   return DAG.getUNDEF(ASC->getValueType(0));
 }

+static SDValue InsertVecElt(SDValue Vec, SDValue InsVal, SDValue Idx, SDLoc &SL,
+                            SelectionDAG &DAG) {
+  if (isa<ConstantSDNode>(Idx))
+    return SDValue();
+
+  EVT VecVT = Vec.getValueType();
+  EVT EltVT = VecVT.getVectorElementType();
+  unsigned VecSize = VecVT.getSizeInBits();
+  unsigned EltSize = EltVT.getSizeInBits();
+
+  assert(VecSize <= 64);
+
+  MVT IntVT = MVT::getIntegerVT(VecSize);
+
+  // Avoid stack access for dynamic indexing.
+  // v_bfi_b32 (v_bfm_b32 16, (shl idx, 16)), val, vec
+
+  // Create a congruent vector with the target value in each element so that
+  // the required element can be masked and ORed into the target vector.
+  SDValue ExtVal = DAG.getNode(ISD::BITCAST, SL, IntVT,
+                               DAG.getSplatBuildVector(VecVT, SL, InsVal));
+
+  assert(isPowerOf2_32(EltSize));
+  SDValue ScaleFactor = DAG.getConstant(Log2_32(EltSize), SL, MVT::i32);
+
+  // Convert vector index to bit-index.
+  SDValue ScaledIdx = DAG.getNode(ISD::SHL, SL, MVT::i32, Idx, ScaleFactor);
+
+  SDValue BCVec = DAG.getNode(ISD::BITCAST, SL, IntVT, Vec);
+  SDValue BFM = DAG.getNode(ISD::SHL, SL, IntVT,
+                            DAG.getConstant(0xffff, SL, IntVT), ScaledIdx);
+
+  SDValue LHS = DAG.getNode(ISD::AND, SL, IntVT, BFM, ExtVal);
+  SDValue RHS =
+      DAG.getNode(ISD::AND, SL, IntVT, DAG.getNOT(SL, BFM, IntVT), BCVec);
+
+  SDValue BFI = DAG.getNode(ISD::OR, SL, IntVT, LHS, RHS);
+  return DAG.getNode(ISD::BITCAST, SL, VecVT, BFI);
+}
+
 // This lowers an INSERT_SUBVECTOR by extracting the individual elements from
 // the small vector and inserting them into the big vector. That is better than
 // the default expansion of doing it via a stack slot. Even though the use of
 // the stack slot would be optimized away afterwards, the stack slot itself
 // remains.
 SDValue SITargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
                                                 SelectionDAG &DAG) const {
   SDValue Vec = Op.getOperand(0);
   SDValue Ins = Op.getOperand(1);
   SDValue Idx = Op.getOperand(2);
   EVT VecVT = Vec.getValueType();
   EVT InsVT = Ins.getValueType();
   EVT EltVT = VecVT.getVectorElementType();
+  unsigned VecSize = VecVT.getSizeInBits();
   unsigned InsNumElts = InsVT.getVectorNumElements();
   unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
   SDLoc SL(Op);

   for (unsigned I = 0; I != InsNumElts; ++I) {
     SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT, Ins,
                               DAG.getConstant(I, SL, MVT::i32));
-    Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, SL, VecVT, Vec, Elt,
-                      DAG.getConstant(IdxVal + I, SL, MVT::i32));
+    SDValue CurIdx = DAG.getConstant(IdxVal + I, SL, MVT::i32);
+    if (VecSize <= 64) {
+      // We can directly custom lower when the target vector size is <= 64
+      // instead of again taking it through INSERT_VECTOR_ELT.
+      Vec = InsertVecElt(Vec, Elt, CurIdx, SL, DAG);
+    } else {
+      // TODO: These cases really are not getting custom lowered at the moment.
+      // We should custom lower all possible vec sizes in a generic way.
+      Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, SL, VecVT, Vec, Elt, CurIdx);
+    }
   }

   return Vec;
 }

 SDValue SITargetLowering::lowerINSERT_VECTOR_ELT(SDValue Op,
                                                  SelectionDAG &DAG) const {
   SDValue Vec = Op.getOperand(0);
   SDValue InsVal = Op.getOperand(1);
   SDValue Idx = Op.getOperand(2);
   EVT VecVT = Vec.getValueType();
   EVT EltVT = VecVT.getVectorElementType();
   unsigned VecSize = VecVT.getSizeInBits();
   unsigned EltSize = EltVT.getSizeInBits();


   assert(VecSize <= 64);

   unsigned NumElts = VecVT.getVectorNumElements();
   SDLoc SL(Op);
   auto KIdx = dyn_cast<ConstantSDNode>(Idx);

   if (NumElts == 4 && EltSize == 16 && KIdx) {
     SDValue BCVec = DAG.getNode(ISD::BITCAST, SL, MVT::v2i32, Vec);

(17 lines not shown)

     SDValue Concat = InsertLo ?
       DAG.getBuildVector(MVT::v2i32, SL, { InsHalf, HiHalf }) :
       DAG.getBuildVector(MVT::v2i32, SL, { LoHalf, InsHalf });

     return DAG.getNode(ISD::BITCAST, SL, VecVT, Concat);
   }

-  if (isa<ConstantSDNode>(Idx))
-    return SDValue();
-
-  MVT IntVT = MVT::getIntegerVT(VecSize);
-
-  // Avoid stack access for dynamic indexing.
-  // v_bfi_b32 (v_bfm_b32 16, (shl idx, 16)), val, vec
-
-  // Create a congruent vector with the target value in each element so that
-  // the required element can be masked and ORed into the target vector.
-  SDValue ExtVal = DAG.getNode(ISD::BITCAST, SL, IntVT,
-                               DAG.getSplatBuildVector(VecVT, SL, InsVal));
-
-  assert(isPowerOf2_32(EltSize));
-  SDValue ScaleFactor = DAG.getConstant(Log2_32(EltSize), SL, MVT::i32);
-
-  // Convert vector index to bit-index.
-  SDValue ScaledIdx = DAG.getNode(ISD::SHL, SL, MVT::i32, Idx, ScaleFactor);
-
-  SDValue BCVec = DAG.getNode(ISD::BITCAST, SL, IntVT, Vec);
-  SDValue BFM = DAG.getNode(ISD::SHL, SL, IntVT,
-                            DAG.getConstant(0xffff, SL, IntVT),
-                            ScaledIdx);
-
-  SDValue LHS = DAG.getNode(ISD::AND, SL, IntVT, BFM, ExtVal);
-  SDValue RHS = DAG.getNode(ISD::AND, SL, IntVT,
-                            DAG.getNOT(SL, BFM, IntVT), BCVec);
-
-  SDValue BFI = DAG.getNode(ISD::OR, SL, IntVT, LHS, RHS);
-  return DAG.getNode(ISD::BITCAST, SL, VecVT, BFI);
+  return InsertVecElt(Vec, InsVal, Idx, SL, DAG);
 }

 SDValue SITargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
                                                   SelectionDAG &DAG) const {
   SDLoc SL(Op);

   EVT ResultVT = Op.getValueType();
   SDValue Vec = Op.getOperand(0);

(6,838 lines not shown)
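
The element-by-element strategy described in the comment above `lowerINSERT_SUBVECTOR` can be sketched outside of SelectionDAG. This is only an illustration on `std::array`, with `insertSubvector` as a hypothetical name: each iteration reads one element of the small vector and writes it into the big vector at a constant position, which corresponds to one EXTRACT_VECTOR_ELT followed by one INSERT_VECTOR_ELT per element.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Insert the elements of `ins` into `vec` starting at the constant position
// `idxVal`, one element at a time.
template <std::size_t N, std::size_t M>
std::array<uint16_t, N> insertSubvector(std::array<uint16_t, N> vec,
                                        const std::array<uint16_t, M> &ins,
                                        std::size_t idxVal) {
  assert(idxVal + M <= N && "subvector must fit inside the target vector");
  for (std::size_t i = 0; i != M; ++i) {
    uint16_t elt = ins[i];  // analogue of EXTRACT_VECTOR_ELT ins, i
    vec[idxVal + i] = elt;  // analogue of INSERT_VECTOR_ELT vec, elt, idxVal + i
  }
  return vec;
}
```

Every index in this loop is known up front, which is why the per-element inserts it generates always carry constant indices.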

llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll

(1,289 lines not shown; nearest preceding context: ; GFX10-NEXT: s_setpc_b64 s[30:31])

   %val1 = load <6 x half>, <6 x half> addrspace(1)* %arg1
   %shuffle = shufflevector <6 x half> %val0, <6 x half> %val1, <6 x i32> <i32 4, i32 5, i32 2, i32 3, i32 6, i32 7>
   ret <6 x half> %shuffle
 }

 define amdgpu_kernel void @fma_shuffle(<4 x half> addrspace(1)* nocapture readonly %A, <4 x half> addrspace(1)* nocapture readonly %B, <4 x half> addrspace(1)* nocapture %C) {
 ; GFX9-LABEL: fma_shuffle:
 ; GFX9: ; %bb.0: ; %entry
-; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
+; GFX9-NEXT: s_add_u32 s0, s0, s7
+; GFX9-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x0
 ; GFX9-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x10
 ; GFX9-NEXT: v_lshlrev_b32_e32 v6, 3, v0
+; GFX9-NEXT: s_addc_u32 s1, s1, 0
 ; GFX9-NEXT: s_waitcnt lgkmcnt(0)
-; GFX9-NEXT: global_load_dwordx2 v[0:1], v6, s[0:1]
-; GFX9-NEXT: global_load_dwordx2 v[2:3], v6, s[2:3]
+; GFX9-NEXT: global_load_dwordx2 v[0:1], v6, s[8:9]
+; GFX9-NEXT: global_load_dwordx2 v[2:3], v6, s[10:11]
 ; GFX9-NEXT: global_load_dwordx2 v[4:5], v6, s[6:7]
 ; GFX9-NEXT: s_waitcnt vmcnt(0)
 ; GFX9-NEXT: v_pk_fma_f16 v4, v0, v2, v4 op_sel_hi:[0,1,1]
 ; GFX9-NEXT: v_pk_fma_f16 v2, v1, v2, v5 op_sel_hi:[0,1,1]
 ; GFX9-NEXT: v_pk_fma_f16 v0, v0, v3, v4 op_sel:[1,0,0]
 ; GFX9-NEXT: v_pk_fma_f16 v1, v1, v3, v2 op_sel:[1,0,0]
 ; GFX9-NEXT: global_store_dwordx2 v6, v[0:1], s[6:7]
 ; GFX9-NEXT: s_endpgm
 ;
 ; GFX10-LABEL: fma_shuffle:
 ; GFX10: ; %bb.0: ; %entry
+; GFX10-NEXT: s_add_u32 s0, s0, s7
 ; GFX10-NEXT: s_clause 0x1
-; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
+; GFX10-NEXT: s_load_dwordx4 s[8:11], s[4:5], 0x0
 ; GFX10-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x10
 ; GFX10-NEXT: v_lshlrev_b32_e32 v6, 3, v0
+; GFX10-NEXT: s_addc_u32 s1, s1, 0
 ; GFX10-NEXT: s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT: s_clause 0x2
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v6, s[0:1]
-; GFX10-NEXT: global_load_dwordx2 v[2:3], v6, s[2:3]
+; GFX10-NEXT: global_load_dwordx2 v[0:1], v6, s[8:9]
+; GFX10-NEXT: global_load_dwordx2 v[2:3], v6, s[10:11]
 ; GFX10-NEXT: global_load_dwordx2 v[4:5], v6, s[6:7]
 ; GFX10-NEXT: s_waitcnt vmcnt(0)
 ; GFX10-NEXT: v_pk_fma_f16 v4, v0, v2, v4 op_sel_hi:[0,1,1]
 ; GFX10-NEXT: v_pk_fma_f16 v2, v1, v2, v5 op_sel_hi:[0,1,1]
 ; GFX10-NEXT: v_pk_fma_f16 v0, v0, v3, v4 op_sel:[1,0,0]
 ; GFX10-NEXT: v_pk_fma_f16 v1, v1, v3, v2 op_sel:[1,0,0]
 ; GFX10-NEXT: global_store_dwordx2 v6, v[0:1], s[6:7]
 ; GFX10-NEXT: s_endpgm

(106 lines not shown)
