This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Custom lower INSERT_SUBVECTOR v3, v4, v5, v8
ClosedPublic

Authored by tpr on Jun 11 2019, 1:10 PM.

Details

Reviewers

Commits

rG5816889c748b: [AMDGPU] Custom lower INSERT_SUBVECTOR v3, v4, v5, v8
rL365148: [AMDGPU] Custom lower INSERT_SUBVECTOR v3, v4, v5, v8

Summary

Since the changes to introduce vec3 and vec5, INSERT_SUBVECTOR for these
sizes has been marked "expand", which made LegalizeDAG lower it to loads
and stores via a stack slot. The code got optimized a bit later, but the
now-unused stack slot was never deleted.

This commit avoids that problem by custom lowering INSERT_SUBVECTOR into
an EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT for each element in the
subvector to insert.
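
As a rough illustration of the element-by-element approach described above, here is a minimal standalone sketch in plain C++. It is not the LLVM code: `insertSubvector` is a hypothetical helper, and `std::vector<int>` merely stands in for a SelectionDAG vector value. Each loop iteration models one EXTRACT_VECTOR_ELT from the small vector followed by one INSERT_VECTOR_ELT into the big vector, so no intermediate memory (the stack slot in the real lowering) is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: std::vector<int> stands in for a DAG vector value, and
// insertSubvector is a hypothetical name, not an LLVM API.
std::vector<int> insertSubvector(std::vector<int> Vec,
                                 const std::vector<int> &Ins,
                                 std::size_t IdxVal) {
  // The subvector must fit entirely inside the destination vector.
  assert(IdxVal + Ins.size() <= Vec.size() && "subvector out of range");
  for (std::size_t I = 0; I != Ins.size(); ++I) {
    int Elt = Ins[I];       // models EXTRACT_VECTOR_ELT Ins, I
    Vec[IdxVal + I] = Elt;  // models INSERT_VECTOR_ELT Vec, Elt, IdxVal + I
  }
  return Vec;
}
```

For example, inserting a three-element vector at index 0 of a four-element vector overwrites the first three elements and leaves the fourth untouched, which corresponds to the vec3-widened-to-vec4 case the commit targets.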

Change-Id: I9e3c13e36f68cfa3431bb9814851cc1f673274e1

Diff Detail

Repository: rL LLVM

Event Timeline

tpr created this revision. Jun 11 2019, 1:10 PM

Herald added a project: Restricted Project. Jun 11 2019, 1:10 PM

Herald added subscribers: llvm-commits, t-tye, dstuttard and 6 others.

Harbormaster completed remote builds in B33235: Diff 204142. Jun 11 2019, 1:10 PM

arsenm added inline comments. Jun 11 2019, 1:15 PM

lib/Target/AMDGPU/SIISelLowering.cpp
348 (On Diff #204142): Should this handle MVT::Other and get all the types? What happens with v6i32 or v4i16?
test/CodeGen/AMDGPU/insert-subvector-unused-scratch.ll
6 (On Diff #204142): Should check more, even if generated
12 (On Diff #204142): No case for 5 x?

V2: Addressed review comments re test.

Harbormaster completed remote builds in B33271: Diff 204248. Jun 12 2019, 3:30 AM

tpr marked 3 inline comments as done. Jun 12 2019, 3:31 AM

tpr added inline comments.

lib/Target/AMDGPU/SIISelLowering.cpp
348 (On Diff #204142): I would prefer not to in this commit. Here I am just trying to undo the damage done by my vec3 and vec5 changes, which (a) added those setOperationActions with Expand, and (b) implemented some joining involving the new vec3/vec5 types in terms of INSERT_SUBVECTOR.

LGTM

This revision is now accepted and ready to land. Jun 18 2019, 4:35 PM

Closed by commit rL365148: [AMDGPU] Custom lower INSERT_SUBVECTOR v3, v4, v5, v8 (authored by tpr). Jul 4 2019, 10:38 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                 Size
llvm/trunk/lib/Target/AMDGPU/SIISelLowering.h                        1 line
llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp                      44 lines
llvm/trunk/test/CodeGen/AMDGPU/insert-subvector-unused-scratch.ll    32 lines

Diff 208059

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.h

[115 lines not shown]
 private:
   /// Custom lowering for ISD::FP_ROUND for MVT::f16.
   SDValue lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerFMINNUM_FMAXNUM(SDValue Op, SelectionDAG &DAG) const;

   SDValue getSegmentAperture(unsigned AS, const SDLoc &DL,
                              SelectionDAG &DAG) const;

   SDValue lowerADDRSPACECAST(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerINSERT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerTRAP(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerDEBUGTRAP(SDValue Op, SelectionDAG &DAG) const;

   SDNode *adjustWritemask(MachineSDNode *&N, SelectionDAG &DAG) const;
[250 lines not shown]

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

[329 lines not shown]
 #endif
   setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v8i8, Custom);

   setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i16, Custom);
   setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4f16, Custom);
   setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4i16, Custom);
   setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4f16, Custom);

   // Deal with vec3 vector operations when widened to vec4.
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v3i32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v3f32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4i32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4f32, Expand);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v3i32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v3f32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4i32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4f32, Custom);

   // Deal with vec5 vector operations when widened to vec8.
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v5i32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v5f32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i32, Expand);
-  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8f32, Expand);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v5i32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v5f32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i32, Custom);
+  setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8f32, Custom);

   // BUFFER/FLAT_ATOMIC_CMP_SWAP on GCN GPUs needs input marshalling,
   // and output demarshalling
   setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i32, Custom);
   setOperationAction(ISD::ATOMIC_CMP_SWAP, MVT::i64, Custom);

   // We can't return success/failure, only the old value,
   // let LLVM add the comparison
[3,595 lines not shown]
   case ISD::GlobalAddress: {
     MachineFunction &MF = DAG.getMachineFunction();
     SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
     return LowerGlobalAddress(MFI, Op, DAG);
   }
   case ISD::INTRINSIC_WO_CHAIN: return LowerINTRINSIC_WO_CHAIN(Op, DAG);
   case ISD::INTRINSIC_W_CHAIN: return LowerINTRINSIC_W_CHAIN(Op, DAG);
   case ISD::INTRINSIC_VOID: return LowerINTRINSIC_VOID(Op, DAG);
   case ISD::ADDRSPACECAST: return lowerADDRSPACECAST(Op, DAG);
+  case ISD::INSERT_SUBVECTOR:
+    return lowerINSERT_SUBVECTOR(Op, DAG);
   case ISD::INSERT_VECTOR_ELT:
     return lowerINSERT_VECTOR_ELT(Op, DAG);
   case ISD::EXTRACT_VECTOR_ELT:
     return lowerEXTRACT_VECTOR_ELT(Op, DAG);
   case ISD::VECTOR_SHUFFLE:
     return lowerVECTOR_SHUFFLE(Op, DAG);
   case ISD::BUILD_VECTOR:
     return lowerBUILD_VECTOR(Op, DAG);
[656 lines not shown]
 SDValue SITargetLowering::lowerADDRSPACECAST(SDValue Op,
   const MachineFunction &MF = DAG.getMachineFunction();
   DiagnosticInfoUnsupported InvalidAddrSpaceCast(
     MF.getFunction(), "invalid addrspacecast", SL.getDebugLoc());
   DAG.getContext()->diagnose(InvalidAddrSpaceCast);

   return DAG.getUNDEF(ASC->getValueType(0));
 }

+// This lowers an INSERT_SUBVECTOR by extracting the individual elements from
+// the small vector and inserting them into the big vector. That is better than
+// the default expansion of doing it via a stack slot. Even though the use of
+// the stack slot would be optimized away afterwards, the stack slot itself
+// remains.
+SDValue SITargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
+                                                SelectionDAG &DAG) const {
+  SDValue Vec = Op.getOperand(0);
+  SDValue Ins = Op.getOperand(1);
+  SDValue Idx = Op.getOperand(2);
+  EVT VecVT = Vec.getValueType();
+  EVT InsVT = Ins.getValueType();
+  EVT EltVT = VecVT.getVectorElementType();
+  unsigned InsNumElts = InsVT.getVectorNumElements();
+  unsigned IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
+  SDLoc SL(Op);
+
+  for (unsigned I = 0; I != InsNumElts; ++I) {
+    SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT, Ins,
+                              DAG.getConstant(I, SL, MVT::i32));
+    Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, SL, VecVT, Vec, Elt,
+                      DAG.getConstant(IdxVal + I, SL, MVT::i32));
+  }
+  return Vec;
+}
+
 SDValue SITargetLowering::lowerINSERT_VECTOR_ELT(SDValue Op,
                                                  SelectionDAG &DAG) const {
   SDValue Vec = Op.getOperand(0);
   SDValue InsVal = Op.getOperand(1);
   SDValue Idx = Op.getOperand(2);
   EVT VecVT = Vec.getValueType();
   EVT EltVT = VecVT.getVectorElementType();
   unsigned VecSize = VecVT.getSizeInBits();
[5,968 lines not shown]

llvm/trunk/test/CodeGen/AMDGPU/insert-subvector-unused-scratch.ll

+; RUN: llc -mtriple amdgcn-amd-- -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
+
+; Before the fix that this test was committed with, this code would leave
+; an unused stack slot, causing ScratchSize to be non-zero.
+
+; GCN-LABEL: store_v3i32:
+; GCN: ds_read_b64
+; GCN: ds_read_b32
+; GCN: ds_write_b32
+; GCN: ds_write_b64
+; GCN: ScratchSize: 0
+define amdgpu_kernel void @store_v3i32(<3 x i32> addrspace(3)* %out, <3 x i32> %a) nounwind {
+  %val = load <3 x i32>, <3 x i32> addrspace(3)* %out
+  %val.1 = add <3 x i32> %a, %val
+  store <3 x i32> %val.1, <3 x i32> addrspace(3)* %out, align 16
+  ret void
+}
+
+; GCN-LABEL: store_v5i32:
+; GCN: ds_read2_b64
+; GCN: ds_read_b32
+; GCN: ds_write_b32
+; GCN: ds_write2_b64
+; GCN: ScratchSize: 0
+define amdgpu_kernel void @store_v5i32(<5 x i32> addrspace(3)* %out, <5 x i32> %a) nounwind {
+  %val = load <5 x i32>, <5 x i32> addrspace(3)* %out
+  %val.1 = add <5 x i32> %a, %val
+  store <5 x i32> %val.1, <5 x i32> addrspace(3)* %out, align 16
+  ret void
+}