This is an archive of the discontinued LLVM Phabricator instance.

[ARM] FP16: constant initialised v4f16 and v8f16 vectors
AbandonedPublic

Authored by SjoerdMeijer on Nov 29 2018, 9:10 AM.

Details

Reviewers

olista01
samparker
efriedma

Summary

Compilation and autovectorisation of a fp16 reduction kernel like this:

    
_Float16 sum = .0F16;
for (unsigned i = 0; i < N; i++)
  sum += A[i];
return sum;

fails with an instruction selection 'cannot match' error. A BUILD_VECTOR node is created to hold the 'sum' vector, which gets initialised with VMOVIMM. The problem was that BUILD_VECTOR nodes for v4f16 and v8f16 were assigned the wrong type, so instruction selection didn't know how to lower the VMOVIMM.

There are different ways to initialise vectors with constants, e.g. constant pool loads or vmov with immediates. This BUILD_VECTOR node is yet another case: it gets created for constant-initialised phi nodes, which we were also not handling.

In a follow-up commit, I will add support for 'extractelt' from v4f16 and v8f16 vectors, which is the last step needed to get this fully working.
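
As a rough illustration of the behaviour described in the summary, the sketch below models how a splat width could be derived for a vector of f16 bit patterns, and how the patch widens an all-zero splat to 16 bits for f16 vectors. The names `smallestSplatBits` and `widenZeroSplat` are hypothetical, and this is a simplified model under those assumptions, not LLVM's `isConstantSplat`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the smallest width in bits (8, 16, 32, ...)
// whose byte pattern, repeated, reproduces the whole vector of 16-bit lanes.
// Returns the full vector width when no narrower pattern repeats.
unsigned smallestSplatBits(const std::vector<uint16_t> &lanes) {
  assert(!lanes.empty() && "expected at least one lane");
  std::vector<uint8_t> bytes;
  for (uint16_t lane : lanes) {
    bytes.push_back(static_cast<uint8_t>(lane & 0xff)); // low byte first
    bytes.push_back(static_cast<uint8_t>(lane >> 8));
  }
  for (std::size_t width = 1; width < bytes.size(); width *= 2) {
    bool repeats = true;
    for (std::size_t i = width; i < bytes.size() && repeats; ++i)
      repeats = (bytes[i] == bytes[i % width]);
    if (repeats)
      return static_cast<unsigned>(width * 8);
  }
  return static_cast<unsigned>(bytes.size() * 8);
}

// Mirrors the rule touched by this patch: an all-zero splat is widened from
// 8 bits to 16 bits for f16 vectors, and to 32 bits otherwise.
unsigned widenZeroSplat(unsigned splatBitSize, bool allZero, bool isFP16) {
  if (allZero)
    return isFP16 ? 16u : 32u;
  return splatBitSize;
}
```

Under this model a zeroinitializer vector splats at 8 bits and is widened to 16 for f16, which lines up with the `vmov.i16` the test expects. A vector with only lane 0 set to 1.0 (0x3C00) has no narrow splat, which is consistent with the constant pool load in the `vec8_one_init` test.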

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Nov 29 2018, 9:10 AM

Herald added subscribers: kristof.beyls, javed.absar. Nov 29 2018, 9:10 AM

The lib/Target/ARM/ARMInstrNEON.td changes look like they belong in a separate patch. Also, there doesn't appear to be any explicit test coverage of the bitcasts.

Please clean up the testcase so it only contains the relevant operations; the whole loop isn't necessary.

lib/Target/ARM/ARMISelLowering.cpp
5744	Why does this matter?

olista01 added inline comments. Nov 30 2018, 1:04 AM

lib/Target/ARM/ARMInstrNEON.td
7145	There are still a lot of type combinations missing here; I think it would be better to add all of them at once. It might also be better to rewrite all of these patterns as a pair of TableGen foreach loops, so that we don't miss any combinations.
7228	Why does this need a VREV? All of the other cases where the lane widths are the same are no-ops for BE and LE.
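
To make the two inline comments above concrete, here is a small sketch of the "enumerate every pair" idea, written in C++ rather than TableGen. It assumes a rule inferred from the existing D-register patterns in this diff: equal lane widths need no instruction on either endianness, and on big-endian differing lane widths need `VREV<wider>d<narrower>`. `VecType`, `bitconvertPattern` and `allBitconvertPatterns` are hypothetical names, and Q-register patterns (which use `q` instead of `d`) are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A 64-bit NEON vector type: its TableGen name and the width of one lane.
struct VecType {
  std::string name;
  unsigned laneBits;
};

// Builds one bitconvert pattern for a pair of D-register types, using the
// rule assumed in the lead-in (inferred from the existing patterns).
std::string bitconvertPattern(const VecType &dst, const VecType &src,
                              bool bigEndian) {
  assert(dst.laneBits > 0 && src.laneBits > 0);
  std::string result;
  if (bigEndian && dst.laneBits != src.laneBits) {
    unsigned wide = std::max(dst.laneBits, src.laneBits);
    unsigned narrow = std::min(dst.laneBits, src.laneBits);
    result = "VREV" + std::to_string(wide) + "d" + std::to_string(narrow);
  } else {
    result = dst.name; // no-op: just reinterpret the register
  }
  return "def : Pat<(" + dst.name + " (bitconvert (" + src.name +
         " DPR:$src))), (" + result + " DPR:$src)>;";
}

// Enumerates every ordered pair of distinct types, so that no combination
// can be forgotten.
std::vector<std::string>
allBitconvertPatterns(const std::vector<VecType> &types, bool bigEndian) {
  std::vector<std::string> out;
  for (const VecType &dst : types)
    for (const VecType &src : types)
      if (dst.name != src.name)
        out.push_back(bitconvertPattern(dst, src, bigEndian));
  return out;
}
```

With this rule, the big-endian v4i16 to v4f16 case comes out as a plain register reinterpretation rather than a VREV, which is what the second inline comment suggests it should be.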

Thanks for taking a look!

I am actually going to split this up into 3 patches:

the changes in ARMISelDAGToDAG.cpp, which are related to (post-increment) VLD1 and should also be tested explicitly,
the changes in ARMInstrNEON.td (the bitcasts),
and then this one to support BUILD_VECTOR.

Please clean up the testcase so it only contains the relevant operations; the whole loop isn't necessary.

I think for testing this BUILD_VECTOR change, a loop is necessary, as this code is generated from:

%vec.phi = phi <8 x half> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]

but I will have a look again.

SjoerdMeijer abandoned this revision. Mar 17 2023, 1:43 AM

Herald added a project: Restricted Project. Mar 17 2023, 1:43 AM

Revision Contents

Path                                   Size
lib/Target/ARM/ARMISelDAGToDAG.cpp     4 lines
lib/Target/ARM/ARMISelLowering.cpp     18 lines
lib/Target/ARM/ARMInstrNEON.td         4 lines
test/CodeGen/ARM/fp16-reduction.ll     227 lines

Diff 175879

lib/Target/ARM/ARMISelDAGToDAG.cpp

@@ ... @@ void ARMDAGToDAGISel::SelectVLD(SDNode *N, bool isUpdating, unsigned NumVecs,
   bool is64BitVector = VT.is64BitVector();
   Align = GetVLDSTAlign(Align, dl, NumVecs, is64BitVector);

   unsigned OpcodeIndex;
   switch (VT.getSimpleVT().SimpleTy) {
   default: llvm_unreachable("unhandled vld type");
     // Double-register operations:
   case MVT::v8i8: OpcodeIndex = 0; break;
+  case MVT::v4f16:
   case MVT::v4i16: OpcodeIndex = 1; break;
   case MVT::v2f32:
   case MVT::v2i32: OpcodeIndex = 2; break;
   case MVT::v1i64: OpcodeIndex = 3; break;
     // Quad-register operations:
   case MVT::v16i8: OpcodeIndex = 0; break;
+  case MVT::v8f16:
   case MVT::v8i16: OpcodeIndex = 1; break;
   case MVT::v4f32:
   case MVT::v4i32: OpcodeIndex = 2; break;
   case MVT::v2f64:
   case MVT::v2i64: OpcodeIndex = 3; break;
   }

   EVT ResTy;
@@ ... @@ void ARMDAGToDAGISel::SelectVLDSTLane(SDNode *N, bool IsLoad, bool isUpdating,
   }
   Align = CurDAG->getTargetConstant(Alignment, dl, MVT::i32);

   unsigned OpcodeIndex;
   switch (VT.getSimpleVT().SimpleTy) {
   default: llvm_unreachable("unhandled vld/vst lane type");
     // Double-register operations:
   case MVT::v8i8: OpcodeIndex = 0; break;
+  case MVT::v4f16:
   case MVT::v4i16: OpcodeIndex = 1; break;
   case MVT::v2f32:
   case MVT::v2i32: OpcodeIndex = 2; break;
     // Quad-register operations:
+  case MVT::v8f16:
   case MVT::v8i16: OpcodeIndex = 0; break;
   case MVT::v4f32:
   case MVT::v4i32: OpcodeIndex = 1; break;
   }

   std::vector<EVT> ResTys;
   if (IsLoad) {
     unsigned ResTyElts = (NumVecs == 3) ? 4 : NumVecs;

lib/Target/ARM/ARMISelLowering.cpp

@@ ... @@
 }

 /// isNEONModifiedImm - Check if the specified splat value corresponds to a
 /// valid vector constant for a NEON instruction with a "modified immediate"
 /// operand (e.g., VMOV). If so, return the encoded value.
 static SDValue isNEONModifiedImm(uint64_t SplatBits, uint64_t SplatUndef,
                                  unsigned SplatBitSize, SelectionDAG &DAG,
                                  const SDLoc &dl, EVT &VT, bool is128Bits,
-                                 NEONModImmType type) {
+                                 NEONModImmType type, bool FP16 = false) {
   unsigned OpCmode, Imm;

   // SplatBitSize is set to the smallest size that splats the vector, so a
   // zero vector will always have SplatBitSize == 8. However, NEON modified
   // immediate instructions others than VMOV do not support the 8-bit encoding
   // of a zero vector, and the default encoding of zero is supposed to be the
-  // 32-bit version.
+  // 32-bit version, and the 16-bit version for f16 vectors.
>> efriedma (inline comment): Why does this matter?
   if (SplatBits == 0)
-    SplatBitSize = 32;
+    SplatBitSize = FP16 ? 16 : 32;

   switch (SplatBitSize) {
   case 8:
     if (type != VMOVModImm)
       return SDValue();
     // Any 1-byte value is OK. Op=0, Cmode=1110.
     assert((SplatBits & ~0xff) == 0 && "one byte splat value is too big");
     OpCmode = 0xe;
@@ ... @@ SDValue ARMTargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,
   bool HasAnyUndefs;
   if (BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs)) {
     if (SplatUndef.isAllOnesValue())
       return DAG.getUNDEF(VT);

     if (SplatBitSize <= 64) {
       // Check if an immediate VMOV works.
       EVT VmovVT;
+      const bool FP16 = (VT == MVT::v4f16 || VT == MVT::v8f16);
       SDValue Val = isNEONModifiedImm(SplatBits.getZExtValue(),
                                       SplatUndef.getZExtValue(), SplatBitSize,
                                       DAG, dl, VmovVT, VT.is128BitVector(),
-                                      VMOVModImm);
+                                      VMOVModImm, FP16);
       if (Val.getNode()) {
         SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, Val);
         return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
       }

       // Try an immediate VMVN.
       uint64_t NegatedImm = (~SplatBits).getZExtValue();
       Val = isNEONModifiedImm(NegatedImm,
@@ ... @@ SDValue ARMTargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,

   // Loads are better lowered with insert_vector_elt/ARMISD::BUILD_VECTOR.
   // Keep going if we are hitting this case.
   if (isOnlyLowElement && !ISD::isNormalLoad(Value.getNode()))
     return DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VT, Value);

   unsigned EltSize = VT.getScalarSizeInBits();

-  // Use VDUP for non-constant splats. For f32 constant splats, reduce to
-  // i32 and try again.
+  // Use VDUP for non-constant splats. For f32 and f16 constant splats, reduce to
+  // i32 and i16 and try again.
   if (hasDominantValue && EltSize <= 32) {
+    EVT IntEltType = (EltSize == 32 ? MVT::i32 : MVT::i16);
     if (!isConstant) {
       SDValue N;

       // If we are VDUPing a value that comes directly from a vector, that will
       // cause an unnecessary move to and from a GPR, where instead we could
       // just use VDUPLANE. We can only do this if the lane being extracted
       // is at a constant index, as the VDUP from lane instructions only have
       // constant-index forms.
@@ ... @@ if (!isConstant) {
           N = DAG.getNode(ISD::INSERT_VECTOR_ELT, dl, VT, Ops);
         }
       }
       return N;
     }
     if (VT.getVectorElementType().isFloatingPoint()) {
       SmallVector<SDValue, 8> Ops;
       for (unsigned i = 0; i < NumElts; ++i)
-        Ops.push_back(DAG.getNode(ISD::BITCAST, dl, MVT::i32,
+        Ops.push_back(DAG.getNode(ISD::BITCAST, dl, IntEltType,
                                   Op.getOperand(i)));
-      EVT VecVT = EVT::getVectorVT(*DAG.getContext(), MVT::i32, NumElts);
+      EVT VecVT = EVT::getVectorVT(*DAG.getContext(), IntEltType, NumElts);
       SDValue Val = DAG.getBuildVector(VecVT, dl, Ops);
       Val = LowerBUILD_VECTOR(Val, DAG, ST);
       if (Val.getNode())
         return DAG.getNode(ISD::BITCAST, dl, VT, Val);
     }
     if (usesOnlyOneValue) {
       SDValue Val = IsSingleInstrConstant(Value, DAG, ST, dl);
       if (isConstant && Val.getNode())
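
For readers wondering what a 16-bit splat buys here, the following sketch checks whether a 16-bit splat value fits a `vmov.i16` modified immediate. It assumes the two 16-bit forms of the NEON modified-immediate encoding as I understand them: a single byte in either the low or the high half of the lane, with cmode 0x8 and 0xa respectively. The `case 16` branch of `isNEONModifiedImm` is not shown in this diff, so treat the cmode values as my reading of the encoding rather than a quote from the patch. `ModImm16` and `encodeVmovI16` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Result of trying to encode a 16-bit splat as a VMOV.i16 immediate.
struct ModImm16 {
  bool valid;     // false if the value has no 16-bit modified-immediate form
  unsigned cmode; // assumed cmode field value (0x8 or 0xa)
  unsigned imm8;  // the single non-zero byte
};

// Assumption: only 0x00nn and 0xnn00 are representable as 16-bit immediates.
ModImm16 encodeVmovI16(uint16_t splat) {
  if ((splat & 0xff00u) == 0) {
    unsigned imm8 = static_cast<unsigned>(splat);
    assert(imm8 <= 0xff);
    return ModImm16{true, 0x8u, imm8}; // byte in the low half: 0x00nn
  }
  if ((splat & 0x00ffu) == 0) {
    unsigned imm8 = static_cast<unsigned>(splat >> 8);
    assert(imm8 <= 0xff);
    return ModImm16{true, 0xau, imm8}; // byte in the high half: 0xnn00
  }
  return ModImm16{false, 0u, 0u};
}
```

Under these assumptions a zero splat is trivially encodable (which is the `vmov.i16 ..., #0x0` the test checks for), and so would be a splat of half 1.0 (0x3C00), whereas a pattern such as 0x3C01 would not be.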

lib/Target/ARM/ARMInstrNEON.td

@@ ... @@ let Predicates = [IsLE] in {
   def : Pat<(v1i64 (bitconvert (v2f32 DPR:$src))), (v1i64 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v1i64 DPR:$src))), (v2i32 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v4i16 DPR:$src))), (v2i32 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v8i8 DPR:$src))), (v2i32 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (f64 DPR:$src))), (v2i32 DPR:$src)>;
 }
 def : Pat<(v2i32 (bitconvert (v2f32 DPR:$src))), (v2i32 DPR:$src)>;
 let Predicates = [IsLE] in {
+  def : Pat<(v4f16 (bitconvert (v4i16 DPR:$src))), (v4f16 DPR:$src)>;
>> olista01 (inline comment): There are still a lot of combinations of types missing here, I think it would be better to add all of them at once. It might also be better to re-write all of these patterns as a pair of tablegen foreach loops, so that we don't miss any combinations.
   def : Pat<(v4i16 (bitconvert (v1i64 DPR:$src))), (v4i16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v2i32 DPR:$src))), (v4i16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v8i8 DPR:$src))), (v4i16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (f64 DPR:$src))), (v4i16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v2f32 DPR:$src))), (v4i16 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v1i64 DPR:$src))), (v8i8 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v2i32 DPR:$src))), (v8i8 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v4i16 DPR:$src))), (v8i8 DPR:$src)>;
@@ ... @@ let Predicates = [IsLE] in {
   def : Pat<(v2i64 (bitconvert (v4f32 QPR:$src))), (v2i64 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v2i64 QPR:$src))), (v4i32 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v8i16 QPR:$src))), (v4i32 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v16i8 QPR:$src))), (v4i32 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v2f64 QPR:$src))), (v4i32 QPR:$src)>;
 }
 def : Pat<(v4i32 (bitconvert (v4f32 QPR:$src))), (v4i32 QPR:$src)>;
 let Predicates = [IsLE] in {
+  def : Pat<(v8f16 (bitconvert (v8i16 QPR:$src))), (v8f16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v2i64 QPR:$src))), (v8i16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v4i32 QPR:$src))), (v8i16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v16i8 QPR:$src))), (v8i16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v2f64 QPR:$src))), (v8i16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v4f32 QPR:$src))), (v8i16 QPR:$src)>;
   def : Pat<(v8f16 (bitconvert (v2f64 QPR:$src))), (v8f16 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v2i64 QPR:$src))), (v16i8 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v4i32 QPR:$src))), (v16i8 QPR:$src)>;
@@ ... @@ let Predicates = [IsBE] in {
   def : Pat<(v1i64 (bitconvert (v2i32 DPR:$src))), (VREV64d32 DPR:$src)>;
   def : Pat<(v1i64 (bitconvert (v4i16 DPR:$src))), (VREV64d16 DPR:$src)>;
   def : Pat<(v1i64 (bitconvert (v8i8 DPR:$src))), (VREV64d8 DPR:$src)>;
   def : Pat<(v1i64 (bitconvert (v2f32 DPR:$src))), (VREV64d32 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v1i64 DPR:$src))), (VREV64d32 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v4i16 DPR:$src))), (VREV32d16 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (v8i8 DPR:$src))), (VREV32d8 DPR:$src)>;
   def : Pat<(v2i32 (bitconvert (f64 DPR:$src))), (VREV64d32 DPR:$src)>;
+  def : Pat<(v4f16 (bitconvert (v4i16 DPR:$src))), (VREV64d16 DPR:$src)>;
>> olista01 (inline comment): Why does this need a VREV? All of the other cases where the lane widths are the same are no-ops for BE and LE.
   def : Pat<(v4i16 (bitconvert (v1i64 DPR:$src))), (VREV64d16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v2i32 DPR:$src))), (VREV32d16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v8i8 DPR:$src))), (VREV16d8 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (f64 DPR:$src))), (VREV64d16 DPR:$src)>;
   def : Pat<(v4i16 (bitconvert (v2f32 DPR:$src))), (VREV32d16 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v1i64 DPR:$src))), (VREV64d8 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v2i32 DPR:$src))), (VREV32d8 DPR:$src)>;
   def : Pat<(v8i8 (bitconvert (v4i16 DPR:$src))), (VREV16d8 DPR:$src)>;
@@ ... @@ let Predicates = [IsBE] in {
   def : Pat<(v4i32 (bitconvert (v2i64 QPR:$src))), (VREV64q32 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v8i16 QPR:$src))), (VREV32q16 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v16i8 QPR:$src))), (VREV32q8 QPR:$src)>;
   def : Pat<(v4i32 (bitconvert (v2f64 QPR:$src))), (VREV64q32 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v2i64 QPR:$src))), (VREV64q16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v4i32 QPR:$src))), (VREV32q16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v16i8 QPR:$src))), (VREV16q8 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v2f64 QPR:$src))), (VREV64q16 QPR:$src)>;
+  def : Pat<(v8f16 (bitconvert (v8i16 QPR:$src))), (VREV64q16 QPR:$src)>;
   def : Pat<(v8f16 (bitconvert (v2f64 QPR:$src))), (VREV64q16 QPR:$src)>;
   def : Pat<(v8i16 (bitconvert (v4f32 QPR:$src))), (VREV32q16 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v2i64 QPR:$src))), (VREV64q8 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v4i32 QPR:$src))), (VREV32q8 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v8i16 QPR:$src))), (VREV16q8 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v2f64 QPR:$src))), (VREV64q8 QPR:$src)>;
   def : Pat<(v16i8 (bitconvert (v4f32 QPR:$src))), (VREV32q8 QPR:$src)>;
   def : Pat<(v4f32 (bitconvert (v2i64 QPR:$src))), (VREV64q32 QPR:$src)>;

test/CodeGen/ARM/fp16-reduction.ll

This file was added.

				; RUN: llc < %s | FileCheck %s
				; RUN: llc -mtriple armeb-unknown < %s | FileCheck %s

				target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "armv8.2a-arm-unknown-eabihf"

				define dso_local float @vec8_zero_init(half* nocapture readonly %V, i32 %N) local_unnamed_addr #0 {
				entry:
				%cmp6 = icmp sgt i32 %N, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader:
				%min.iters.check = icmp ult i32 %N, 8
				br i1 %min.iters.check, label %for.body.preheader15, label %vector.ph

				for.body.preheader15:
				%i.08.ph = phi i32 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
				%Tmp.07.ph = phi half [ 0xH0000, %for.body.preheader ], [ 0xH8000, %middle.block ]
				br label %for.body

				vector.ph:
				%n.vec = and i32 %N, -8
				br label %vector.body

				vector.body:
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%vec.phi = phi <8 x half> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]

				; CHECK-LABEL: vec8_zero_init:
				; CHECK: vmov.i16 q8, #0x0

				%0 = getelementptr inbounds half, half* %V, i32 %index
				%1 = bitcast half* %0 to <8 x half>*
				%wide.load = load <8 x half>, <8 x half>* %1, align 2
				%2 = fadd fast <8 x half> %wide.load, %vec.phi
				%index.next = add i32 %index, 8
				%3 = icmp eq i32 %index.next, %n.vec
				br i1 %3, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf = shufflevector <8 x half> %2, <8 x half> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = fadd fast <8 x half> %2, %rdx.shuf
				%rdx.shuf11 = shufflevector <8 x half> %bin.rdx, <8 x half> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx12 = fadd fast <8 x half> %bin.rdx, %rdx.shuf11
				%rdx.shuf13 = shufflevector <8 x half> %bin.rdx12, <8 x half> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx14 = fadd fast <8 x half> %bin.rdx12, %rdx.shuf13

				; TODO: support v8f16 extractelement
				; %4 = extractelement <8 x half> %bin.rdx14, i32 0
				%cmp.n = icmp eq i32 %n.vec, %N
				br i1 %cmp.n, label %for.cond.cleanup.loopexit, label %for.body.preheader15

				for.cond.cleanup.loopexit:
				; %add.lcssa = phi half [ %4, %middle.block ], [ %add, %for.body ]
				%add.lcssa = phi half [ 0.000000e+00, %middle.block ], [ %add, %for.body ]
				%phitmp = bitcast half %add.lcssa to i16
				%phitmp9 = zext i16 %phitmp to i32
				%phitmp10 = bitcast i32 %phitmp9 to float
				br label %for.cond.cleanup

				for.cond.cleanup:
				%Tmp.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %phitmp10, %for.cond.cleanup.loopexit ]
				ret float %Tmp.0.lcssa

				for.body:
				%i.08 = phi i32 [ %inc, %for.body ], [ %i.08.ph, %for.body.preheader15 ]
				%Tmp.07 = phi half [ %add, %for.body ], [ %Tmp.07.ph, %for.body.preheader15 ]
				%arrayidx = getelementptr inbounds half, half* %V, i32 %i.08
				%V5 = load half, half* %arrayidx, align 2
				%add = fadd fast half %V5, %Tmp.07
				%inc = add nuw nsw i32 %i.08, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}


				define dso_local float @vec8_one_init(half* nocapture readonly %V, i32 %N) local_unnamed_addr #0 {
				entry:
				%cmp6 = icmp sgt i32 %N, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader:
				%min.iters.check = icmp ult i32 %N, 8
				br i1 %min.iters.check, label %for.body.preheader15, label %vector.ph

				for.body.preheader15:
				%i.08.ph = phi i32 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
				%Tmp.07.ph = phi half [ 0xH0000, %for.body.preheader ], [ 0xH8000, %middle.block ]
				br label %for.body

				vector.ph:
				%n.vec = and i32 %N, -8
				br label %vector.body

				vector.body:
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%vec.phi = phi <8 x half> [ <half 0xH3C00, half 0xH0000, half 0xH0000, half 0xH0000, half 0xH0000, half 0xH0000, half 0xH0000, half 0xH0000>, %vector.ph ], [ %2, %vector.body ]

				; CHECK-LABEL: vec8_one_init:
				; CHECK: adr r2, .LCPI1_2
				; CHECK: .LCPI1_2:
				; CHECK-NEXT: .short 15360 @ half 1
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0
				; CHECK-NEXT: .short 0 @ half 0


				%0 = getelementptr inbounds half, half* %V, i32 %index
				%1 = bitcast half* %0 to <8 x half>*
				%wide.load = load <8 x half>, <8 x half>* %1, align 2
				%2 = fadd fast <8 x half> %wide.load, %vec.phi
				%index.next = add i32 %index, 8
				%3 = icmp eq i32 %index.next, %n.vec
				br i1 %3, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf = shufflevector <8 x half> %2, <8 x half> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = fadd fast <8 x half> %2, %rdx.shuf
				%rdx.shuf11 = shufflevector <8 x half> %bin.rdx, <8 x half> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx12 = fadd fast <8 x half> %bin.rdx, %rdx.shuf11
				%rdx.shuf13 = shufflevector <8 x half> %bin.rdx12, <8 x half> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx14 = fadd fast <8 x half> %bin.rdx12, %rdx.shuf13

				; TODO: support v8f16 extractelement
				; %4 = extractelement <8 x half> %bin.rdx14, i32 0
				%cmp.n = icmp eq i32 %n.vec, %N
				br i1 %cmp.n, label %for.cond.cleanup.loopexit, label %for.body.preheader15

				for.cond.cleanup.loopexit:
				; %add.lcssa = phi half [ %4, %middle.block ], [ %add, %for.body ]
				%add.lcssa = phi half [ 0.000000e+00, %middle.block ], [ %add, %for.body ]
				%phitmp = bitcast half %add.lcssa to i16
				%phitmp9 = zext i16 %phitmp to i32
				%phitmp10 = bitcast i32 %phitmp9 to float
				br label %for.cond.cleanup

				for.cond.cleanup:
				%Tmp.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %phitmp10, %for.cond.cleanup.loopexit ]
				ret float %Tmp.0.lcssa

				for.body:
				%i.08 = phi i32 [ %inc, %for.body ], [ %i.08.ph, %for.body.preheader15 ]
				%Tmp.07 = phi half [ %add, %for.body ], [ %Tmp.07.ph, %for.body.preheader15 ]
				%arrayidx = getelementptr inbounds half, half* %V, i32 %i.08
				%V5 = load half, half* %arrayidx, align 2
				%add = fadd fast half %V5, %Tmp.07
				%inc = add nuw nsw i32 %i.08, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}



				define dso_local float @vec4_zero_init(half* nocapture readonly %V, i32 %N) local_unnamed_addr #0 {
				entry:
				%cmp6 = icmp sgt i32 %N, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader:
				%min.iters.check = icmp ult i32 %N, 8
				br i1 %min.iters.check, label %for.body.preheader15, label %vector.ph

				for.body.preheader15:
				%i.08.ph = phi i32 [ 0, %for.body.preheader ], [ %n.vec, %middle.block ]
				%Tmp.07.ph = phi half [ 0xH0000, %for.body.preheader ], [ 0xH8000, %middle.block ]
				br label %for.body

				vector.ph:
				%n.vec = and i32 %N, -8
				br label %vector.body

				vector.body:
				%index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
				%vec.phi = phi <4 x half> [ zeroinitializer, %vector.ph ], [ %2, %vector.body ]

				; CHECK-LABEL: vec4_zero_init:
				; CHECK: vmov.i16 d16, #0x0

				%0 = getelementptr inbounds half, half* %V, i32 %index
				%1 = bitcast half* %0 to <4 x half>*
				%wide.load = load <4 x half>, <4 x half>* %1, align 2
				%2 = fadd fast <4 x half> %wide.load, %vec.phi
				%index.next = add i32 %index, 8
				%3 = icmp eq i32 %index.next, %n.vec
				br i1 %3, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf = shufflevector <4 x half> %2, <4 x half> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				%bin.rdx = fadd fast <4 x half> %2, %rdx.shuf
				%rdx.shuf11 = shufflevector <4 x half> %bin.rdx, <4 x half> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
				%bin.rdx12 = fadd fast <4 x half> %bin.rdx, %rdx.shuf11
				%rdx.shuf13 = shufflevector <4 x half> %bin.rdx12, <4 x half> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
				%bin.rdx14 = fadd fast <4 x half> %bin.rdx12, %rdx.shuf13

				; TODO: support v4f16 extractelement
				; %4 = extractelement <4 x half> %bin.rdx14, i32 0
				%cmp.n = icmp eq i32 %n.vec, %N
				br i1 %cmp.n, label %for.cond.cleanup.loopexit, label %for.body.preheader15

				for.cond.cleanup.loopexit:
				; %add.lcssa = phi half [ %4, %middle.block ], [ %add, %for.body ]
				%add.lcssa = phi half [ 0.000000e+00, %middle.block ], [ %add, %for.body ]
				%phitmp = bitcast half %add.lcssa to i16
				%phitmp9 = zext i16 %phitmp to i32
				%phitmp10 = bitcast i32 %phitmp9 to float
				br label %for.cond.cleanup

				for.cond.cleanup:
				%Tmp.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %phitmp10, %for.cond.cleanup.loopexit ]
				ret float %Tmp.0.lcssa

				for.body:
				%i.08 = phi i32 [ %inc, %for.body ], [ %i.08.ph, %for.body.preheader15 ]
				%Tmp.07 = phi half [ %add, %for.body ], [ %Tmp.07.ph, %for.body.preheader15 ]
				%arrayidx = getelementptr inbounds half, half* %V, i32 %i.08
				%V5 = load half, half* %arrayidx, align 2
				%add = fadd fast half %V5, %Tmp.07
				%inc = add nuw nsw i32 %i.08, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}

				attributes #0 = { norecurse nounwind readonly "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="generic" "target-features"="+armv8.2-a,+crc,+crypto,+dsp,+fp-armv8,+fullfp16,+hwdiv,+hwdiv-arm,+neon,+ras,+strict-align,-thumb-mode" "unsafe-fp-math"="true" "use-soft-float"="false" }
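
The test writes half constants as raw bit patterns (`0xH3C00`, `0xH8000`) and the CHECK lines expect `.short 15360 @ half 1`. As a reading aid, here is a minimal sketch that decodes an IEEE half-precision bit pattern into a float. `halfBitsToFloat` is a hypothetical helper written for this note, not something from the patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decodes an IEEE 754 binary16 bit pattern (1 sign bit, 5 exponent bits,
// 10 mantissa bits) into a float.
float halfBitsToFloat(uint16_t bits) {
  const int sign = (bits >> 15) & 0x1;
  const int exponent = (bits >> 10) & 0x1f;
  const int mantissa = bits & 0x3ff;

  float magnitude;
  if (exponent == 0) {
    // Zero or subnormal: mantissa * 2^-24.
    magnitude = std::ldexp(static_cast<float>(mantissa), -24);
  } else if (exponent == 0x1f) {
    // All-ones exponent: infinity or NaN.
    magnitude = mantissa ? NAN : INFINITY;
  } else {
    // Normal: (1024 + mantissa) * 2^(exponent - 25).
    magnitude = std::ldexp(static_cast<float>(mantissa + 1024), exponent - 25);
  }
  return sign ? -magnitude : magnitude;
}
```

Decoding 0x3C00 (decimal 15360) gives 1.0, matching the `.short 15360 @ half 1` entry in the expected constant pool, and 0x8000 decodes to negative zero.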