This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

llvm/
  lib/Target/ARM/
    ARMISelLowering.h
    ARMISelLowering.cpp (3/4 inline comments done)
    ARMInstrVFP.td (2/3 inline comments done)
    ARMPredicates.td
  test/CodeGen/ARM/
    arm-bf16-pcs.ll

Differential D82372

[ARM][BFloat] Legalize bf16 type even without fullfp16.
Closed, Public

Authored by simon_tatham on Jun 23 2020, 5:36 AM.

Details

Reviewers

dmgreen
SjoerdMeijer
stuij
chill
miyuki
labrinea

Commits

rGb769eb02b526: [ARM][BFloat] Legalize bf16 type even without fullfp16.

Summary

This change permits scalar bfloats to be loaded, stored, moved and
used as function call arguments and return values, whenever the bf16
feature is supported by the subtarget.

Previously that was only supported in the presence of the fullfp16
feature, because the code generation strategy depended on instructions
from that extension. This change adds alternative code generation
strategies so that those operations can be done even without fullfp16.

The strategy for loads and stores is to replace VLDRH/VSTRH with
integer LDRH/STRH plus a move between register classes. I've written
isel patterns for those, conditional on not having the fullfp16
feature (so that in the fullfp16 case, the existing patterns will
still be used).
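
As a rough illustration of how those patterns partition the cases, the sketch below models the selection in plain C++. It mirrors the predicates used in the ARMInstrVFP.td part of this diff (`HasFPRegs16`, `HasNoFPRegs16`, `IsARM`, `IsThumb`). The `SubtargetModel` struct and the `selectBF16Load` / `selectBF16Store` functions are hypothetical names invented for this example and are not part of LLVM; the returned strings only name the instructions that appear in the patterns.

```cpp
#include <string>

// Hypothetical stand-in for the subtarget queries that the isel predicates test.
struct SubtargetModel {
  bool HasFPRegs16; // models Subtarget->hasFPRegs16()
  bool IsThumb;     // true for Thumb state, false for Arm state
};

// Describes which instruction sequence the bf16 load patterns would choose.
std::string selectBF16Load(const SubtargetModel &ST) {
  if (ST.HasFPRegs16)
    return "VLDRH"; // existing FP halfword load
  if (ST.IsThumb)
    return "t2LDRHi12 + COPY_TO_REGCLASS HPR"; // integer load, then move
  return "LDRH + COPY_TO_REGCLASS HPR";
}

// Describes which instruction sequence the bf16 store patterns would choose.
std::string selectBF16Store(const SubtargetModel &ST) {
  if (ST.HasFPRegs16)
    return "VSTRH"; // existing FP halfword store
  if (ST.IsThumb)
    return "COPY_TO_REGCLASS GPR + t2STRHi12"; // move, then integer store
  return "COPY_TO_REGCLASS GPR + STRH";
}
```

For instance, `selectBF16Load({false, true})` names the Thumb fallback, while any subtarget with `HasFPRegs16` set keeps using `VLDRH`.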

For function arguments and returns, instead of writing isel patterns
to match VMOVhr and VMOVrh, I've avoided generating those SDNodes
in the first place, by factoring out the code that constructs them
into helper functions MoveToHPR and MoveFromHPR which have a
fallback for non-fullfp16 subtargets.
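
The fallback path of those helpers can be pictured at the bit level. The following is a minimal sketch, not the LLVM code itself: it assumes the value travels in a 32-bit location with the bf16 payload in the low 16 bits, as the comments in the diff describe, and models the TRUNCATE and ZERO_EXTEND nodes with ordinary integer casts. The function names `moveToHPRBits`, `moveFromHPRBits` and `bf16BitsToFloat` are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// Models the non-fullfp16 branch of MoveToHPR: the 32-bit location is viewed
// as an integer, truncated to 16 bits, and those bits are treated as bf16.
std::uint16_t moveToHPRBits(std::uint32_t locBits) {
  return static_cast<std::uint16_t>(locBits); // corresponds to ISD::TRUNCATE
}

// Models the non-fullfp16 branch of MoveFromHPR: the bf16 bits are viewed as
// a 16-bit integer and zero-extended to the width of the 32-bit location.
std::uint32_t moveFromHPRBits(std::uint16_t bf16Bits) {
  return static_cast<std::uint32_t>(bf16Bits); // corresponds to ISD::ZERO_EXTEND
}

// Helper for inspecting results: a bf16 value holds the top 16 bits of an
// IEEE-754 binary32 value, so widening is a shift followed by a bit copy.
float bf16BitsToFloat(std::uint16_t bits) {
  std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float result;
  std::memcpy(&result, &wide, sizeof result);
  return result;
}
```

In this model any upper 16 bits of the incoming location are simply discarded on the way in and are zero on the way out, which is the behaviour the TRUNCATE / ZERO_EXTEND pair gives.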

The current output code is not especially pretty: in the new test file
you can see unnecessary store/load pairs implementing no-op bitcasts,
and lots of pointless moves back and forth between FP registers and
GPRs. But it at least works, which is an improvement on the previous
situation.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

simon_tatham created this revision. Jun 23 2020, 5:36 AM

Herald added a project: Restricted Project. Jun 23 2020, 5:36 AM

Herald added subscribers: llvm-commits, danielkiss, hiraditya, kristof.beyls.

Hi Simon, thanks for working on this. Looks good overall. A few remarks inline.

llvm/lib/Target/ARM/ARMISelLowering.cpp
730–734	Shouldn't this and the following 4 lines be conditional on `Subtarget->hasBF16()`?
llvm/lib/Target/ARM/ARMInstrVFP.td
171	Can you either define a new pattern with this predicate, or delete the `FPRegs16Pat` pattern and use the same syntax for uniformity? Just a personal preference.
llvm/lib/Target/ARM/Thumb2InstrInfo.cpp
595 ↗	(On Diff #272692)	Why do we need this block of changes?

labrinea added a reviewer: labrinea. Jun 23 2020, 6:20 AM

Harbormaster failed remote builds in B61380: Diff 272692! Jun 23 2020, 6:53 AM

New version addressing review comments.

llvm/lib/Target/ARM/ARMISelLowering.cpp
730–734	How did I miss that?! Thanks, good catch.
llvm/lib/Target/ARM/Thumb2InstrInfo.cpp
595 ↗	(On Diff #272692)	I put it in because otherwise the Thumb invocations of `llc` in my new test file failed with the 'Unsupported addressing mode' assertion in this function. However, now I look more closely, that's probably because I was using the wrong instruction – I was using `LDRH` for both Arm and Thumb, whereas I should have used `t2LDRHi12` in Thumb. With that change, I don't seem to need this extra addressing mode support any more. (But I couldn't tell you why it's not needed, because as far as I can tell `LDRH` itself still does use addressing mode 3, and is still selected in Arm state.)

(Sorry @labrinea – removing you as a reviewer was an unintentional side effect of a bad arc command.)

simon_tatham added a reviewer: labrinea. Jun 23 2020, 7:05 AM

LGTM with a nit. Can you also remove FPRegs16Pat from ARMInstrFormats.td now that it is no longer used?

This revision is now accepted and ready to land. Jun 23 2020, 7:17 AM

Harbormaster failed remote builds in B61392: Diff 272712! Jun 23 2020, 7:58 AM

dmgreen added inline comments. Jun 23 2020, 8:28 AM

llvm/lib/Target/ARM/ARMISelLowering.cpp
733	I'm not sure if setAllExpand should be guarded by !hasFullFP16? In either case they would be in pretty much the same situation - most operations are illegal with bf16. Does something go wrong always doing this?
llvm/lib/Target/ARM/ARMInstrVFP.td
166–177	I would expect HasNoFPRegs16 + f16 to never come up? If so you could alternatively move that single pattern back into the instruction above.

It isn't unused – there are still some isel patterns using it, which I haven't touched in this commit.

simon_tatham marked an inline comment as done. Jun 23 2020, 8:35 AM

simon_tatham added inline comments.

llvm/lib/Target/ARM/ARMInstrVFP.td
166–177	I'll give that a try too.

Updated for @dmgreen's review suggestions, which both seem to work as far as I can see.

Thanks. If that setAllExpand does start causing problems we can always modify it. LGTM

simon_tatham marked an inline comment as done. Jun 23 2020, 9:09 AM

simon_tatham added inline comments.

llvm/lib/Target/ARM/ARMISelLowering.cpp
733	(Huh, where did my reply to this comment go? I must have forgotten to press 'Save Draft'.) I did it that way in order to avoid affecting the hasFullFP16 code path, because that way I was confident I wouldn't break it as a side effect :-) But I've just tried out the version with `setAllExpand` unconditional, and it seems to work just as well.

Harbormaster failed remote builds in B61413: Diff 272744! Jun 23 2020, 10:10 AM

Closed by commit rGb769eb02b526: [ARM][BFloat] Legalize bf16 type even without fullfp16. (authored by simon_tatham). Jun 24 2020, 2:08 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

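Path	Size
llvm/lib/Target/ARM/ARMISelLowering.h	7 lines
llvm/lib/Target/ARM/ARMISelLowering.cpp	97 lines
llvm/lib/Target/ARM/ARMInstrVFP.td	36 lines
llvm/lib/Target/ARM/ARMPredicates.td	3 lines
llvm/test/CodeGen/ARM/arm-bf16-pcs.ll	319 lines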

Diff 272941

llvm/lib/Target/ARM/ARMISelLowering.h

Show First 20 Lines • Show All 758 Lines • ▼ Show 20 Lines	private:
SDValue LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,		SDValue LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,
const ARMSubtarget *ST) const;		const ARMSubtarget *ST) const;
SDValue LowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerDivRem(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerDivRem(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerDIV_Windows(SDValue Op, SelectionDAG &DAG, bool Signed) const;		SDValue LowerDIV_Windows(SDValue Op, SelectionDAG &DAG, bool Signed) const;
void ExpandDIV_Windows(SDValue Op, SelectionDAG &DAG, bool Signed,		void ExpandDIV_Windows(SDValue Op, SelectionDAG &DAG, bool Signed,
SmallVectorImpl<SDValue> &Results) const;		SmallVectorImpl<SDValue> &Results) const;
		SDValue ExpandBITCAST(SDNode *N, SelectionDAG &DAG,
		const ARMSubtarget *Subtarget) const;
SDValue LowerWindowsDIVLibCall(SDValue Op, SelectionDAG &DAG, bool Signed,		SDValue LowerWindowsDIVLibCall(SDValue Op, SelectionDAG &DAG, bool Signed,
SDValue &Chain) const;		SDValue &Chain) const;
SDValue LowerREM(SDNode *N, SelectionDAG &DAG) const;		SDValue LowerREM(SDNode *N, SelectionDAG &DAG) const;
SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFSETCC(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFSETCC(SDValue Op, SelectionDAG &DAG) const;
void lowerABS(SDNode *N, SmallVectorImpl<SDValue> &Results,		void lowerABS(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const;		SelectionDAG &DAG) const;
void LowerLOAD(SDNode *N, SmallVectorImpl<SDValue> &Results,		void LowerLOAD(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const;		SelectionDAG &DAG) const;

Register getRegisterByName(const char* RegName, LLT VT,		Register getRegisterByName(const char* RegName, LLT VT,
const MachineFunction &MF) const override;		const MachineFunction &MF) const override;

SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,		SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
SmallVectorImpl<SDNode *> &Created) const override;		SmallVectorImpl<SDNode *> &Created) const override;

bool isFMAFasterThanFMulAndFAdd(const MachineFunction &MF,		bool isFMAFasterThanFMulAndFAdd(const MachineFunction &MF,
EVT VT) const override;		EVT VT) const override;

		SDValue MoveToHPR(const SDLoc &dl, SelectionDAG &DAG, MVT LocVT, MVT ValVT,
		SDValue Val) const;
		SDValue MoveFromHPR(const SDLoc &dl, SelectionDAG &DAG, MVT LocVT,
		MVT ValVT, SDValue Val) const;

SDValue ReconstructShuffle(SDValue Op, SelectionDAG &DAG) const;		SDValue ReconstructShuffle(SDValue Op, SelectionDAG &DAG) const;

SDValue LowerCallResult(SDValue Chain, SDValue InFlag,		SDValue LowerCallResult(SDValue Chain, SDValue InFlag,
CallingConv::ID CallConv, bool isVarArg,		CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins,		const SmallVectorImpl<ISD::InputArg> &Ins,
const SDLoc &dl, SelectionDAG &DAG,		const SDLoc &dl, SelectionDAG &DAG,
SmallVectorImpl<SDValue> &InVals, bool isThisReturn,		SmallVectorImpl<SDValue> &InVals, bool isThisReturn,
SDValue ThisVal) const;		SDValue ThisVal) const;

llvm/lib/Target/ARM/ARMISelLowering.cpp

Show First 20 Lines • Show All 719 Lines • ▼ Show 20 Lines	ARMTargetLowering::ARMTargetLowering(const TargetMachine &TM,

if (Subtarget->hasFullFP16()) {		if (Subtarget->hasFullFP16()) {
addRegisterClass(MVT::f16, &ARM::HPRRegClass);		addRegisterClass(MVT::f16, &ARM::HPRRegClass);
setOperationAction(ISD::BITCAST, MVT::i16, Custom);		setOperationAction(ISD::BITCAST, MVT::i16, Custom);
setOperationAction(ISD::BITCAST, MVT::f16, Custom);		setOperationAction(ISD::BITCAST, MVT::f16, Custom);

setOperationAction(ISD::FMINNUM, MVT::f16, Legal);		setOperationAction(ISD::FMINNUM, MVT::f16, Legal);
setOperationAction(ISD::FMAXNUM, MVT::f16, Legal);		setOperationAction(ISD::FMAXNUM, MVT::f16, Legal);
		}

// For the time being bfloat is only supported when fullfp16 is present.		if (Subtarget->hasBF16()) {
if (Subtarget->hasBF16())
addRegisterClass(MVT::bf16, &ARM::HPRRegClass);		addRegisterClass(MVT::bf16, &ARM::HPRRegClass);
		setAllExpand(MVT::bf16);
		if (!Subtarget->hasFullFP16())
		[Inline comment, dmgreen, not done] I'm not sure if setAllExpand should be guarded by !hasFullFP16? In either case they would be in pretty much the same situation - most operations are illegal with bf16. Does something go wrong always doing this?
		[Inline comment, simon_tatham, done] (Huh, where did my reply to this comment go? I must have forgotten to press 'Save Draft'.) I did it that way in order to avoid affecting the hasFullFP16 code path, because that way I was confident I wouldn't break it as a side effect :-) But I've just tried out the version with `setAllExpand` unconditional, and it seems to work just as well.
		setOperationAction(ISD::BITCAST, MVT::bf16, Custom);
		[Inline comment, labrinea, done] Shouldn't this and the following 4 lines be conditional on `Subtarget->hasBF16()`?
		[Inline comment, simon_tatham, done] How did I miss that?! Thanks, good catch.
}		}

for (MVT VT : MVT::fixedlen_vector_valuetypes()) {		for (MVT VT : MVT::fixedlen_vector_valuetypes()) {
for (MVT InnerVT : MVT::fixedlen_vector_valuetypes()) {		for (MVT InnerVT : MVT::fixedlen_vector_valuetypes()) {
setTruncStoreAction(VT, InnerVT, Expand);		setTruncStoreAction(VT, InnerVT, Expand);
addAllExtLoads(VT, InnerVT, Expand);		addAllExtLoads(VT, InnerVT, Expand);
}		}

▲ Show 20 Lines • Show All 1,265 Lines • ▼ Show 20 Lines	case CallingConv::GHC:
return (Return ? RetCC_ARM_APCS : CC_ARM_APCS_GHC);		return (Return ? RetCC_ARM_APCS : CC_ARM_APCS_GHC);
case CallingConv::PreserveMost:		case CallingConv::PreserveMost:
return (Return ? RetCC_ARM_AAPCS : CC_ARM_AAPCS);		return (Return ? RetCC_ARM_AAPCS : CC_ARM_AAPCS);
case CallingConv::CFGuard_Check:		case CallingConv::CFGuard_Check:
return (Return ? RetCC_ARM_AAPCS : CC_ARM_Win32_CFGuard_Check);		return (Return ? RetCC_ARM_AAPCS : CC_ARM_Win32_CFGuard_Check);
}		}
}		}

		SDValue ARMTargetLowering::MoveToHPR(const SDLoc &dl, SelectionDAG &DAG,
		MVT LocVT, MVT ValVT, SDValue Val) const {
		Val = DAG.getNode(ISD::BITCAST, dl, MVT::getIntegerVT(LocVT.getSizeInBits()),
		Val);
		if (Subtarget->hasFullFP16()) {
		Val = DAG.getNode(ARMISD::VMOVhr, dl, ValVT, Val);
		} else {
		Val = DAG.getNode(ISD::TRUNCATE, dl,
		MVT::getIntegerVT(ValVT.getSizeInBits()), Val);
		Val = DAG.getNode(ISD::BITCAST, dl, ValVT, Val);
		}
		return Val;
		}

		SDValue ARMTargetLowering::MoveFromHPR(const SDLoc &dl, SelectionDAG &DAG,
		MVT LocVT, MVT ValVT,
		SDValue Val) const {
		if (Subtarget->hasFullFP16()) {
		Val = DAG.getNode(ARMISD::VMOVrh, dl,
		MVT::getIntegerVT(LocVT.getSizeInBits()), Val);
		} else {
		Val = DAG.getNode(ISD::BITCAST, dl,
		MVT::getIntegerVT(ValVT.getSizeInBits()), Val);
		Val = DAG.getNode(ISD::ZERO_EXTEND, dl,
		MVT::getIntegerVT(LocVT.getSizeInBits()), Val);
		}
		return DAG.getNode(ISD::BITCAST, dl, LocVT, Val);
		}

/// LowerCallResult - Lower the result values of a call into the		/// LowerCallResult - Lower the result values of a call into the
/// appropriate copies out of appropriate physical registers.		/// appropriate copies out of appropriate physical registers.
SDValue ARMTargetLowering::LowerCallResult(		SDValue ARMTargetLowering::LowerCallResult(
SDValue Chain, SDValue InFlag, CallingConv::ID CallConv, bool isVarArg,		SDValue Chain, SDValue InFlag, CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,		const SmallVectorImpl<ISD::InputArg> &Ins, const SDLoc &dl,
SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals, bool isThisReturn,		SelectionDAG &DAG, SmallVectorImpl<SDValue> &InVals, bool isThisReturn,
SDValue ThisVal) const {		SDValue ThisVal) const {
// Assign locations to each value returned by this call.		// Assign locations to each value returned by this call.
▲ Show 20 Lines • Show All 65 Lines • ▼ Show 20 Lines	case CCValAssign::BCvt:
Val = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), Val);		Val = DAG.getNode(ISD::BITCAST, dl, VA.getValVT(), Val);
break;		break;
}		}

// f16 arguments have their size extended to 4 bytes and passed as if they		// f16 arguments have their size extended to 4 bytes and passed as if they
// had been copied to the LSBs of a 32-bit register.		// had been copied to the LSBs of a 32-bit register.
// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)		// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)
if (VA.needsCustom() &&		if (VA.needsCustom() &&
(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16)) {		(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16))
assert(Subtarget->hasFullFP16() &&		Val = MoveToHPR(dl, DAG, VA.getLocVT(), VA.getValVT(), Val);
"Lowering half precision fp return without full fp16 support");
Val = DAG.getNode(ISD::BITCAST, dl,
MVT::getIntegerVT(VA.getLocVT().getSizeInBits()), Val);
Val = DAG.getNode(ARMISD::VMOVhr, dl, VA.getValVT(), Val);
}

InVals.push_back(Val);		InVals.push_back(Val);
}		}

return Chain;		return Chain;
}		}

/// LowerMemOpCallTo - Store the argument to the stack.		/// LowerMemOpCallTo - Store the argument to the stack.
▲ Show 20 Lines • Show All 158 Lines • ▼ Show 20 Lines	case CCValAssign::BCvt:
break;		break;
}		}

// f16 arguments have their size extended to 4 bytes and passed as if they		// f16 arguments have their size extended to 4 bytes and passed as if they
// had been copied to the LSBs of a 32-bit register.		// had been copied to the LSBs of a 32-bit register.
// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)		// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)
if (VA.needsCustom() &&		if (VA.needsCustom() &&
(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16)) {		(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16)) {
assert(Subtarget->hasFullFP16() &&		Arg = MoveFromHPR(dl, DAG, VA.getLocVT(), VA.getValVT(), Arg);
"Lowering half precision fp argument without full fp16 support");
Arg = DAG.getNode(ARMISD::VMOVrh, dl,
MVT::getIntegerVT(VA.getLocVT().getSizeInBits()), Arg);
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
} else {		} else {
// f16 arguments could have been extended prior to argument lowering.		// f16 arguments could have been extended prior to argument lowering.
// Mask them arguments if this is a CMSE nonsecure call.		// Mask them arguments if this is a CMSE nonsecure call.
auto ArgVT = Outs[realArgIdx].ArgVT;		auto ArgVT = Outs[realArgIdx].ArgVT;
if (isCmseNSCall && (ArgVT == MVT::f16)) {		if (isCmseNSCall && (ArgVT == MVT::f16)) {
auto LocBits = VA.getLocVT().getSizeInBits();		auto LocBits = VA.getLocVT().getSizeInBits();
auto MaskValue = APInt::getLowBitsSet(LocBits, ArgVT.getSizeInBits());		auto MaskValue = APInt::getLowBitsSet(LocBits, ArgVT.getSizeInBits());
SDValue Mask =		SDValue Mask =
▲ Show 20 Lines • Show All 667 Lines • ▼ Show 20 Lines	case CCValAssign::BCvt:
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);		Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
break;		break;
}		}

// Mask f16 arguments if this is a CMSE nonsecure entry.		// Mask f16 arguments if this is a CMSE nonsecure entry.
auto RetVT = Outs[realRVLocIdx].ArgVT;		auto RetVT = Outs[realRVLocIdx].ArgVT;
if (AFI->isCmseNSEntryFunction() && (RetVT == MVT::f16)) {		if (AFI->isCmseNSEntryFunction() && (RetVT == MVT::f16)) {
if (VA.needsCustom() && VA.getValVT() == MVT::f16) {		if (VA.needsCustom() && VA.getValVT() == MVT::f16) {
assert(Subtarget->hasFullFP16() &&		Arg = MoveFromHPR(dl, DAG, VA.getLocVT(), VA.getValVT(), Arg);
"Lowering f16 type argument without full fp16 support");
Arg =
DAG.getNode(ARMISD::VMOVrh, dl,
MVT::getIntegerVT(VA.getLocVT().getSizeInBits()), Arg);
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
} else {		} else {
auto LocBits = VA.getLocVT().getSizeInBits();		auto LocBits = VA.getLocVT().getSizeInBits();
auto MaskValue = APInt::getLowBitsSet(LocBits, RetVT.getSizeInBits());		auto MaskValue = APInt::getLowBitsSet(LocBits, RetVT.getSizeInBits());
SDValue Mask =		SDValue Mask =
DAG.getConstant(MaskValue, dl, MVT::getIntegerVT(LocBits));		DAG.getConstant(MaskValue, dl, MVT::getIntegerVT(LocBits));
Arg = DAG.getNode(ISD::BITCAST, dl, MVT::getIntegerVT(LocBits), Arg);		Arg = DAG.getNode(ISD::BITCAST, dl, MVT::getIntegerVT(LocBits), Arg);
Arg = DAG.getNode(ISD::AND, dl, MVT::getIntegerVT(LocBits), Arg, Mask);		Arg = DAG.getNode(ISD::AND, dl, MVT::getIntegerVT(LocBits), Arg, Mask);
Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);		Arg = DAG.getNode(ISD::BITCAST, dl, VA.getLocVT(), Arg);
▲ Show 20 Lines • Show All 1,353 Lines • ▼ Show 20 Lines	if (VA.isRegLoc()) {
ArgValue = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), ArgValue);		ArgValue = DAG.getNode(ISD::TRUNCATE, dl, VA.getValVT(), ArgValue);
break;		break;
}		}

// f16 arguments have their size extended to 4 bytes and passed as if they		// f16 arguments have their size extended to 4 bytes and passed as if they
// had been copied to the LSBs of a 32-bit register.		// had been copied to the LSBs of a 32-bit register.
// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)		// For that, it's passed extended to i32 (soft ABI) or to f32 (hard ABI)
if (VA.needsCustom() &&		if (VA.needsCustom() &&
(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16)) {		(VA.getValVT() == MVT::f16 || VA.getValVT() == MVT::bf16))
assert(Subtarget->hasFullFP16() &&		ArgValue = MoveToHPR(dl, DAG, VA.getLocVT(), VA.getValVT(), ArgValue);
"Lowering half precision fp argument without full fp16 support");
ArgValue = DAG.getNode(ISD::BITCAST, dl,
MVT::getIntegerVT(VA.getLocVT().getSizeInBits()),
ArgValue);
ArgValue = DAG.getNode(ARMISD::VMOVhr, dl, VA.getValVT(), ArgValue);
}

InVals.push_back(ArgValue);		InVals.push_back(ArgValue);
} else { // VA.isRegLoc()		} else { // VA.isRegLoc()
// sanity check		// sanity check
assert(VA.isMemLoc());		assert(VA.isMemLoc());
assert(VA.getValVT() != MVT::i64 && "i64 should already be lowered");		assert(VA.getValVT() != MVT::i64 && "i64 should already be lowered");

int index = VA.getValNo();		int index = VA.getValNo();
▲ Show 20 Lines • Show All 1,563 Lines • ▼ Show 20 Lines	return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, DstVT, BitCast,
DAG.getConstant(NewIndex.getZExtValue(), dl, MVT::i32));		DAG.getConstant(NewIndex.getZExtValue(), dl, MVT::i32));
}		}

/// ExpandBITCAST - If the target supports VFP, this function is called to		/// ExpandBITCAST - If the target supports VFP, this function is called to
/// expand a bit convert where either the source or destination type is i64 to		/// expand a bit convert where either the source or destination type is i64 to
/// use a VMOVDRR or VMOVRRD node. This should not be done when the non-i64		/// use a VMOVDRR or VMOVRRD node. This should not be done when the non-i64
/// operand type is illegal (e.g., v2f32 for a target that doesn't support		/// operand type is illegal (e.g., v2f32 for a target that doesn't support
/// vectors), since the legalizer won't know what to do with that.		/// vectors), since the legalizer won't know what to do with that.
static SDValue ExpandBITCAST(SDNode *N, SelectionDAG &DAG,		SDValue ARMTargetLowering::ExpandBITCAST(SDNode *N, SelectionDAG &DAG,
const ARMSubtarget *Subtarget) {		const ARMSubtarget *Subtarget) const {
const TargetLowering &TLI = DAG.getTargetLoweringInfo();		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
SDLoc dl(N);		SDLoc dl(N);
SDValue Op = N->getOperand(0);		SDValue Op = N->getOperand(0);

// This function is only supposed to be called for i16 and i64 types, either		// This function is only supposed to be called for i16 and i64 types, either
// as the source or destination of the bit convert.		// as the source or destination of the bit convert.
EVT SrcVT = Op.getValueType();		EVT SrcVT = Op.getValueType();
EVT DstVT = N->getValueType(0);		EVT DstVT = N->getValueType(0);

if (SrcVT == MVT::i16 && (DstVT == MVT::f16 || DstVT == MVT::bf16)) {		if ((SrcVT == MVT::i16 || SrcVT == MVT::i32) &&
if (!Subtarget->hasFullFP16())		(DstVT == MVT::f16 || DstVT == MVT::bf16))
return SDValue();		return MoveToHPR(SDLoc(N), DAG, MVT::i32, DstVT.getSimpleVT(),
// (b)f16 bitcast i16 -> VMOVhr
return DAG.getNode(ARMISD::VMOVhr, SDLoc(N), DstVT,
DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), MVT::i32, Op));		DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), MVT::i32, Op));
}

if ((SrcVT == MVT::f16 || SrcVT == MVT::bf16) && DstVT == MVT::i16) {		if ((DstVT == MVT::i16 || DstVT == MVT::i32) &&
if (!Subtarget->hasFullFP16())		(SrcVT == MVT::f16 || SrcVT == MVT::bf16))
return SDValue();		return DAG.getNode(
// i16 bitcast (b)f16 -> VMOVrh		ISD::TRUNCATE, SDLoc(N), DstVT,
return DAG.getNode(ISD::TRUNCATE, SDLoc(N), MVT::i16,		MoveFromHPR(SDLoc(N), DAG, MVT::i32, SrcVT.getSimpleVT(), Op));
DAG.getNode(ARMISD::VMOVrh, SDLoc(N), MVT::i32, Op));
}

if (!(SrcVT == MVT::i64 || DstVT == MVT::i64))		if (!(SrcVT == MVT::i64 || DstVT == MVT::i64))
return SDValue();		return SDValue();

// Turn i64->f64 into VMOVDRR.		// Turn i64->f64 into VMOVDRR.
if (SrcVT == MVT::i64 && TLI.isTypeLegal(DstVT)) {		if (SrcVT == MVT::i64 && TLI.isTypeLegal(DstVT)) {
// Do not force values to GPRs (this is what VMOVDRR does for the inputs)		// Do not force values to GPRs (this is what VMOVDRR does for the inputs)
// if we can combine the bitcast with its source.		// if we can combine the bitcast with its source.

llvm/lib/Target/ARM/ARMInstrVFP.td

Show First 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	def VLDRS : ASI5<0b1101, 0b01, (outs SPR:$Sd), (ins addrmode5:$addr),
// Some single precision VFP instructions may be executed on both NEON and VFP		// Some single precision VFP instructions may be executed on both NEON and VFP
// pipelines.		// pipelines.
let D = VFPNeonDomain;		let D = VFPNeonDomain;
}		}

let isUnpredicable = 1 in		let isUnpredicable = 1 in
def VLDRH : AHI5<0b1101, 0b01, (outs HPR:$Sd), (ins addrmode5fp16:$addr),		def VLDRH : AHI5<0b1101, 0b01, (outs HPR:$Sd), (ins addrmode5fp16:$addr),
IIC_fpLoad16, "vldr", ".16\t$Sd, $addr",		IIC_fpLoad16, "vldr", ".16\t$Sd, $addr",
[]>,		[(set HPR:$Sd, (f16 (alignedload16 addrmode5fp16:$addr)))]>,
Requires<[HasFPRegs16]>;		Requires<[HasFPRegs16]>;

} // End of 'let canFoldAsLoad = 1, isReMaterializable = 1 in'		} // End of 'let canFoldAsLoad = 1, isReMaterializable = 1 in'

def : FPRegs16Pat<(f16 (alignedload16 addrmode5fp16:$addr)),		def : Pat<(bf16 (alignedload16 addrmode5fp16:$addr)),
(VLDRH addrmode5fp16:$addr)>;		(VLDRH addrmode5fp16:$addr)> {
def : FPRegs16Pat<(bf16 (alignedload16 addrmode5fp16:$addr)),		let Predicates = [HasFPRegs16];
(VLDRH addrmode5fp16:$addr)>;		}
		def : Pat<(bf16 (alignedload16 addrmode3:$addr)),
		(COPY_TO_REGCLASS (LDRH addrmode3:$addr), HPR)> {
		[Inline comment, labrinea, done] Can you either define a new pattern with this predicate, or delete the `FPRegs16Pat` pattern and use the same syntax for uniformity? Just a personal preference.
		let Predicates = [HasNoFPRegs16, IsARM];
		}
		def : Pat<(bf16 (alignedload16 t2addrmode_imm12:$addr)),
		(COPY_TO_REGCLASS (t2LDRHi12 t2addrmode_imm12:$addr), HPR)> {
		let Predicates = [HasNoFPRegs16, IsThumb];
		}
		[Inline comment, dmgreen, not done] I would expect HasNoFPRegs16 + f16 to never come up? If so you could alternatively move that single pattern back into the instruction above.
		[Inline comment, simon_tatham, done] I'll give that a try too.

def VSTRD : ADI5<0b1101, 0b00, (outs), (ins DPR:$Dd, addrmode5:$addr),		def VSTRD : ADI5<0b1101, 0b00, (outs), (ins DPR:$Dd, addrmode5:$addr),
IIC_fpStore64, "vstr", "\t$Dd, $addr",		IIC_fpStore64, "vstr", "\t$Dd, $addr",
[(alignedstore32 (f64 DPR:$Dd), addrmode5:$addr)]>,		[(alignedstore32 (f64 DPR:$Dd), addrmode5:$addr)]>,
Requires<[HasFPRegs]>;		Requires<[HasFPRegs]>;

def VSTRS : ASI5<0b1101, 0b00, (outs), (ins SPR:$Sd, addrmode5:$addr),		def VSTRS : ASI5<0b1101, 0b00, (outs), (ins SPR:$Sd, addrmode5:$addr),
IIC_fpStore32, "vstr", "\t$Sd, $addr",		IIC_fpStore32, "vstr", "\t$Sd, $addr",
[(alignedstore32 SPR:$Sd, addrmode5:$addr)]>,		[(alignedstore32 SPR:$Sd, addrmode5:$addr)]>,
Requires<[HasFPRegs]> {		Requires<[HasFPRegs]> {
// Some single precision VFP instructions may be executed on both NEON and VFP		// Some single precision VFP instructions may be executed on both NEON and VFP
// pipelines.		// pipelines.
let D = VFPNeonDomain;		let D = VFPNeonDomain;
}		}

let isUnpredicable = 1 in		let isUnpredicable = 1 in
def VSTRH : AHI5<0b1101, 0b00, (outs), (ins HPR:$Sd, addrmode5fp16:$addr),		def VSTRH : AHI5<0b1101, 0b00, (outs), (ins HPR:$Sd, addrmode5fp16:$addr),
IIC_fpStore16, "vstr", ".16\t$Sd, $addr",		IIC_fpStore16, "vstr", ".16\t$Sd, $addr",
[]>,		[(alignedstore16 (f16 HPR:$Sd), addrmode5fp16:$addr)]>,
Requires<[HasFPRegs16]>;		Requires<[HasFPRegs16]>;

def : FPRegs16Pat<(alignedstore16 (f16 HPR:$Sd), addrmode5fp16:$addr),		def : Pat<(alignedstore16 (bf16 HPR:$Sd), addrmode5fp16:$addr),
(VSTRH (f16 HPR:$Sd), addrmode5fp16:$addr)>;		(VSTRH (bf16 HPR:$Sd), addrmode5fp16:$addr)> {
def : FPRegs16Pat<(alignedstore16 (bf16 HPR:$Sd), addrmode5fp16:$addr),		let Predicates = [HasFPRegs16];
(VSTRH (bf16 HPR:$Sd), addrmode5fp16:$addr)>;		}
		def : Pat<(alignedstore16 (bf16 HPR:$Sd), addrmode3:$addr),
		(STRH (COPY_TO_REGCLASS $Sd, GPR), addrmode3:$addr)> {
		let Predicates = [HasNoFPRegs16, IsARM];
		}
		def : Pat<(alignedstore16 (bf16 HPR:$Sd), t2addrmode_imm12:$addr),
		(t2STRHi12 (COPY_TO_REGCLASS $Sd, GPR), t2addrmode_imm12:$addr)> {
		let Predicates = [HasNoFPRegs16, IsThumb];
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Load / store multiple Instructions.		// Load / store multiple Instructions.
//		//

multiclass vfp_ldst_mult<string asm, bit L_bit,		multiclass vfp_ldst_mult<string asm, bit L_bit,
InstrItinClass itin, InstrItinClass itin_upd> {		InstrItinClass itin, InstrItinClass itin_upd> {
let Predicates = [HasFPRegs] in {		let Predicates = [HasFPRegs] in {

llvm/lib/Target/ARM/ARMPredicates.td

Show All 38 Lines	def HasCDE : Predicate<"Subtarget->hasCDEOps()">,
AssemblerPredicate<(all_of HasCDEOps),		AssemblerPredicate<(all_of HasCDEOps),
"cde">;		"cde">;
def HasFPRegs : Predicate<"Subtarget->hasFPRegs()">,		def HasFPRegs : Predicate<"Subtarget->hasFPRegs()">,
AssemblerPredicate<(all_of FeatureFPRegs),		AssemblerPredicate<(all_of FeatureFPRegs),
"fp registers">;		"fp registers">;
def HasFPRegs16 : Predicate<"Subtarget->hasFPRegs16()">,		def HasFPRegs16 : Predicate<"Subtarget->hasFPRegs16()">,
AssemblerPredicate<(all_of FeatureFPRegs16),		AssemblerPredicate<(all_of FeatureFPRegs16),
"16-bit fp registers">;		"16-bit fp registers">;
		def HasNoFPRegs16 : Predicate<"!Subtarget->hasFPRegs16()">,
		AssemblerPredicate<(all_of (not FeatureFPRegs16)),
		"16-bit fp registers">;
def HasFPRegs64 : Predicate<"Subtarget->hasFPRegs64()">,		def HasFPRegs64 : Predicate<"Subtarget->hasFPRegs64()">,
AssemblerPredicate<(all_of FeatureFPRegs64),		AssemblerPredicate<(all_of FeatureFPRegs64),
"64-bit fp registers">;		"64-bit fp registers">;
def HasFPRegsV8_1M : Predicate<"Subtarget->hasFPRegs() && Subtarget->hasV8_1MMainlineOps()">,		def HasFPRegsV8_1M : Predicate<"Subtarget->hasFPRegs() && Subtarget->hasV8_1MMainlineOps()">,
AssemblerPredicate<(all_of FeatureFPRegs, HasV8_1MMainlineOps),		AssemblerPredicate<(all_of FeatureFPRegs, HasV8_1MMainlineOps),
"armv8.1m.main with FP or MVE">;		"armv8.1m.main with FP or MVE">;
def HasV6T2 : Predicate<"Subtarget->hasV6T2Ops()">,		def HasV6T2 : Predicate<"Subtarget->hasV6T2Ops()">,
AssemblerPredicate<(all_of HasV6T2Ops), "armv6t2">;		AssemblerPredicate<(all_of HasV6T2Ops), "armv6t2">;

llvm/test/CodeGen/ARM/arm-bf16-pcs.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple armv8.6a-arm-none-eabi -o - %s | FileCheck %s --check-prefix=BASE --check-prefix=BASE-ARM
				; RUN: llc -mtriple thumbv8.6a-arm-none-eabi -o - %s | FileCheck %s --check-prefix=BASE --check-prefix=BASE-THUMB
				; RUN: llc -mtriple armv8.6a-arm-none-eabi -mattr=+fullfp16 -o - %s | FileCheck %s --check-prefix=FULLFP16 --check-prefix=FULLFP16-ARM
				; RUN: llc -mtriple thumbv8.6a-arm-none-eabi -mattr=+fullfp16 -o - %s | FileCheck %s --check-prefix=FULLFP16 --check-prefix=FULLFP16-THUMB

				define bfloat @bf_load_soft(bfloat* %p) {
				; BASE-LABEL: bf_load_soft:
				; BASE: @ %bb.0:
				; BASE-NEXT: ldrh r0, [r0]
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_load_soft:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vldr.16 s0, [r0]
				; FULLFP16-NEXT: vmov r0, s0
				; FULLFP16-NEXT: bx lr
				%f = load bfloat, bfloat* %p, align 2
				ret bfloat %f
				}

				define arm_aapcs_vfpcc bfloat @bf_load_hard(bfloat* %p) {
				; BASE-LABEL: bf_load_hard:
				; BASE: @ %bb.0:
				; BASE-NEXT: ldrh r0, [r0]
				; BASE-NEXT: vmov s0, r0
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_load_hard:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vldr.16 s0, [r0]
				; FULLFP16-NEXT: bx lr
				%f = load bfloat, bfloat* %p, align 2
				ret bfloat %f
				}

				define void @bf_store_soft(bfloat* %p, bfloat %f) {
				; BASE-LABEL: bf_store_soft:
				; BASE: @ %bb.0:
				; BASE-NEXT: strh r1, [r0]
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_store_soft:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vmov.f16 s0, r1
				; FULLFP16-NEXT: vstr.16 s0, [r0]
				; FULLFP16-NEXT: bx lr
				store bfloat %f, bfloat* %p, align 2
				ret void
				}

				define arm_aapcs_vfpcc void @bf_store_hard(bfloat* %p, bfloat %f) {
				; BASE-LABEL: bf_store_hard:
				; BASE: @ %bb.0:
				; BASE-NEXT: vmov r1, s0
				; BASE-NEXT: strh r1, [r0]
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_store_hard:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vstr.16 s0, [r0]
				; FULLFP16-NEXT: bx lr
				store bfloat %f, bfloat* %p, align 2
				ret void
				}

				define i32 @bf_to_int_soft(bfloat %f) {
				; BASE-LABEL: bf_to_int_soft:
				; BASE: @ %bb.0:
				; BASE-NEXT: uxth r0, r0
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_to_int_soft:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vmov.f16 s0, r0
				; FULLFP16-NEXT: vmov.f16 r0, s0
				; FULLFP16-NEXT: bx lr
				%h = bitcast bfloat %f to i16
				%w = zext i16 %h to i32
				ret i32 %w
				}

				define arm_aapcs_vfpcc i32 @bf_to_int_hard(bfloat %f) {
				; BASE-LABEL: bf_to_int_hard:
				; BASE: @ %bb.0:
				; BASE-NEXT: vmov r0, s0
				; BASE-NEXT: uxth r0, r0
				; BASE-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_to_int_hard:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vmov.f16 r0, s0
				; FULLFP16-NEXT: bx lr
				%h = bitcast bfloat %f to i16
				%w = zext i16 %h to i32
				ret i32 %w
				}

				define bfloat @bf_from_int_soft(i32 %w) {
				; BASE-ARM-LABEL: bf_from_int_soft:
				; BASE-ARM: @ %bb.0:
				; BASE-ARM-NEXT: .pad #4
				; BASE-ARM-NEXT: sub sp, sp, #4
				; BASE-ARM-NEXT: strh r0, [sp, #2]
				; BASE-ARM-NEXT: ldrh r0, [sp, #2]
				; BASE-ARM-NEXT: add sp, sp, #4
				; BASE-ARM-NEXT: bx lr
				;
				; BASE-THUMB-LABEL: bf_from_int_soft:
				; BASE-THUMB: @ %bb.0:
				; BASE-THUMB-NEXT: .pad #4
				; BASE-THUMB-NEXT: sub sp, #4
				; BASE-THUMB-NEXT: strh.w r0, [sp, #2]
				; BASE-THUMB-NEXT: ldrh.w r0, [sp, #2]
				; BASE-THUMB-NEXT: add sp, #4
				; BASE-THUMB-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_from_int_soft:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vmov.f16 s0, r0
				; FULLFP16-NEXT: vmov r0, s0
				; FULLFP16-NEXT: bx lr
				%h = trunc i32 %w to i16
				%f = bitcast i16 %h to bfloat
				ret bfloat %f
				}

				define arm_aapcs_vfpcc bfloat @bf_from_int_hard(i32 %w) {
				; BASE-ARM-LABEL: bf_from_int_hard:
				; BASE-ARM: @ %bb.0:
				; BASE-ARM-NEXT: .pad #4
				; BASE-ARM-NEXT: sub sp, sp, #4
				; BASE-ARM-NEXT: strh r0, [sp, #2]
				; BASE-ARM-NEXT: ldrh r0, [sp, #2]
				; BASE-ARM-NEXT: vmov s0, r0
				; BASE-ARM-NEXT: add sp, sp, #4
				; BASE-ARM-NEXT: bx lr
				;
				; BASE-THUMB-LABEL: bf_from_int_hard:
				; BASE-THUMB: @ %bb.0:
				; BASE-THUMB-NEXT: .pad #4
				; BASE-THUMB-NEXT: sub sp, #4
				; BASE-THUMB-NEXT: strh.w r0, [sp, #2]
				; BASE-THUMB-NEXT: ldrh.w r0, [sp, #2]
				; BASE-THUMB-NEXT: vmov s0, r0
				; BASE-THUMB-NEXT: add sp, #4
				; BASE-THUMB-NEXT: bx lr
				;
				; FULLFP16-LABEL: bf_from_int_hard:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: vmov.f16 s0, r0
				; FULLFP16-NEXT: bx lr
				%h = trunc i32 %w to i16
				%f = bitcast i16 %h to bfloat
				ret bfloat %f
				}

				define bfloat @test_fncall_soft(bfloat %bf, bfloat (bfloat, bfloat)* %f) {
				; BASE-ARM-LABEL: test_fncall_soft:
				; BASE-ARM: @ %bb.0:
				; BASE-ARM-NEXT: .save {r4, r5, r11, lr}
				; BASE-ARM-NEXT: push {r4, r5, r11, lr}
				; BASE-ARM-NEXT: .pad #8
				; BASE-ARM-NEXT: sub sp, sp, #8
				; BASE-ARM-NEXT: uxth r5, r0
				; BASE-ARM-NEXT: mov r4, r1
				; BASE-ARM-NEXT: mov r0, r5
				; BASE-ARM-NEXT: mov r1, r5
				; BASE-ARM-NEXT: blx r4
				; BASE-ARM-NEXT: strh r0, [sp, #6]
				; BASE-ARM-NEXT: uxth r1, r0
				; BASE-ARM-NEXT: mov r0, r5
				; BASE-ARM-NEXT: blx r4
				; BASE-ARM-NEXT: ldrh r0, [sp, #6]
				; BASE-ARM-NEXT: add sp, sp, #8
				; BASE-ARM-NEXT: pop {r4, r5, r11, pc}
				;
				; BASE-THUMB-LABEL: test_fncall_soft:
				; BASE-THUMB: @ %bb.0:
				; BASE-THUMB-NEXT: .save {r4, r5, r7, lr}
				; BASE-THUMB-NEXT: push {r4, r5, r7, lr}
				; BASE-THUMB-NEXT: .pad #8
				; BASE-THUMB-NEXT: sub sp, #8
				; BASE-THUMB-NEXT: uxth r5, r0
				; BASE-THUMB-NEXT: mov r4, r1
				; BASE-THUMB-NEXT: mov r0, r5
				; BASE-THUMB-NEXT: mov r1, r5
				; BASE-THUMB-NEXT: blx r4
				; BASE-THUMB-NEXT: uxth r1, r0
				; BASE-THUMB-NEXT: strh.w r0, [sp, #6]
				; BASE-THUMB-NEXT: mov r0, r5
				; BASE-THUMB-NEXT: blx r4
				; BASE-THUMB-NEXT: ldrh.w r0, [sp, #6]
				; BASE-THUMB-NEXT: add sp, #8
				; BASE-THUMB-NEXT: pop {r4, r5, r7, pc}
				;
				; FULLFP16-ARM-LABEL: test_fncall_soft:
				; FULLFP16-ARM: @ %bb.0:
				; FULLFP16-ARM-NEXT: .save {r4, r5, r11, lr}
				; FULLFP16-ARM-NEXT: push {r4, r5, r11, lr}
				; FULLFP16-ARM-NEXT: .vsave {d8}
				; FULLFP16-ARM-NEXT: vpush {d8}
				; FULLFP16-ARM-NEXT: vmov.f16 s0, r0
				; FULLFP16-ARM-NEXT: mov r4, r1
				; FULLFP16-ARM-NEXT: vmov.f16 r5, s0
				; FULLFP16-ARM-NEXT: mov r0, r5
				; FULLFP16-ARM-NEXT: mov r1, r5
				; FULLFP16-ARM-NEXT: blx r4
				; FULLFP16-ARM-NEXT: vmov.f16 s16, r0
				; FULLFP16-ARM-NEXT: mov r0, r5
				; FULLFP16-ARM-NEXT: vmov.f16 r1, s16
				; FULLFP16-ARM-NEXT: blx r4
				; FULLFP16-ARM-NEXT: vmov r0, s16
				; FULLFP16-ARM-NEXT: vpop {d8}
				; FULLFP16-ARM-NEXT: pop {r4, r5, r11, pc}
				;
				; FULLFP16-THUMB-LABEL: test_fncall_soft:
				; FULLFP16-THUMB: @ %bb.0:
				; FULLFP16-THUMB-NEXT: .save {r4, r5, r7, lr}
				; FULLFP16-THUMB-NEXT: push {r4, r5, r7, lr}
				; FULLFP16-THUMB-NEXT: .vsave {d8}
				; FULLFP16-THUMB-NEXT: vpush {d8}
				; FULLFP16-THUMB-NEXT: vmov.f16 s0, r0
				; FULLFP16-THUMB-NEXT: mov r4, r1
				; FULLFP16-THUMB-NEXT: vmov.f16 r5, s0
				; FULLFP16-THUMB-NEXT: mov r0, r5
				; FULLFP16-THUMB-NEXT: mov r1, r5
				; FULLFP16-THUMB-NEXT: blx r4
				; FULLFP16-THUMB-NEXT: vmov.f16 s16, r0
				; FULLFP16-THUMB-NEXT: mov r0, r5
				; FULLFP16-THUMB-NEXT: vmov.f16 r1, s16
				; FULLFP16-THUMB-NEXT: blx r4
				; FULLFP16-THUMB-NEXT: vmov r0, s16
				; FULLFP16-THUMB-NEXT: vpop {d8}
				; FULLFP16-THUMB-NEXT: pop {r4, r5, r7, pc}
				%call = tail call bfloat %f(bfloat %bf, bfloat %bf)
				%call1 = tail call bfloat %f(bfloat %bf, bfloat %call)
				ret bfloat %call
				}

				define arm_aapcs_vfpcc bfloat @test_fncall_hard(bfloat %bf, bfloat (bfloat, bfloat)* %f) {
				; BASE-ARM-LABEL: test_fncall_hard:
				; BASE-ARM: @ %bb.0:
				; BASE-ARM-NEXT: .save {r4, lr}
				; BASE-ARM-NEXT: push {r4, lr}
				; BASE-ARM-NEXT: .vsave {d8}
				; BASE-ARM-NEXT: vpush {d8}
				; BASE-ARM-NEXT: .pad #8
				; BASE-ARM-NEXT: sub sp, sp, #8
				; BASE-ARM-NEXT: mov r4, r0
				; BASE-ARM-NEXT: vmov r0, s0
				; BASE-ARM-NEXT: uxth r0, r0
				; BASE-ARM-NEXT: vmov s16, r0
				; BASE-ARM-NEXT: vmov.f32 s0, s16
				; BASE-ARM-NEXT: vmov.f32 s1, s16
				; BASE-ARM-NEXT: blx r4
				; BASE-ARM-NEXT: vmov r0, s0
				; BASE-ARM-NEXT: vmov.f32 s0, s16
				; BASE-ARM-NEXT: strh r0, [sp, #6]
				; BASE-ARM-NEXT: uxth r0, r0
				; BASE-ARM-NEXT: vmov s1, r0
				; BASE-ARM-NEXT: blx r4
				; BASE-ARM-NEXT: ldrh r0, [sp, #6]
				; BASE-ARM-NEXT: vmov s0, r0
				; BASE-ARM-NEXT: add sp, sp, #8
				; BASE-ARM-NEXT: vpop {d8}
				; BASE-ARM-NEXT: pop {r4, pc}
				;
				; BASE-THUMB-LABEL: test_fncall_hard:
				; BASE-THUMB: @ %bb.0:
				; BASE-THUMB-NEXT: .save {r4, lr}
				; BASE-THUMB-NEXT: push {r4, lr}
				; BASE-THUMB-NEXT: .vsave {d8}
				; BASE-THUMB-NEXT: vpush {d8}
				; BASE-THUMB-NEXT: .pad #8
				; BASE-THUMB-NEXT: sub sp, #8
				; BASE-THUMB-NEXT: mov r4, r0
				; BASE-THUMB-NEXT: vmov r0, s0
				; BASE-THUMB-NEXT: uxth r0, r0
				; BASE-THUMB-NEXT: vmov s16, r0
				; BASE-THUMB-NEXT: vmov.f32 s0, s16
				; BASE-THUMB-NEXT: vmov.f32 s1, s16
				; BASE-THUMB-NEXT: blx r4
				; BASE-THUMB-NEXT: vmov r0, s0
				; BASE-THUMB-NEXT: vmov.f32 s0, s16
				; BASE-THUMB-NEXT: strh.w r0, [sp, #6]
				; BASE-THUMB-NEXT: uxth r0, r0
				; BASE-THUMB-NEXT: vmov s1, r0
				; BASE-THUMB-NEXT: blx r4
				; BASE-THUMB-NEXT: ldrh.w r0, [sp, #6]
				; BASE-THUMB-NEXT: vmov s0, r0
				; BASE-THUMB-NEXT: add sp, #8
				; BASE-THUMB-NEXT: vpop {d8}
				; BASE-THUMB-NEXT: pop {r4, pc}
				;
				; FULLFP16-LABEL: test_fncall_hard:
				; FULLFP16: @ %bb.0:
				; FULLFP16-NEXT: .save {r4, lr}
				; FULLFP16-NEXT: push {r4, lr}
				; FULLFP16-NEXT: .vsave {d8, d9}
				; FULLFP16-NEXT: vpush {d8, d9}
				; FULLFP16-NEXT: mov r4, r0
				; FULLFP16-NEXT: vmov.f16 r0, s0
				; FULLFP16-NEXT: vmov s16, r0
				; FULLFP16-NEXT: vmov.f32 s0, s16
				; FULLFP16-NEXT: vmov.f32 s1, s16
				; FULLFP16-NEXT: blx r4
				; FULLFP16-NEXT: vmov.f16 r0, s0
				; FULLFP16-NEXT: vmov.f32 s18, s0
				; FULLFP16-NEXT: vmov.f32 s0, s16
				; FULLFP16-NEXT: vmov s1, r0
				; FULLFP16-NEXT: blx r4
				; FULLFP16-NEXT: vmov.f32 s0, s18
				; FULLFP16-NEXT: vpop {d8, d9}
				; FULLFP16-NEXT: pop {r4, pc}
				%call = tail call arm_aapcs_vfpcc bfloat %f(bfloat %bf, bfloat %bf)
				%call1 = tail call arm_aapcs_vfpcc bfloat %f(bfloat %bf, bfloat %call)
				ret bfloat %call
				}