This is an archive of the discontinued LLVM Phabricator instance.

Differential D114628

[AArch64][SVE] Duplicate FP_EXTEND/FP_TRUNC -> LOAD/STORE dag combines
ClosedPublic

Authored by bsmith on Nov 26 2021, 4:07 AM.

Details

Reviewers

paulwalker-arm
peterwaller-arm
sdesmalen
efriedma

Commits

rGfd9069ffce2d: [AArch64][SVE] Duplicate FP_EXTEND/FP_TRUNC -> LOAD/STORE dag combines

Summary

By duplicating these DAG combines we can bypass the legality checks that
they perform. This allows us to run the combines on larger-than-legal
fixed types, which in turn brings the benefits of D114580 to those
types.

Depends on D114580
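
As a rough illustration of the summary above, the toy model below contrasts a combine gated on a legality query with a duplicated combine that skips the query. Every type and function in it (`ToyVT`, `ToyTarget`, `isExtLoadLegal`, `genericCombineFires`, `duplicatedCombineFires`) is a hypothetical stand-in rather than an LLVM API, and the 256-bit register width is an assumption borrowed from the VBITS_EQ_256 test configuration.

```cpp
#include <cassert>

// Hypothetical stand-in for a fixed-length vector type; not llvm::EVT.
struct ToyVT {
  unsigned NumElems;
  unsigned ElemBits;
  unsigned sizeInBits() const { return NumElems * ElemBits; }
};

// Hypothetical stand-in for the target; not AArch64Subtarget.
struct ToyTarget {
  unsigned SVEVectorBits;    // assumed register width, e.g. 256
  bool UseSVEForFixedLength; // models useSVEForFixedLengthVectors()

  // Assumed legality rule for this sketch: a result type counts as
  // "legal" only if it fits in a single register.
  bool isExtLoadLegal(ToyVT ResultVT) const {
    return UseSVEForFixedLength && ResultVT.sizeInBits() <= SVEVectorBits;
  }
};

// Models a generic combine: it fires only when the legality query succeeds.
inline bool genericCombineFires(const ToyTarget &T, ToyVT ResultVT) {
  return T.isExtLoadLegal(ResultVT);
}

// Models the duplicated target combine: it skips the legality query and
// relies on later legalisation splitting the wide node into legal pieces.
inline bool duplicatedCombineFires(const ToyTarget &T, ToyVT /*ResultVT*/) {
  return T.UseSVEForFixedLength;
}

// Self-check: a 512-bit result is rejected by the gated combine but
// accepted by the duplicated one.
inline void checkLegalityBypass() {
  const ToyTarget T{256, true};
  const ToyVT V8F32{8, 32};   // 256 bits, fits one register
  const ToyVT V16F32{16, 32}; // 512 bits, larger than legal
  assert(genericCombineFires(T, V8F32));
  assert(!genericCombineFires(T, V16F32));
  assert(duplicatedCombineFires(T, V16F32));
}
```

The only point of the model is the difference between the two predicates; the real combines also inspect the node's opcode, use count and indexing mode, as the diff shows.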

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bsmith created this revision. Nov 26 2021, 4:07 AM

Herald added a reviewer: efriedma. Nov 26 2021, 4:07 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

bsmith requested review of this revision. Nov 26 2021, 4:07 AM

Herald added a project: Restricted Project. Nov 26 2021, 4:07 AM

Herald added a subscriber: llvm-commits.

bsmith mentioned this in D110237: [AArch64][SVE] Add DAG combines to improve SVE fixed type FP_EXTEND lowering. Nov 26 2021, 4:10 AM

bsmith mentioned this in D110531: [AArch64][SVE] Perform FP_EXTEND combine on legal types to fold extend into load.

Harbormaster completed remote builds in B136196: Diff 389994. Nov 26 2021, 4:48 AM

Matt added a subscriber: Matt. Nov 26 2021, 5:52 AM

I suppose an alternate solution is a new TLI hook that is less reliant on the legalisation tables. However, if we're the only backend with this issue I guess copying a couple of DAG combines is not so bad.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15952–15953	Although true, here it's really type legalisation you don't want to be affected by, and this cannot ask the usual question (because we only say we support custom lowering for the legal types). However, as is, the combine could result in an unselectable DAG, so I think you still need a `DCI.isBeforeLegalizeOps()` check somewhere.
17313–17314	Same `DCI.isBeforeLegalizeOps()` comment as above.
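
A minimal sketch of the guard requested in these inline comments, written as a toy model so that it compiles on its own. `ToyCombineLevel`, `ToyCombinerInfo`, `ToyStore` and `shouldFoldToTruncStore` are hypothetical stand-ins; the phase names are assumed to mirror LLVM's combine levels, and where the check sits in the committed code is not shown in Diff 389994, so the placement here is only one plausible option.

```cpp
#include <cassert>

// Hypothetical stand-in for the combiner phases. The names are assumed to
// mirror LLVM's combine levels; this is not the LLVM enum itself.
enum class ToyCombineLevel {
  BeforeLegalizeTypes,
  AfterLegalizeTypes,
  AfterLegalizeVectorOps,
  AfterLegalizeDAG
};

// Hypothetical stand-in for TargetLowering::DAGCombinerInfo.
struct ToyCombinerInfo {
  ToyCombineLevel Level;
  // True until operation legalisation has started.
  bool isBeforeLegalizeOps() const {
    return Level < ToyCombineLevel::AfterLegalizeVectorOps;
  }
};

// The properties of the store that the duplicated combine inspects.
struct ToyStore {
  bool ValueIsFPRound;      // stored value is produced by an FP_ROUND
  bool ValueHasOneUse;      // the FP_ROUND has no other users
  bool IsUnindexed;         // store uses no pre/post-indexed addressing
  bool IsFixedLengthVector; // stored value is a fixed-length vector
};

// Returns true when the FP_ROUND + store pair may be folded into a
// truncating store. The phase check comes first, so that no new
// truncating store is created once operations have been legalised.
inline bool shouldFoldToTruncStore(const ToyCombinerInfo &DCI,
                                   const ToyStore &ST,
                                   bool UseSVEForFixedLength) {
  if (!DCI.isBeforeLegalizeOps())
    return false;
  return ST.ValueIsFPRound && ST.ValueHasOneUse && ST.IsUnindexed &&
         UseSVEForFixedLength && ST.IsFixedLengthVector;
}

// Self-check: the same store folds early in the pipeline but not late.
inline void checkPhaseGuard() {
  const ToyStore ST{true, true, true, true};
  assert(shouldFoldToTruncStore({ToyCombineLevel::BeforeLegalizeTypes}, ST, true));
  assert(!shouldFoldToTruncStore({ToyCombineLevel::AfterLegalizeDAG}, ST, true));
}
```

The same early return would apply to the FP_EXTEND combine flagged at 17313–17314.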

I'm not really happy that we're using extload/truncstore with floating-point types here. It's an obscure, mostly unused feature, intended to support certain ancient floating-point instruction sets that would implicitly extend floating-point loads to the natural register size (think x87, m68k). I'm concerned this is throwing us into mostly untested codepaths; in particular, I don't think any other target uses these operations with vector floating-point types or 16-bit floating-point types.

It's not clear to me why the special case is being implemented this way, anyway; if we want to combine this sort of load+shuffle or shuffle+store pattern into two loads/stores, we should just do that. It doesn't need to be specific to fcvt.

Only perform duplicated dag combines before legalization

Harbormaster completed remote builds in B136659: Diff 390660. Nov 30 2021, 4:30 AM

I'm accepting this patch on the grounds it doesn't newly enable the use of fp extload/truncstore nodes, it is just extending their use for the cases not already handled by D114580.

With regards to using fp extload/truncstore nodes: we experimented with several solutions and this ended up being the cleanest of the approaches. The other solutions either required fragile target-specific DAG combines or the introduction of new common ISD nodes (e.g. FPEXTEND_INREG) that require significant plumbing, which seemed overkill for our single use case. There is nothing in the code (documentation, asserts, etc.) suggesting any kind of restriction, and we added tests to verify they work as required. If something does arise we'll just fix it like any other bug. The SVE VLS support is somewhat special anyway, given almost all nodes are not considered legal and require custom lowering, so in this regard there's no real difference between an extending int load and an extending fp load.

This revision is now accepted and ready to land. Nov 30 2021, 8:52 AM

Okay.

Any thoughts on adding load+shuffle/shuffle+store combines?

Closed by commit rGfd9069ffce2d: [AArch64][SVE] Duplicate FP_EXTEND/FP_TRUNC -> LOAD/STORE dag combines (authored by bsmith). Dec 1 2021, 7:34 AM

This revision was automatically updated to reflect the committed changes.

bsmith added a commit: rGfd9069ffce2d: [AArch64][SVE] Duplicate FP_EXTEND/FP_TRUNC -> LOAD/STORE dag combines.

I guess I'm not sure what you're asking here. In the context of these floating point loads and stores there is no load/shuffle pattern, or at least not one that DAGCombine sees. Essentially before operation legalisation the DAG looks nice. The problem is introduced during operation legalisation and thus the first opportunity DAGCombine has to improve things is after everything has been converted to the predicated target specific nodes along with the (un)packs and subvector operations that marshal the data between fixed and scalable domains. Spotting these sequences and optimising them proved messy.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	49 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-fp-extend-trunc.ll	52 lines

Diff 389994

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Show First 20 Lines • Show All 923 Lines • ▼ Show 20 Lines	#undef LCALLNAME5

setTargetDAGCombine(ISD::INTRINSIC_VOID);		setTargetDAGCombine(ISD::INTRINSIC_VOID);
setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);		setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);		setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);		setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
setTargetDAGCombine(ISD::VECREDUCE_ADD);		setTargetDAGCombine(ISD::VECREDUCE_ADD);
setTargetDAGCombine(ISD::STEP_VECTOR);		setTargetDAGCombine(ISD::STEP_VECTOR);

		setTargetDAGCombine(ISD::FP_EXTEND);

setTargetDAGCombine(ISD::GlobalAddress);		setTargetDAGCombine(ISD::GlobalAddress);

// In case of strict alignment, avoid an excessive number of byte wide stores.		// In case of strict alignment, avoid an excessive number of byte wide stores.
MaxStoresPerMemsetOptSize = 8;		MaxStoresPerMemsetOptSize = 8;
MaxStoresPerMemset = Subtarget->requiresStrictAlign()		MaxStoresPerMemset = Subtarget->requiresStrictAlign()
? MaxStoresPerMemsetOptSize : 32;		? MaxStoresPerMemsetOptSize : 32;

MaxGluedStoresPerMemcpy = 4;		MaxGluedStoresPerMemcpy = 4;
▲ Show 20 Lines • Show All 14,995 Lines • ▼ Show 20 Lines	static SDValue foldTruncStoreOfExt(SelectionDAG &DAG, SDNode *N) {

return SDValue();		return SDValue();
}		}

static SDValue performSTORECombine(SDNode *N,		static SDValue performSTORECombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
SelectionDAG &DAG,		SelectionDAG &DAG,
const AArch64Subtarget *Subtarget) {		const AArch64Subtarget *Subtarget) {
		StoreSDNode *ST = cast<StoreSDNode>(N);
		SDValue Chain = ST->getChain();
		SDValue Value = ST->getValue();
		SDValue Ptr = ST->getBasePtr();

		// If this is an FP_ROUND followed by a store, fold this into a truncating
		// store. We can do this even if this is already a truncstore.
		// We purposefully don't care about legality of the nodes here as we know
		// they can be split down into something legal.
		paulwalker-arm (inline comment): Although true, here it's really type legalisation you don't want to be affected by, and this cannot ask the usual question (because we only say we support custom lowering for the legal types). However, as is, the combine could result in an unselectable DAG, so I think you still need a `DCI.isBeforeLegalizeOps()` check somewhere.
		if (Value.getOpcode() == ISD::FP_ROUND && Value.getNode()->hasOneUse() &&
		ST->isUnindexed() && Subtarget->useSVEForFixedLengthVectors() &&
		Value.getValueType().isFixedLengthVector())
		return DAG.getTruncStore(Chain, SDLoc(N), Value.getOperand(0), Ptr,
		ST->getMemoryVT(), ST->getMemOperand());

if (SDValue Split = splitStores(N, DCI, DAG, Subtarget))		if (SDValue Split = splitStores(N, DCI, DAG, Subtarget))
return Split;		return Split;

if (Subtarget->supportsAddressTopByteIgnored() &&		if (Subtarget->supportsAddressTopByteIgnored() &&
performTBISimplification(N->getOperand(2), DCI, DAG))		performTBISimplification(N->getOperand(2), DCI, DAG))
return SDValue(N, 0);		return SDValue(N, 0);

if (SDValue Store = foldTruncStoreOfExt(DAG, N))		if (SDValue Store = foldTruncStoreOfExt(DAG, N))
▲ Show 20 Lines • Show All 1,326 Lines • ▼ Show 20 Lines	SDValue performSVESpliceCombine(SDNode *N, SelectionDAG &DAG) {
SDValue RHS = DAG.getAnyExtOrTrunc(DAG.getBitcast(IntTy, N->getOperand(1)),		SDValue RHS = DAG.getAnyExtOrTrunc(DAG.getBitcast(IntTy, N->getOperand(1)),
DL, ExtIntTy);		DL, ExtIntTy);
SDValue Idx = N->getOperand(2);		SDValue Idx = N->getOperand(2);
SDValue Splice = DAG.getNode(ISD::VECTOR_SPLICE, DL, ExtIntTy, LHS, RHS, Idx);		SDValue Splice = DAG.getNode(ISD::VECTOR_SPLICE, DL, ExtIntTy, LHS, RHS, Idx);
SDValue Trunc = DAG.getAnyExtOrTrunc(Splice, DL, IntTy);		SDValue Trunc = DAG.getAnyExtOrTrunc(Splice, DL, IntTy);
return DAG.getBitcast(Ty, Trunc);		return DAG.getBitcast(Ty, Trunc);
}		}

		SDValue performFPExtendCombine(SDNode *N, SelectionDAG &DAG,
		TargetLowering::DAGCombinerInfo &DCI,
		const AArch64Subtarget *Subtarget) {
		SDValue N0 = N->getOperand(0);
		EVT VT = N->getValueType(0);

		// If this is fp_round(fpextend), don't fold it, allow ourselves to be folded.
		if (N->hasOneUse() && N->use_begin()->getOpcode() == ISD::FP_ROUND)
		return SDValue();

		// fold (fpext (load x)) -> (fpext (fptrunc (extload x)))
		// We purposefully don't care about legality of the nodes here as we know
		// they can be split down into something legal.
		paulwalker-arm (inline comment): Same `DCI.isBeforeLegalizeOps()` comment as above.
		if (ISD::isNormalLoad(N0.getNode()) && N0.hasOneUse() &&
		Subtarget->useSVEForFixedLengthVectors() && VT.isFixedLengthVector()) {
		LoadSDNode *LN0 = cast<LoadSDNode>(N0);
		SDValue ExtLoad = DAG.getExtLoad(ISD::EXTLOAD, SDLoc(N), VT,
		LN0->getChain(), LN0->getBasePtr(),
		N0.getValueType(), LN0->getMemOperand());
		DCI.CombineTo(N, ExtLoad);
		DCI.CombineTo(N0.getNode(),
		DAG.getNode(ISD::FP_ROUND, SDLoc(N0), N0.getValueType(),
		ExtLoad, DAG.getIntPtrConstant(1, SDLoc(N0))),
		ExtLoad.getValue(1));
		return SDValue(N, 0); // Return N so it doesn't get rechecked!
		}

		return SDValue();
		}

SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,		SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
DAGCombinerInfo &DCI) const {		DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
switch (N->getOpcode()) {		switch (N->getOpcode()) {
default:		default:
LLVM_DEBUG(dbgs() << "Custom combining: skipping\n");		LLVM_DEBUG(dbgs() << "Custom combining: skipping\n");
break;		break;
case ISD::ADD:		case ISD::ADD:
Show All 40 Lines	SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
case ISD::LOAD:		case ISD::LOAD:
if (performTBISimplification(N->getOperand(1), DCI, DAG))		if (performTBISimplification(N->getOperand(1), DCI, DAG))
return SDValue(N, 0);		return SDValue(N, 0);
break;		break;
case ISD::STORE:		case ISD::STORE:
return performSTORECombine(N, DCI, DAG, Subtarget);		return performSTORECombine(N, DCI, DAG, Subtarget);
case ISD::VECTOR_SPLICE:		case ISD::VECTOR_SPLICE:
return performSVESpliceCombine(N, DAG);		return performSVESpliceCombine(N, DAG);
		case ISD::FP_EXTEND:
		return performFPExtendCombine(N, DAG, DCI, Subtarget);
case AArch64ISD::BRCOND:		case AArch64ISD::BRCOND:
return performBRCONDCombine(N, DCI, DAG);		return performBRCONDCombine(N, DCI, DAG);
case AArch64ISD::TBNZ:		case AArch64ISD::TBNZ:
case AArch64ISD::TBZ:		case AArch64ISD::TBZ:
return performTBZCombine(N, DCI, DAG);		return performTBZCombine(N, DCI, DAG);
case AArch64ISD::CSEL:		case AArch64ISD::CSEL:
return performCSELCombine(N, DCI, DAG);		return performCSELCombine(N, DCI, DAG);
case AArch64ISD::DUP:		case AArch64ISD::DUP:

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-extend-trunc.ll

Show First 20 Lines • Show All 57 Lines • ▼ Show 20 Lines	; CHECK-NEXT: ret
store <8 x float> %res, <8 x float>* %b		store <8 x float> %res, <8 x float>* %b
ret void		ret void
}		}

define void @fcvt_v16f16_v16f32(<16 x half>* %a, <16 x float>* %b) #0 {		define void @fcvt_v16f16_v16f32(<16 x half>* %a, <16 x float>* %b) #0 {
; Ensure sensible type legalisation.		; Ensure sensible type legalisation.
; VBITS_EQ_256-LABEL: fcvt_v16f16_v16f32:		; VBITS_EQ_256-LABEL: fcvt_v16f16_v16f32:
; VBITS_EQ_256: // %bb.0:		; VBITS_EQ_256: // %bb.0:
; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
; VBITS_EQ_256-NEXT: mov x8, #8		; VBITS_EQ_256-NEXT: mov x8, #8
; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0]
; VBITS_EQ_256-NEXT: ptrue p0.s, vl8		; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
; VBITS_EQ_256-NEXT: uunpklo z1.s, z0.h		; VBITS_EQ_256-NEXT: ld1sh { z0.s }, p0/z, [x0, x8, lsl #1]
; VBITS_EQ_256-NEXT: ext z0.b, z0.b, z0.b, #16		; VBITS_EQ_256-NEXT: ld1sh { z1.s }, p0/z, [x0]
; VBITS_EQ_256-NEXT: uunpklo z0.s, z0.h
; VBITS_EQ_256-NEXT: fcvt z1.s, p0/m, z1.h
; VBITS_EQ_256-NEXT: fcvt z0.s, p0/m, z0.h		; VBITS_EQ_256-NEXT: fcvt z0.s, p0/m, z0.h
; VBITS_EQ_256-NEXT: st1w { z1.s }, p0, [x1]		; VBITS_EQ_256-NEXT: fcvt z1.s, p0/m, z1.h
; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]		; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
		; VBITS_EQ_256-NEXT: st1w { z1.s }, p0, [x1]
; VBITS_EQ_256-NEXT: ret		; VBITS_EQ_256-NEXT: ret
;		;
; VBITS_GE_512-LABEL: fcvt_v16f16_v16f32:		; VBITS_GE_512-LABEL: fcvt_v16f16_v16f32:
; VBITS_GE_512: // %bb.0:		; VBITS_GE_512: // %bb.0:
; VBITS_GE_512-NEXT: ptrue p0.s, vl16		; VBITS_GE_512-NEXT: ptrue p0.s, vl16
; VBITS_GE_512-NEXT: ld1sh { z0.s }, p0/z, [x0]		; VBITS_GE_512-NEXT: ld1sh { z0.s }, p0/z, [x0]
; VBITS_GE_512-NEXT: fcvt z0.s, p0/m, z0.h		; VBITS_GE_512-NEXT: fcvt z0.s, p0/m, z0.h
; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x1]		; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x1]
▲ Show 20 Lines • Show All 74 Lines • ▼ Show 20 Lines	; CHECK-NEXT: ret
%res = fpext <4 x half> %op1 to <4 x double>		%res = fpext <4 x half> %op1 to <4 x double>
store <4 x double> %res, <4 x double>* %b		store <4 x double> %res, <4 x double>* %b
ret void		ret void
}		}

define void @fcvt_v8f16_v8f64(<8 x half>* %a, <8 x double>* %b) #0 {		define void @fcvt_v8f16_v8f64(<8 x half>* %a, <8 x double>* %b) #0 {
; VBITS_EQ_256-LABEL: fcvt_v8f16_v8f64:		; VBITS_EQ_256-LABEL: fcvt_v8f16_v8f64:
; VBITS_EQ_256: // %bb.0:		; VBITS_EQ_256: // %bb.0:
; VBITS_EQ_256-NEXT: ldr q0, [x0]
; VBITS_EQ_256-NEXT: mov x8, #4		; VBITS_EQ_256-NEXT: mov x8, #4
; VBITS_EQ_256-NEXT: ptrue p0.d, vl4		; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
; VBITS_EQ_256-NEXT: ext v1.16b, v0.16b, v0.16b, #8		; VBITS_EQ_256-NEXT: ld1sh { z0.d }, p0/z, [x0, x8, lsl #1]
; VBITS_EQ_256-NEXT: uunpklo z0.s, z0.h		; VBITS_EQ_256-NEXT: ld1sh { z1.d }, p0/z, [x0]
; VBITS_EQ_256-NEXT: uunpklo z0.d, z0.s
; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.h		; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.h
; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1]
; VBITS_EQ_256-NEXT: uunpklo z1.s, z1.h
; VBITS_EQ_256-NEXT: uunpklo z1.d, z1.s
; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.h		; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.h
; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]		; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
		; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1]
; VBITS_EQ_256-NEXT: ret		; VBITS_EQ_256-NEXT: ret
;		;
; VBITS_GE_512-LABEL: fcvt_v8f16_v8f64:		; VBITS_GE_512-LABEL: fcvt_v8f16_v8f64:
; VBITS_GE_512: // %bb.0:		; VBITS_GE_512: // %bb.0:
; VBITS_GE_512-NEXT: ptrue p0.d, vl8		; VBITS_GE_512-NEXT: ptrue p0.d, vl8
; VBITS_GE_512-NEXT: ld1sh { z0.d }, p0/z, [x0]		; VBITS_GE_512-NEXT: ld1sh { z0.d }, p0/z, [x0]
; VBITS_GE_512-NEXT: fcvt z0.d, p0/m, z0.h		; VBITS_GE_512-NEXT: fcvt z0.d, p0/m, z0.h
; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x1]		; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x1]
▲ Show 20 Lines • Show All 71 Lines • ▼ Show 20 Lines	; CHECK-NEXT: ret
store <4 x double> %res, <4 x double>* %b		store <4 x double> %res, <4 x double>* %b
ret void		ret void
}		}

define void @fcvt_v8f32_v8f64(<8 x float>* %a, <8 x double>* %b) #0 {		define void @fcvt_v8f32_v8f64(<8 x float>* %a, <8 x double>* %b) #0 {
; Ensure sensible type legalisation.		; Ensure sensible type legalisation.
; VBITS_EQ_256-LABEL: fcvt_v8f32_v8f64:		; VBITS_EQ_256-LABEL: fcvt_v8f32_v8f64:
; VBITS_EQ_256: // %bb.0:		; VBITS_EQ_256: // %bb.0:
; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
; VBITS_EQ_256-NEXT: mov x8, #4		; VBITS_EQ_256-NEXT: mov x8, #4
; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_EQ_256-NEXT: ptrue p0.d, vl4		; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
; VBITS_EQ_256-NEXT: uunpklo z1.d, z0.s		; VBITS_EQ_256-NEXT: ld1sw { z0.d }, p0/z, [x0, x8, lsl #2]
; VBITS_EQ_256-NEXT: ext z0.b, z0.b, z0.b, #16		; VBITS_EQ_256-NEXT: ld1sw { z1.d }, p0/z, [x0]
; VBITS_EQ_256-NEXT: uunpklo z0.d, z0.s
; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.s
; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.s		; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.s
; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1]		; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.s
; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]		; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
		; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1]
; VBITS_EQ_256-NEXT: ret		; VBITS_EQ_256-NEXT: ret
;		;
; VBITS_GE_512-LABEL: fcvt_v8f32_v8f64:		; VBITS_GE_512-LABEL: fcvt_v8f32_v8f64:
; VBITS_GE_512: // %bb.0:		; VBITS_GE_512: // %bb.0:
; VBITS_GE_512-NEXT: ptrue p0.d, vl8		; VBITS_GE_512-NEXT: ptrue p0.d, vl8
; VBITS_GE_512-NEXT: ld1sw { z0.d }, p0/z, [x0]		; VBITS_GE_512-NEXT: ld1sw { z0.d }, p0/z, [x0]
; VBITS_GE_512-NEXT: fcvt z0.d, p0/m, z0.s		; VBITS_GE_512-NEXT: fcvt z0.d, p0/m, z0.s
; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x1]		; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x1]
▲ Show 20 Lines • Show All 76 Lines • ▼ Show 20 Lines
define void @fcvt_v16f32_v16f16(<16 x float>* %a, <16 x half>* %b) #0 {		define void @fcvt_v16f32_v16f16(<16 x float>* %a, <16 x half>* %b) #0 {
; Ensure sensible type legalisation		; Ensure sensible type legalisation
; VBITS_EQ_256-LABEL: fcvt_v16f32_v16f16:		; VBITS_EQ_256-LABEL: fcvt_v16f32_v16f16:
; VBITS_EQ_256: // %bb.0:		; VBITS_EQ_256: // %bb.0:
; VBITS_EQ_256-NEXT: mov x8, #8		; VBITS_EQ_256-NEXT: mov x8, #8
; VBITS_EQ_256-NEXT: ptrue p0.s, vl8		; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]		; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
; VBITS_EQ_256-NEXT: ld1w { z1.s }, p0/z, [x0]		; VBITS_EQ_256-NEXT: ld1w { z1.s }, p0/z, [x0]
; VBITS_EQ_256-NEXT: ptrue p0.s
; VBITS_EQ_256-NEXT: fcvt z0.h, p0/m, z0.s		; VBITS_EQ_256-NEXT: fcvt z0.h, p0/m, z0.s
; VBITS_EQ_256-NEXT: fcvt z1.h, p0/m, z1.s		; VBITS_EQ_256-NEXT: fcvt z1.h, p0/m, z1.s
; VBITS_EQ_256-NEXT: uzp1 z0.h, z0.h, z0.h		; VBITS_EQ_256-NEXT: st1h { z0.s }, p0, [x1, x8, lsl #1]
; VBITS_EQ_256-NEXT: uzp1 z1.h, z1.h, z1.h		; VBITS_EQ_256-NEXT: st1h { z1.s }, p0, [x1]
; VBITS_EQ_256-NEXT: ptrue p0.h, vl8
; VBITS_EQ_256-NEXT: splice z1.h, p0, z1.h, z0.h
; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x1]
; VBITS_EQ_256-NEXT: ret		; VBITS_EQ_256-NEXT: ret
;		;
; VBITS_GE_512-LABEL: fcvt_v16f32_v16f16:		; VBITS_GE_512-LABEL: fcvt_v16f32_v16f16:
; VBITS_GE_512: // %bb.0:		; VBITS_GE_512: // %bb.0:
; VBITS_GE_512-NEXT: ptrue p0.s, vl16		; VBITS_GE_512-NEXT: ptrue p0.s, vl16
; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]		; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
; VBITS_GE_512-NEXT: fcvt z0.h, p0/m, z0.s		; VBITS_GE_512-NEXT: fcvt z0.h, p0/m, z0.s
; VBITS_GE_512-NEXT: st1h { z0.s }, p0, [x1]		; VBITS_GE_512-NEXT: st1h { z0.s }, p0, [x1]
▲ Show 20 Lines • Show All 182 Lines • ▼ Show 20 Lines
define void @fcvt_v8f64_v8f32(<8 x double>* %a, <8 x float>* %b) #0 {		define void @fcvt_v8f64_v8f32(<8 x double>* %a, <8 x float>* %b) #0 {
; Ensure sensible type legalisation		; Ensure sensible type legalisation
; VBITS_EQ_256-LABEL: fcvt_v8f64_v8f32:		; VBITS_EQ_256-LABEL: fcvt_v8f64_v8f32:
; VBITS_EQ_256: // %bb.0:		; VBITS_EQ_256: // %bb.0:
; VBITS_EQ_256-NEXT: mov x8, #4		; VBITS_EQ_256-NEXT: mov x8, #4
; VBITS_EQ_256-NEXT: ptrue p0.d, vl4		; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
; VBITS_EQ_256-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]		; VBITS_EQ_256-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
; VBITS_EQ_256-NEXT: ld1d { z1.d }, p0/z, [x0]		; VBITS_EQ_256-NEXT: ld1d { z1.d }, p0/z, [x0]
; VBITS_EQ_256-NEXT: ptrue p0.d
; VBITS_EQ_256-NEXT: fcvt z0.s, p0/m, z0.d		; VBITS_EQ_256-NEXT: fcvt z0.s, p0/m, z0.d
; VBITS_EQ_256-NEXT: fcvt z1.s, p0/m, z1.d		; VBITS_EQ_256-NEXT: fcvt z1.s, p0/m, z1.d
; VBITS_EQ_256-NEXT: uzp1 z0.s, z0.s, z0.s		; VBITS_EQ_256-NEXT: st1w { z0.d }, p0, [x1, x8, lsl #2]
; VBITS_EQ_256-NEXT: uzp1 z1.s, z1.s, z1.s		; VBITS_EQ_256-NEXT: st1w { z1.d }, p0, [x1]
; VBITS_EQ_256-NEXT: ptrue p0.s, vl4
; VBITS_EQ_256-NEXT: splice z1.s, p0, z1.s, z0.s
; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
; VBITS_EQ_256-NEXT: st1w { z1.s }, p0, [x1]
; VBITS_EQ_256-NEXT: ret		; VBITS_EQ_256-NEXT: ret
;		;
; VBITS_GE_512-LABEL: fcvt_v8f64_v8f32:		; VBITS_GE_512-LABEL: fcvt_v8f64_v8f32:
; VBITS_GE_512: // %bb.0:		; VBITS_GE_512: // %bb.0:
; VBITS_GE_512-NEXT: ptrue p0.d, vl8		; VBITS_GE_512-NEXT: ptrue p0.d, vl8
; VBITS_GE_512-NEXT: ld1d { z0.d }, p0/z, [x0]		; VBITS_GE_512-NEXT: ld1d { z0.d }, p0/z, [x0]
; VBITS_GE_512-NEXT: fcvt z0.s, p0/m, z0.d		; VBITS_GE_512-NEXT: fcvt z0.s, p0/m, z0.d
; VBITS_GE_512-NEXT: st1w { z0.d }, p0, [x1]		; VBITS_GE_512-NEXT: st1w { z0.d }, p0, [x1]
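
The VBITS_EQ_256 expectations above split each access into two halves and address the upper half through `x8` scaled by the element size. The sketch below reproduces that arithmetic so the `mov x8, #N` and `lsl #K` operands can be checked by hand. `UpperHalfAccess` and `upperHalfAccess` are hypothetical helper names, and the model assumes the vector splits into exactly two register-sized parts.

```cpp
#include <cassert>

// Where the upper half of a two-way split access starts.
struct UpperHalfAccess {
  unsigned ElemIndex;       // first element of the upper half (the x8 value)
  unsigned LoadByteOffset;  // ElemIndex scaled by the source element size
  unsigned StoreByteOffset; // ElemIndex scaled by the destination element size
};

// Hypothetical helper: computes the upper-half offsets for a conversion of
// NumElems elements from SrcElemBits to DstElemBits on VectorBits-wide
// registers. The wider of the two element types decides how many elements
// fit in one register.
inline UpperHalfAccess upperHalfAccess(unsigned NumElems, unsigned SrcElemBits,
                                       unsigned DstElemBits,
                                       unsigned VectorBits) {
  const unsigned WideBits =
      SrcElemBits > DstElemBits ? SrcElemBits : DstElemBits;
  const unsigned ElemsPerReg = VectorBits / WideBits;
  assert(NumElems == 2 * ElemsPerReg && "sketch only models a two-way split");
  return {ElemsPerReg, ElemsPerReg * (SrcElemBits / 8),
          ElemsPerReg * (DstElemBits / 8)};
}
```

For `fcvt_v16f16_v16f32` this gives element index 8, a load offset of 16 bytes (`[x0, x8, lsl #1]`) and a store offset of 32 bytes (`[x1, x8, lsl #2]`), which agrees with the check lines.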