This is an archive of the discontinued LLVM Phabricator instance.

Differential D110237

[AArch64][SVE] Add DAG combines to improve SVE fixed type FP_EXTEND lowering
Abandoned, Public

Authored by bsmith on Sep 22 2021, 5:57 AM.

Details

Reviewers

paulwalker-arm
peterwaller-arm
sdesmalen
efriedma

Summary

Currently in some cases FP_EXTEND lowering for SVE fixed types can
introduce EXTRACT_SUBVECTOR nodes between purely fixed types. Lowering
these is less than ideal as it may involve going through memory.

Instead, we can push an extra extend into a load feeding the FP_EXTEND,
so as to ensure the load gets split into the same number of components
as the FP_EXTEND. The extra truncate required to do this can then be
combined with an extend that is introduced during FP_EXTEND lowering.

The net result of this is that the extends produced by the FP_EXTEND
lowering are folded into an extending load, and the requirement for
EXTRACT_SUBVECTOR nodes is removed.
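
To make the splitting mismatch described above concrete, here is a rough C++ sketch (not LLVM code) that counts how many registers a fixed-length vector needs. `numParts` is a hypothetical helper introduced only for illustration. Real type legalisation splits by repeated halving rather than by a ceiling division, so this is only an approximation that holds for power-of-two sizes.

```cpp
#include <cassert>

// Hypothetical helper, not an LLVM API: the number of RegBits-wide registers
// needed to hold NumElts elements of EltBits bits each.
unsigned numParts(unsigned NumElts, unsigned EltBits, unsigned RegBits) {
  assert(RegBits != 0 && "register width must be non-zero");
  return (NumElts * EltBits + RegBits - 1) / RegBits;
}
```

With 256-bit registers, `numParts(16, 16, 256)` is 1 for a `<16 x half>` load, while `numParts(16, 32, 256)` is 2 for the `<16 x float>` result of the FP_EXTEND. Loading the same data as `<16 x i32>` gives the load the same part count as the result, which is the property the summary relies on.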

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bsmith created this revision. Sep 22 2021, 5:57 AM

Herald added a reviewer: efriedma. Sep 22 2021, 5:57 AM

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others.

bsmith requested review of this revision. Sep 22 2021, 5:57 AM

Herald added a project: Restricted Project. Sep 22 2021, 5:57 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B125095: Diff 374195. Sep 22 2021, 6:19 AM

Fix mistake in the bail-out condition, where the logic was reversed.

Harbormaster completed remote builds in B125138: Diff 374243. Sep 22 2021, 8:51 AM

sdesmalen added inline comments. Sep 23 2021, 3:18 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265	`DAGCombiner::visitFP_EXTEND` tries to do something similar, but unfortunately it is a bit too conservative. Look for: // fold (fpext (load x)) -> (fpext (fptrunc (extload x))) It seems logical to reuse that code, rather than reimplementing the same thing here. Perhaps you can use something like `isVectorLoadExtDesirable` instead of `isLoadExtLegal` to determine whether to fold this.

bsmith added inline comments. Sep 23 2021, 4:18 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265	There's actually a subtle difference here: that existing DAG combine is ultimately trying to fold the fpext into the load, creating an FP extending load, which is not something we can do. The DAG combine below is not trying to remove the fpext, but rather to ensure the load is of the same size as the fpext so that it gets split in the same way. I'm also not sure it makes sense for this to be in the generic DAG combine. The only reason this is sensible to do is that our lowering of FP_EXTEND introduces an integer extend, which can then be folded into the integer truncate this DAG combine introduces. In the absence of that fact, this DAG combine would likely just make things worse. It's also only really required because we can't sensibly lower extract_subvector.
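
Why the extra truncate is harmless can be sketched outside LLVM with plain integer lanes. The model below is an assumption-laden illustration, not the patch's code: `zextLoad` and `truncateLanes` are hypothetical names standing in for the `ZEXTLOAD` and `TRUNCATE` nodes, and 16-bit integers stand in for the raw bits of `half` elements.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Model of a zero-extending load: each 16-bit element read from memory ends
// up in the low half of a 32-bit lane, with the high half cleared.
std::vector<uint32_t> zextLoad(const std::vector<uint16_t> &Mem) {
  std::vector<uint32_t> Wide;
  Wide.reserve(Mem.size());
  for (uint16_t Elt : Mem)
    Wide.push_back(Elt);
  return Wide;
}

// Model of an integer truncate: keep only the low 16 bits of each lane.
std::vector<uint16_t> truncateLanes(const std::vector<uint32_t> &Wide) {
  std::vector<uint16_t> Narrow;
  Narrow.reserve(Wide.size());
  for (uint32_t Lane : Wide)
    Narrow.push_back(static_cast<uint16_t>(Lane));
  return Narrow;
}
```

Truncating the result of the zero-extending load returns exactly the bits that were in memory, so the value reaching the fpext is unchanged. Only the width of the load, and therefore how it is split, differs.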

Matt added a subscriber: Matt. Sep 24 2021, 9:59 AM

bsmith added a child revision: D110531: [AArch64][SVE] Perform FP_EXTEND combine on legal types to fold extend into load. Sep 27 2021, 4:15 AM

sdesmalen added inline comments. Sep 29 2021, 7:16 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15265	Okay, I see what you mean.
> I'm also not sure it makes sense for this to be in the generic DAG combine. The only reason this is sensible to do, is that our lowering of FP_EXTEND introduces an integer extend, which can then be folded into the integer truncate this DAG combine introduces. In absence of that fact, this DAG combine would likely just make things worse. It's also only really required because we can't sensibly lower extract_subvector.
If I understand it correctly, the target-specific part about this is how SVE's FCVT instruction works, i.e. by interpreting the input as unpacked, hence the need to bitcast/zero-extend, which becomes difficult to express in a generic way (in e.g. a DAGCombine).
15271	is there a reason this is specific to fixed-width vectors? It seems the same problem exists for:
  %op1 = load <vscale x 8 x half>, <vscale x 8 x half>* %a
  %res = fpext <vscale x 8 x half> %op1 to <vscale x 8 x float>
  store <vscale x 8 x float> %res, <vscale x 8 x float>* %b
15292–15300	nit: is this loop maybe better expressed with llvm::any_of?
15326	does the second operand actually matter, or can it be anything/undef?

bsmith added inline comments. Sep 29 2021, 8:53 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15271	For the scalable case the generated extract_subvector doesn't require a trip through memory; instead it just does an unpack lo/hi pair, which is much more reasonable. (In fact this optimization actually makes things worse for scalable: we end up splitting the load but then combining it back together again with a UZP1, meaning the extract_subvectors are still present, but now also with a split load + recombine.)
15292–15300	I'm not sure we can actually do that in this case. Using any_of would automatically dereference the `use_iterator`, which calls `getUser()`, but we also need to be able to call `getUse()` on the iterator.
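
The constraint in this exchange can be illustrated with a small standalone model. The types below (`Node`, `Use`) are hypothetical stand-ins, not LLVM's `SDNode` or `SDUse`, and nothing here claims that LLVM exposed such a range at this revision. The point is only that a standard algorithm fits once each element carries both the user and the result number being used.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Node {}; // hypothetical stand-in for a DAG node

// Hypothetical use record: who uses the load, and which of its results.
struct Use {
  const Node *User;
  unsigned ResNo; // 0 = loaded value, 1 = chain
};

// True when every use of the loaded value (result 0) comes from N.
// Uses of the chain result are ignored, as in the loop under review.
bool onlyValueUserIs(const std::vector<Use> &Uses, const Node *N) {
  assert(N != nullptr && "expected a user node to compare against");
  return std::all_of(Uses.begin(), Uses.end(), [N](const Use &U) {
    return U.ResNo == 1 || U.User == N;
  });
}
```

If iteration only yields the user, as described for `use_iterator` above, the result number is lost and the chain uses cannot be skipped, which is the objection raised in the reply.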

Don't check 2nd operand of UZP1, it isn't used in this case.
Redo test changes after recent test reformatting.
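
Why the second UZP1 operand is irrelevant here can be checked with a lane-level model. This is a sketch under stated assumptions: `bitcastToHalves`, `uzp1`, and `uunpklo` are hypothetical functions that model the architectural behaviour of the corresponding operations on small fixed-size vectors with little-endian lane order. They are not LLVM or ACLE APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Halves = std::vector<uint16_t>;
using Words = std::vector<uint32_t>;

// Reinterpret 32-bit lanes as pairs of 16-bit lanes (low half first).
Halves bitcastToHalves(const Words &W) {
  Halves H;
  for (uint32_t Lane : W) {
    H.push_back(static_cast<uint16_t>(Lane));
    H.push_back(static_cast<uint16_t>(Lane >> 16));
  }
  return H;
}

// UZP1 model: even-indexed elements of A, followed by those of B.
Halves uzp1(const Halves &A, const Halves &B) {
  assert(A.size() == B.size() && "operands must have the same lane count");
  Halves R;
  for (std::size_t I = 0; I < A.size(); I += 2)
    R.push_back(A[I]);
  for (std::size_t I = 0; I < B.size(); I += 2)
    R.push_back(B[I]);
  return R;
}

// UUNPKLO model: zero-extend the elements in the low half of the input.
Words uunpklo(const Halves &H) {
  Words R;
  for (std::size_t I = 0; I < H.size() / 2; ++I)
    R.push_back(H[I]);
  return R;
}
```

In this model the low half of the `uzp1` result comes only from its first operand, and `uunpklo` reads only that low half. When every 32-bit lane of `X` has a zero top half, as after a zero-extending load, `uunpklo(uzp1(bitcastToHalves(X), Y))` reproduces `X` for any `Y`.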

Harbormaster completed remote builds in B126353: Diff 375913. Sep 29 2021, 9:45 AM

bsmith planned changes to this revision. Oct 5 2021, 3:34 AM

Abandoning in favour of D114628 and D114580

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	99 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-fp-extend-trunc.ll	64 lines

Diff 375913

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ 913 lines not shown @@ #undef LCALLNAME5
   setTargetDAGCombine(ISD::CONCAT_VECTORS);
   setTargetDAGCombine(ISD::INSERT_SUBVECTOR);
   setTargetDAGCombine(ISD::STORE);
   if (Subtarget->supportsAddressTopByteIgnored())
     setTargetDAGCombine(ISD::LOAD);

   setTargetDAGCombine(ISD::MUL);

+  setTargetDAGCombine(ISD::FP_EXTEND);
+
   setTargetDAGCombine(ISD::SELECT);
   setTargetDAGCombine(ISD::VSELECT);

   setTargetDAGCombine(ISD::INTRINSIC_VOID);
   setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
   setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);
   setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
   setTargetDAGCombine(ISD::VECREDUCE_ADD);
@@ 14,325 lines not shown @@ if (Op1.getOperand(0).getOpcode() == AArch64ISD::UZP1) {
       SDValue Z = Op1.getOperand(0).getOperand(1);
       return DAG.getNode(AArch64ISD::UZP1, DL, ResVT, Op0, Z);
     }
   }

   return SDValue();
 }

+static SDValue performFpExtendCombine(SDNode *N, SelectionDAG &DAG,
+                                      const AArch64Subtarget *Subtarget) {
+  SDLoc DL(N);
+  SDValue Op = N->getOperand(0);
+  EVT VT = N->getValueType(0);
+
+  if (!VT.isFixedLengthVector())
+    return SDValue();
+
+  if (DAG.getTargetLoweringInfo().isTypeLegal(VT) ||
+      !Subtarget->useSVEForFixedLengthVectors())
+    return SDValue();
+
+  // In cases where the result of the FP_EXTEND is not legal, it will be
+  // expanded into multiple extract_subvectors which cannot be lowered without
+  // going through memory.
+  //
+  // If we push an extend into the load feeding the FP_EXTEND, we can force the
+  // load to be expanded into the same number of parts as the FP_EXTEND,
+  // avoiding the need for extract_subvectors completely.
+  //
+  // As part of the lowering of FP_EXTEND for fixed length types uunpklo nodes
+  // will be introduced which will then combine with the truncate introduced
+  // after the load.
+  if (ISD::isNormalLoad(Op.getNode())) {
+    LoadSDNode *LD = cast<LoadSDNode>(Op.getNode());
+
+    // Check if there are other uses. If so, do not combine as it will introduce
+    // an extra load.
+    for (SDNode::use_iterator UI = LD->use_begin(), UE = LD->use_end();
+         UI != UE; ++UI) {
+      if (UI.getUse().getResNo() == 1) // Ignore uses of the chain result.
+        continue;
+      if (*UI != N)
+        return SDValue();
+    }
+
+    SDValue NewLoad = DAG.getExtLoad(
+        ISD::ZEXTLOAD, DL, VT.changeTypeToInteger(), LD->getChain(),
+        LD->getBasePtr(), LD->getMemoryVT().changeTypeToInteger(),
+        LD->getMemOperand());
+
+    DAG.ReplaceAllUsesOfValueWith(SDValue(LD, 1), NewLoad.getValue(1));
+
+    SDValue Trunc = DAG.getNode(
+        ISD::TRUNCATE, DL, Op->getValueType(0).changeTypeToInteger(), NewLoad);
+    SDValue Bitcast = DAG.getNode(ISD::BITCAST, DL, Op->getValueType(0), Trunc);
+
+    return DAG.getNode(ISD::FP_EXTEND, DL, VT, Bitcast);
+  }
+
+  return SDValue();
+}

+static SDValue performUunpkloCombine(SDNode *N, SelectionDAG &DAG) {
+  SDLoc DL(N);
+  SDValue Op = N->getOperand(0);
+  EVT VT = N->getValueType(0);
+
+  // uunpklo(uzp1(x, x)) where x = bitcast(zextload) -> x
+  if (Op->getOpcode() == AArch64ISD::UZP1) {
+    EVT HalfVT = Op.getValueType();
+
+    // Ensure the unzip input is the same size as the unpack output
+    if (Op->getOperand(0)->getOpcode() != ISD::BITCAST ||
+        Op->getValueType(0) == VT)
+      return SDValue();
+
+    SDValue Bitcast = Op->getOperand(0);
+
+    // Look through bitcasts and unzips
+    SDValue Input = Bitcast->getOperand(0);
+    while (Input->getOpcode() == ISD::BITCAST ||
+           (Input->getOpcode() == AArch64ISD::UZP1 &&
+            Input->getOperand(0) == Input->getOperand(1)))
+      Input = Input->getOperand(0);
+
+    // Input should come from an extending load
+    if (!isa<MaskedLoadSDNode>(Input) ||
+        cast<MaskedLoadSDNode>(Input)->getExtensionType() != ISD::ZEXTLOAD)
+      return SDValue();
+
+    // Ensure that we don't care about the top half of the input
+    EVT MemVT = cast<MaskedLoadSDNode>(Input)->getMemoryVT();
+    if (isPackedVectorType(MemVT, DAG) &&
+        MemVT.getVectorElementType().getScalarSizeInBits() <=
+            HalfVT.getScalarSizeInBits())
+      return Bitcast->getOperand(0);
+  }
+
+  return SDValue();
+}

 static SDValue performGLD1Combine(SDNode *N, SelectionDAG &DAG) {
   unsigned Opc = N->getOpcode();

   assert(((Opc >= AArch64ISD::GLD1_MERGE_ZERO && // unsigned gather loads
            Opc <= AArch64ISD::GLD1_IMM_MERGE_ZERO) ||
           (Opc >= AArch64ISD::GLD1S_MERGE_ZERO && // signed gather loads
            Opc <= AArch64ISD::GLD1S_IMM_MERGE_ZERO)) &&
          "Invalid opcode.");
@@ 1,629 lines not shown @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
   case AArch64ISD::NVCAST:
     return performNVCASTCombine(N);
   case AArch64ISD::SPLICE:
     return performSpliceCombine(N, DAG);
   case AArch64ISD::UZP1:
     return performUzpCombine(N, DAG);
   case AArch64ISD::SETCC_MERGE_ZERO:
     return performSetccMergeZeroCombine(N, DAG);
+  case ISD::FP_EXTEND:
+    return performFpExtendCombine(N, DAG, Subtarget);
   case AArch64ISD::GLD1_MERGE_ZERO:
   case AArch64ISD::GLD1_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1_UXTW_MERGE_ZERO:
   case AArch64ISD::GLD1_SXTW_MERGE_ZERO:
   case AArch64ISD::GLD1_UXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1_SXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1_IMM_MERGE_ZERO:
   case AArch64ISD::GLD1S_MERGE_ZERO:
   case AArch64ISD::GLD1S_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1S_UXTW_MERGE_ZERO:
   case AArch64ISD::GLD1S_SXTW_MERGE_ZERO:
   case AArch64ISD::GLD1S_UXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1S_SXTW_SCALED_MERGE_ZERO:
   case AArch64ISD::GLD1S_IMM_MERGE_ZERO:
     return performGLD1Combine(N, DAG);
   case AArch64ISD::VASHR:
   case AArch64ISD::VLSHR:
     return performVectorShiftCombine(N, *this, DCI);
+  case AArch64ISD::UUNPKLO:
+    return performUunpkloCombine(N, DAG);
   case ISD::INSERT_VECTOR_ELT:
     return performInsertVectorEltCombine(N, DCI);
   case ISD::EXTRACT_VECTOR_ELT:
     return performExtractVectorEltCombine(N, DAG);
   case ISD::VECREDUCE_ADD:
     return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case ISD::INTRINSIC_VOID:
   case ISD::INTRINSIC_W_CHAIN:
@@ 2,073 lines not shown @@

llvm/test/CodeGen/AArch64/sve-fixed-length-fp-extend-trunc.ll

@@ 55 lines not shown @@
 ; CHECK-NEXT: ret
   %op1 = load <8 x half>, <8 x half>* %a
   %res = fpext <8 x half> %op1 to <8 x float>
   store <8 x float> %res, <8 x float>* %b
   ret void
 }

 define void @fcvt_v16f16_v16f32(<16 x half>* %a, <16 x float>* %b) #0 {
-; Ensure sensible type legalisation - fixed type extract_subvector codegen is poor currently.
+; Ensure sensible type legalisation.
 ; VBITS_EQ_256-LABEL: fcvt_v16f16_v16f32:
 ; VBITS_EQ_256: // %bb.0:
-; VBITS_EQ_256-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
-; VBITS_EQ_256-NEXT: sub x9, sp, #48
-; VBITS_EQ_256-NEXT: mov x29, sp
-; VBITS_EQ_256-NEXT: and sp, x9, #0xffffffffffffffe0
-; VBITS_EQ_256-NEXT: .cfi_def_cfa w29, 16
-; VBITS_EQ_256-NEXT: .cfi_offset w30, -8
-; VBITS_EQ_256-NEXT: .cfi_offset w29, -16
-; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
-; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0]
-; VBITS_EQ_256-NEXT: mov x8, sp
-; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x8]
-; VBITS_EQ_256-NEXT: ldp q0, q1, [sp]
-; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
 ; VBITS_EQ_256-NEXT: mov x8, #8
-; VBITS_EQ_256-NEXT: uunpklo z0.s, z0.h
-; VBITS_EQ_256-NEXT: uunpklo z1.s, z1.h
+; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
+; VBITS_EQ_256-NEXT: ld1h { z0.s }, p0/z, [x0, x8, lsl #1]
+; VBITS_EQ_256-NEXT: ld1h { z1.s }, p0/z, [x0]
 ; VBITS_EQ_256-NEXT: fcvt z0.s, p0/m, z0.h
 ; VBITS_EQ_256-NEXT: fcvt z1.s, p0/m, z1.h
-; VBITS_EQ_256-NEXT: st1w { z1.s }, p0, [x1, x8, lsl #2]
-; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x1]
-; VBITS_EQ_256-NEXT: mov sp, x29
-; VBITS_EQ_256-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
+; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x1, x8, lsl #2]
+; VBITS_EQ_256-NEXT: st1w { z1.s }, p0, [x1]
 ; VBITS_EQ_256-NEXT: ret
 ;
 ; VBITS_GE_512-LABEL: fcvt_v16f16_v16f32:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ptrue p0.h, vl16
 ; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.s, vl16
 ; VBITS_GE_512-NEXT: uunpklo z0.s, z0.h
@@ 82 lines not shown @@ ; CHECK-NEXT: ret
   store <4 x double> %res, <4 x double>* %b
   ret void
 }

 define void @fcvt_v8f16_v8f64(<8 x half>* %a, <8 x double>* %b) #0 {
 ; Ensure sensible type legalisation.
 ; VBITS_EQ_256-LABEL: fcvt_v8f16_v8f64:
 ; VBITS_EQ_256: // %bb.0:
-; VBITS_EQ_256-NEXT: ldr q0, [x0]
-; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
 ; VBITS_EQ_256-NEXT: mov x8, #4
-; VBITS_EQ_256-NEXT: uunpklo z1.s, z0.h
-; VBITS_EQ_256-NEXT: ext v0.16b, v0.16b, v0.16b, #8
-; VBITS_EQ_256-NEXT: uunpklo z0.s, z0.h
-; VBITS_EQ_256-NEXT: uunpklo z1.d, z1.s
-; VBITS_EQ_256-NEXT: uunpklo z0.d, z0.s
-; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.h
+; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
+; VBITS_EQ_256-NEXT: ld1h { z0.d }, p0/z, [x0, x8, lsl #1]
+; VBITS_EQ_256-NEXT: ld1h { z1.d }, p0/z, [x0]
 ; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.h
+; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.h
 ; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
 ; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1]
 ; VBITS_EQ_256-NEXT: ret
 ;
 ; VBITS_GE_512-LABEL: fcvt_v8f16_v8f64:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ldr q0, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.d, vl8
@@ 78 lines not shown @@
 ; CHECK-NEXT: ret
   %op1 = load <4 x float>, <4 x float>* %a
   %res = fpext <4 x float> %op1 to <4 x double>
   store <4 x double> %res, <4 x double>* %b
   ret void
 }

 define void @fcvt_v8f32_v8f64(<8 x float>* %a, <8 x double>* %b) #0 {
-; Ensure sensible type legalisation - fixed type extract_subvector codegen is poor currently.
+; Ensure sensible type legalisation.
 ; VBITS_EQ_256-LABEL: fcvt_v8f32_v8f64:
 ; VBITS_EQ_256: // %bb.0:
-; VBITS_EQ_256-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
-; VBITS_EQ_256-NEXT: sub x9, sp, #48
-; VBITS_EQ_256-NEXT: mov x29, sp
-; VBITS_EQ_256-NEXT: and sp, x9, #0xffffffffffffffe0
-; VBITS_EQ_256-NEXT: .cfi_def_cfa w29, 16
-; VBITS_EQ_256-NEXT: .cfi_offset w30, -8
-; VBITS_EQ_256-NEXT: .cfi_offset w29, -16
-; VBITS_EQ_256-NEXT: ptrue p0.s, vl8
-; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
-; VBITS_EQ_256-NEXT: mov x8, sp
-; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x8]
-; VBITS_EQ_256-NEXT: ldp q0, q1, [sp]
-; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
 ; VBITS_EQ_256-NEXT: mov x8, #4
-; VBITS_EQ_256-NEXT: uunpklo z0.d, z0.s
-; VBITS_EQ_256-NEXT: uunpklo z1.d, z1.s
+; VBITS_EQ_256-NEXT: ptrue p0.d, vl4
+; VBITS_EQ_256-NEXT: ld1w { z0.d }, p0/z, [x0, x8, lsl #2]
+; VBITS_EQ_256-NEXT: ld1w { z1.d }, p0/z, [x0]
 ; VBITS_EQ_256-NEXT: fcvt z0.d, p0/m, z0.s
 ; VBITS_EQ_256-NEXT: fcvt z1.d, p0/m, z1.s
-; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1, x8, lsl #3]
-; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1]
-; VBITS_EQ_256-NEXT: mov sp, x29
-; VBITS_EQ_256-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
+; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
+; VBITS_EQ_256-NEXT: st1d { z1.d }, p0, [x1]
 ; VBITS_EQ_256-NEXT: ret
 ;
 ; VBITS_GE_512-LABEL: fcvt_v8f32_v8f64:
 ; VBITS_GE_512: // %bb.0:
 ; VBITS_GE_512-NEXT: ptrue p0.s, vl8
 ; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
 ; VBITS_GE_512-NEXT: ptrue p0.d, vl8
 ; VBITS_GE_512-NEXT: uunpklo z0.d, z0.s
@@ 375 lines not shown @@