This is an archive of the discontinued LLVM Phabricator instance.

v2f32 ops
Needs Review · Public

Authored by jonpa on Feb 23 2017, 5:25 AM.

Details

Reviewers

hliao
ab
delena
uweigand
igorb
RKSimon

Summary

A few days ago I described this issue on llvm-commits, along with a patch:

Hi,

I found that on SystemZ, for v2f32, four and not two scalar operations are emitted. This is because the v2f32 type is widened, which is good in cases of memory-only operations for instance. There is however no fp32 vector support on z13, so these will always be scalarized. If this is done after type-legalization, four and not two operations are produced, which is particularly bad in case of fp32 divide inside a vectorized loop.

In order to fix this, my patch unrolls these operations before type legalization (with a target DAG combine). They must also have the operation action 'Expand', since otherwise other DAGCombiner methods may re-vectorize them. This happened in reduceBuildVecConvertToConvertBuildVec(), where I also needed to add a call to TLI.isOperationLegalOrCustom(Opcode, VT), to check on the *result* type, which would in this case be v2f32. It would not work to mark all the operand VTs of SINT_TO_FP as 'Expand', because this is only true for the v2f32 result case.

I do get some failing regression tests, which I am not sure about:

Failing Tests (3):
    LLVM :: CodeGen/ARM/vdup.ll
    LLVM :: CodeGen/X86/2009-02-26-MachineLICMBug.ll
    LLVM :: CodeGen/X86/cvtv2f32.ll

Is it ok to add the check for the result type per what I did?

/Jonas

In short, this is something I did for the SystemZ target, but because I had to add an extra check in reduceBuildVecConvertToConvertBuildVec(), three tests failed.

I have now tried to regenerate the tests, but only one of them was successfully regenerated. I include that test diff, plus the two output diffs of the two remaining test cases.

I hope that either the tests have improved as expected (which I can't tell for sure myself), or that someone can give me a pointer on how to change the patch.

Thanks

/Jonas

Event Timeline

jonpa created this revision. Feb 23 2017, 5:25 AM

Herald added a subscriber: aemerson. Feb 23 2017, 5:25 AM

igorb added a reviewer: delena. Feb 23 2017, 6:09 AM

On platforms where the widened operation is legal or custom-lowered, we want to use that operation. (For example, if v4f32 uitofp is custom, we also want to use that lowering for v2f32.)

It looks like x86 in particular is legalizing v2f32 uint_to_fp nodes earlier than it should, so the DAGCombine can't trigger after type legalization.

It looks like on ARM, we can't perform the transform after type legalization because we end up with a 32-bit sitofp (because ARM doesn't have 16-bit registers). The "SrcVT != InVT" check could be extended to handle this case.

That said, if you don't want to mess with other targets, maybe you could change the new check in DAGCombine to "if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))"?

test/CodeGen/X86/cvtv2f32.ll
Line 21 (On Diff #89502): This looks worse.
Line 31 (On Diff #89502): This... is arguably better, I guess, but you're not really reaching it in any principled manner.
test_CodeGen_ARM_vdup.diff
Line 83 (On Diff #89502): The old code was loading directly to a vector register (vld1 is a splat load, vmovl is a sign-extend, and vcvt is the int->float conversion). The new code is loading to an integer register (ldrsh), moving to an fp register (vmov), converting to fp (vcvt), then splatting the result (vdup). The new code is slightly worse. (Ultimately, we want to do the splat before the int->fp conversion here because we can fold the splat into the load.)
test_CodeGen_X86_MLICMbug.diff
Line 58 (On Diff #89502): This looks worse.

craig.topper added a reviewer: RKSimon. Feb 23 2017, 9:39 PM

Thanks for help with the test regressions!

> That said, if you don't want to mess with other targets, maybe you could change the new check in DAGCombine to "if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))"?

I don't think that would work, because this problem arose as an infinite loop *before* type legalization (the point was to call DAG.UnrollVectorOp() before type legalization with only two elements).

> On platforms where the widened operation is legal or custom-lowered, we want to use that operation. (For example, if v4f32 uitofp is custom, we also want to use that lowering for v2f32.)

I tried this idea by also adding a check for the widened VT, but that didn't work. In vdup.ll, this optimization is needed to make a 'v4f32 = sint_to_fp' node. This is however marked 'Expand', so TLI.isOperationLegalOrCustom() doesn't work.

Instead, I made my check more precise so that it handles exactly one case: before type legalization, if the result type is going to be widened and both the narrow and the widened operations are 'Expand', the combine does not vectorize (it returns SDValue()). This seems to work better, with no regressions.
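For readers following the condition, here is a minimal sketch of that decision as a standalone predicate in plain C++. It is a toy model, not the DAGCombiner code: the enums, the `CombineQuery` struct and `shouldKeepScalarized` are hypothetical names that stand in for the results of the LegalTypes flag, TLI.getTypeAction() and TLI.isOperationExpand() on the narrow and the widened VT.

```cpp
#include <cassert>

// Hypothetical stand-ins for the TargetLowering query results.
enum class TypeAction { Legal, WidenVector };
enum class OpAction { Legal, Custom, Expand };

struct CombineQuery {
  bool LegalTypes;             // true once type legalization has run
  TypeAction ResultTypeAction; // what the legalizer will do to the result VT
  OpAction NarrowOp;           // action for the opcode on the result VT (v2f32)
  OpAction WideOp;             // action for the opcode on the widened VT (v4f32)
};

// Returns true when the combine should bail out and leave the scalarized
// operations alone, mirroring the "return SDValue()" path described above.
bool shouldKeepScalarized(const CombineQuery &Q) {
  // If the type is not widened, both fields describe the same type.
  assert((Q.ResultTypeAction == TypeAction::WidenVector ||
          Q.WideOp == Q.NarrowOp) &&
         "no widening means a single action");
  return !Q.LegalTypes && Q.NarrowOp == OpAction::Expand &&
         Q.ResultTypeAction == TypeAction::WidenVector &&
         Q.WideOp == OpAction::Expand;
}
```

In this model, a target where the widened operation is 'Custom' still gets the vectorized form, which matches the reviewer's point that a legal or custom widened lowering should be used when it exists.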

This might even be a reasonable general DAG heuristic: scalarize early in this case. Or is it not?

I also realized that I don't have to call setOperationAction(Op, MVT::v2f32, Expand) at all, since this is already the default for all vector ops in SystemZ.

What is stopping you from implementing SystemZTargetLowering::ReplaceNodeResults to handle these?

Please can you keep the x86/arm regressions in your patch.

> What is stopping you from implementing SystemZTargetLowering::ReplaceNodeResults to handle these?

The problem I see is with the FP_TO_XINT nodes. The fp -> v2i64 conversions are legal, which is correct for v2f64 -> v2i64. In practice this also means that v2f32 -> v2i64 is treated as legal, which it is not, but this is basically ok since there is no vector support for f32. Unfortunately, there is no way to specify that just v2f32 -> v2iXX should be custom, based on the operand type.

There does not seem to be a solution to this problem at the moment: it depends on checking the operand type, while CustomWidenLowerNode() only checks the result VT.
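The limitation can be pictured with a small toy table in plain C++. This is only a sketch of the situation described in this comment, not the TargetLowering data structure: `ActionTable`, its string-typed keys and the Expand default for unset entries are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class Action { Legal, Custom, Expand };

// Toy action table keyed on (opcode, result type) only. Type names are plain
// strings here; this is a hypothetical model, not LLVM code.
class ActionTable {
  std::map<std::pair<std::string, std::string>, Action> Table;

public:
  void set(const std::string &Opcode, const std::string &ResultVT, Action A) {
    assert(!Opcode.empty() && !ResultVT.empty() && "keys must be named");
    Table[{Opcode, ResultVT}] = A;
  }

  // OperandVT is accepted but deliberately unused: the key has no room for it.
  Action lookup(const std::string &Opcode, const std::string &OperandVT,
                const std::string &ResultVT) const {
    (void)OperandVT;
    auto It = Table.find({Opcode, ResultVT});
    // Assumption for this sketch: unset vector entries default to Expand.
    return It == Table.end() ? Action::Expand : It->second;
  }
};
```

Once FP_TO_SINT is set to Legal for a v2i64 result, the lookup returns Legal for both a v2f64 and a v2f32 operand, which is the ambiguity described above.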

It might be possible to mark all FP_TO_XINT nodes of all integer vector types as custom, but that would also affect other things, such as cost functions and what not, so I am not sure that would be worth bothering with.

I could apply the patch as it is or move that transformation to ReplaceNodeResults() and maybe just ignore the FP_TO_XINT nodes for now, since fp32 is not a high priority.

> Please can you keep the x86/arm regressions in your patch.

The test regressions are gone because I avoided them with the tighter check. Sorry I didn't mention that more clearly.

As I said before, if anything I would like to try to make this a general rule: if a node is going to get widened and then expanded, it should be expanded before type legalization (widening) if possible. Otherwise, unnecessary scalar operations will be built for the undef elements of the widened vector.

The main reason that ReplaceNodeResults() cannot be used in this case is that it is run after type legalization and the widening performed there on these operations. By then the VT has changed to v4f32 and it is no longer possible to expand to just two ops.
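To make the lane counting concrete, here is a minimal toy model in plain C++. It does not use LLVM; `Lane`, `Vec`, `widen` and `scalarizeFDiv` are hypothetical names invented for this sketch. It assumes a naive expander that emits one scalar operation per lane of whatever vector type it is handed, so the padding lanes added by widening cost as much as the real ones.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy model only: a lane either holds a float or is undef (std::nullopt).
using Lane = std::optional<float>;
using Vec = std::vector<Lane>;

// Models type legalization widening v2f32 to v4f32: pad with undef lanes.
Vec widen(Vec V, std::size_t WideElts) {
  assert(V.size() <= WideElts && "widening never drops lanes");
  V.resize(WideElts); // new lanes are value-initialized, i.e. undef
  return V;
}

// Models scalarizing an fdiv on a target without vector fp32 support.
// Returns the number of scalar divides emitted; the result goes to Out.
std::size_t scalarizeFDiv(const Vec &A, const Vec &B, Vec &Out) {
  assert(A.size() == B.size() && "operands must have the same lane count");
  Out.clear();
  std::size_t NumScalarOps = 0;
  for (std::size_t I = 0; I != A.size(); ++I) {
    // One scalar op per lane, padding included: after widening, the
    // expander only sees a four-lane node.
    ++NumScalarOps;
    Out.push_back(A[I] && B[I] ? Lane(*A[I] / *B[I]) : Lane());
  }
  return NumScalarOps;
}
```

With two-lane operands, `scalarizeFDiv` reports two divides. Passing the same operands through `widen(..., 4)` first reports four, two of them on undef lanes.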

RKSimon resigned from this revision. Apr 7 2018, 9:19 AM

Herald added a subscriber: kristof.beyls. Apr 7 2018, 9:19 AM

RKSimon mentioned this in D51325: [X86] Type legalize v2i32 div/rem by scalarizing rather than promoting. Sep 7 2018, 10:48 AM

Revision Contents

Path                                          Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp      13 lines
lib/Target/SystemZ/SystemZISelLowering.cpp    21 lines
test/CodeGen/SystemZ/fp32-vec-conv.ll         41 lines
test/CodeGen/SystemZ/fp32-vec-ops.ll          49 lines

Diff 89633

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[13,295 lines not shown]

   assert((Opcode == ISD::UINT_TO_FP || Opcode == ISD::SINT_TO_FP)
          && "Should only handle conversion from integer to float.");
   assert(SrcVT != MVT::Other && "Cannot determine source type!");

   EVT NVT = EVT::getVectorVT(*DAG.getContext(), SrcVT, NumInScalars);

   if (!TLI.isOperationLegalOrCustom(Opcode, NVT))
     return SDValue();

+  // A target may want to call DAG.UnrollVectorOp() on a node which is going
+  // to be widened and then expanded. This is better to do before type
+  // legalization, because then only two scalar operations result (an
+  // infinite loop would result if this function would re-vectorize the op).
+  if (!LegalTypes && TLI.isOperationExpand(Opcode, VT)) {
+    LLVMContext &Context = *DAG.getContext();
+    if (TLI.getTypeAction(Context, VT) == TargetLowering::TypeWidenVector) {
+      EVT WideVT = TLI.getTypeToTransformTo(Context, VT);
+      if (TLI.isOperationExpand(Opcode, WideVT))
+        return SDValue();
+    }
+  }
+
   // Just because the floating-point vector type is legal does not necessarily
   // mean that the corresponding integer vector type is.
   if (!isTypeLegal(NVT))
     return SDValue();

   SmallVector<SDValue, 8> Opnds;
   for (unsigned i = 0; i != NumInScalars; ++i) {
     SDValue In = N->getOperand(i);

[2,699 lines not shown]

lib/Target/SystemZ/SystemZISelLowering.cpp

[443 lines not shown; context: SystemZTargetLowering::SystemZTargetLowering(const TargetMachine &TM,]

   setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
   setTargetDAGCombine(ISD::FP_ROUND);
   setTargetDAGCombine(ISD::BSWAP);
   setTargetDAGCombine(ISD::SHL);
   setTargetDAGCombine(ISD::SRA);
   setTargetDAGCombine(ISD::SRL);
   setTargetDAGCombine(ISD::ROTL);

+  // Scalarize v2f32 early, to avoid later expansion to 4 operations (see
+  // comment in PerformDAGCombine).
+  SmallVector<ISD::NodeType, 12> FP32Ops =
+    {ISD::FADD, ISD::FSUB, ISD::FMUL, ISD::FDIV, ISD::FREM, ISD::SINT_TO_FP,
+     ISD::UINT_TO_FP, ISD::FP_TO_SINT, ISD::FP_TO_UINT};
+  for (auto Op : FP32Ops)
+    setTargetDAGCombine(Op);
+
   // Handle intrinsics.
   setOperationAction(ISD::INTRINSIC_W_CHAIN, MVT::Other, Custom);
   setOperationAction(ISD::INTRINSIC_WO_CHAIN, MVT::Other, Custom);

   // We want to use MVC in preference to even a single load/store pair.
   MaxStoresPerMemcpy = 0;
   MaxStoresPerMemcpyOptSize = 0;

[4,719 lines not shown; context: SDValue SystemZTargetLowering::combineSHIFTROT(]

   }

   return SDValue();
 }

 SDValue SystemZTargetLowering::PerformDAGCombine(SDNode *N,
                                                  DAGCombinerInfo &DCI) const {
   switch(N->getOpcode()) {
-  default: break;
+  default:
+    // Z13 can handle fp32 vectors in registers and memory, but does not
+    // support any vector operations on them. v2f32 is widened to v4f32 and
+    // kept in a single vector register, but any operations on v2f32 should
+    // be scalarized before type legalization, or else all four operations
+    // will actually be emitted.
+    if (N->getValueType(0) == MVT::v2f32 ||
+        ((N->getOpcode() == ISD::FP_TO_SINT || N->getOpcode() == ISD::FP_TO_UINT) &&
+         (N->getOperand(0)->getValueType(0) == MVT::v2f32)))
+      return DCI.DAG.UnrollVectorOp(N, 2);
+
+    break;
   case ISD::SIGN_EXTEND:        return combineSIGN_EXTEND(N, DCI);
   case SystemZISD::MERGE_HIGH:
   case SystemZISD::MERGE_LOW:   return combineMERGE(N, DCI);
   case ISD::STORE:              return combineSTORE(N, DCI);
   case ISD::EXTRACT_VECTOR_ELT: return combineEXTRACT_VECTOR_ELT(N, DCI);
   case SystemZISD::JOIN_DWORDS: return combineJOIN_DWORDS(N, DCI);
   case ISD::FP_ROUND:           return combineFP_ROUND(N, DCI);
   case ISD::BSWAP:              return combineBSWAP(N, DCI);

[1,148 lines not shown]

test/CodeGen/SystemZ/fp32-vec-conv.ll

This file was added.

; Test that a vector of two floats only generates two instructions (and not
; four).
;
; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s


define <2 x float> @fun0(<2 x i32> %val1) {
; CHECK-LABEL: fun0:
; CHECK: celfbr
; CHECK: celfbr
; CHECK-NOT: celfbr
  %z = uitofp <2 x i32> %val1 to <2 x float>
  ret <2 x float> %z
}

define <2 x float> @fun1(<2 x i32> %val1) {
; CHECK-LABEL: fun1:
; CHECK: cefbr
; CHECK: cefbr
; CHECK-NOT: cefbr
  %z = sitofp <2 x i32> %val1 to <2 x float>
  ret <2 x float> %z
}

define <2 x i32> @fun2(<2 x float> %val1) {
; CHECK-LABEL: fun2:
; CHECK: cfebr
; CHECK: cfebr
; CHECK-NOT: cfebr
  %z = fptosi <2 x float> %val1 to <2 x i32>
  ret <2 x i32> %z
}

define <2 x i32> @fun3(<2 x float> %val1) {
; CHECK-LABEL: fun3:
; CHECK: clfebr
; CHECK: clfebr
; CHECK-NOT: clfebr
  %z = fptoui <2 x float> %val1 to <2 x i32>
  ret <2 x i32> %z
}

test/CodeGen/SystemZ/fp32-vec-ops.ll

This file was added.

; Test that a vector of two floats only generates two instructions (and not
; four).
;
; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s

define <2 x float> @fun0(<2 x float> %val1, <2 x float> %val2) {
; CHECK-LABEL: fun0:
; CHECK: aebr
; CHECK: aebr
; CHECK-NOT: aebr
  %ret = fadd <2 x float> %val1, %val2
  ret <2 x float> %ret
}

define <2 x float> @fun1(<2 x float> %val1, <2 x float> %val2) {
; CHECK-LABEL: fun1:
; CHECK: sebr
; CHECK: sebr
; CHECK-NOT: sebr
  %ret = fsub <2 x float> %val1, %val2
  ret <2 x float> %ret
}

define <2 x float> @fun2(<2 x float> %val1, <2 x float> %val2) {
; CHECK-LABEL: fun2:
; CHECK: meebr
; CHECK: meebr
; CHECK-NOT: meebr
  %ret = fmul <2 x float> %val1, %val2
  ret <2 x float> %ret
}

define <2 x float> @fun3(<2 x float> %val1, <2 x float> %val2) {
; CHECK-LABEL: fun3:
; CHECK: debr
; CHECK: debr
; CHECK-NOT: debr
  %ret = fdiv <2 x float> %val1, %val2
  ret <2 x float> %ret
}

define <2 x float> @fun4(<2 x float> %val1, <2 x float> %val2) {
; CHECK-LABEL: fun4:
; CHECK: brasl %r14, fmodf@PLT
; CHECK: brasl %r14, fmodf@PLT
; CHECK-NOT: brasl %r14, fmodf@PLT
  %ret = frem <2 x float> %val1, %val2
  ret <2 x float> %ret
}