
[x86] use vector instructions to lower FP->int->FP casts
ClosedPublic

Authored by spatel on Apr 10 2020, 1:16 PM.

Details

Summary

As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM by using the vector versions of the convert instructions.

Based on experimental results from recent Intel/AMD chips, we don't need to worry about triggering denorm stalls while operating on garbage data in the high lanes with convert instructions, so this is expected to always give perf as good as or better than the scalar instruction equivalent. FP exceptions are also not a concern because strict code should not be using the regular SDAG opcodes.

Diff Detail

Event Timeline

spatel created this revision. Apr 10 2020, 1:16 PM

Looks ok to me in principle.

pcordes accepted this revision. Apr 10 2020, 2:07 PM
This revision is now accepted and ready to land. Apr 10 2020, 2:07 PM

Looks good to me, too. SSE4.1 still uses roundss, which is at least as good (1 uop on SnB and AMD, 2 uops on HSW/SKL).

While we're looking at this, could we use [v]roundss $11, src, src, dst when available for no-fast-math?

float->int overflow is UB, so ISO C allows us to assume it doesn't happen. Unless we want to go beyond that, we don't need to preserve the behaviour, for out-of-range inputs, of producing 0x80000000 as the integer result of the overflowing (int) cast and then converting that back to a float representing that negative value. But if this optimization stage happens late enough, on IR that does expect those semantics, that's different. Also, similar asm instructions on other ISAs might not give the same bit-pattern, so I'm not sure how useful preserving it is; e.g. ARM conversions saturate to the bounds of the int value-range.

(I meant to just mark my review as "looks good to me", but I think that actually marked it as officially accepted. Let me know if that wasn't the preferred course of action.)

> While we're looking at this, could we use [v]roundss $11, src, src, dst when available for no-fast-math?
>
> float->int overflow is UB so ISO C allows us to assume it doesn't happen. [...]

The problem is -0.0 rather than overflow. (The non-obvious "#0" attribute on some of the tests is for "no-signed-zeros-fp-math"="true"; I should change the test name to make it clearer.)
We used to produce roundss more aggressively, but we had to back that out because it's wrong:
D48085
We also added a bailout flag for the transform because there's too much code in the wild that relied on some particular UB overflow behavior.

> The problem is -0.0 rather than overflow.

Ah yes, that's a showstopper for correctness, thanks.

Is anyone working on a similar patch for double-precision? Zen2 apparently has single-uop CVTPD2DQ xmm, xmm and CVTDQ2PD.

https://www.uops.info/table.html?search=cvtpd2dq&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CON=on&cb_SNB=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on

It may be break-even on other CPUs for throughput, and a win for latency.

Conroe and Nehalem have single-uop CVTTSD2SI (R32, XMM) so scalar round trip is best there for throughput.

SnB-family CPUs take 2 uops for conversions to/from SI and for PD<->DQ, but the xmm->GP-integer direction runs on port 0, so a scalar round trip distributes back-end port pressure more evenly. The best choice therefore depends on the surrounding code if tuning for Intel without caring about Zen2. In a loop that *only* does (float)(int)x, scalar has 1/clock throughput on Skylake because it avoids a port-5 bottleneck: the two scalar instructions are p0 + p01 and p5 + p01, vs. cvttpd2dq and back, which would both be p5 + p01.

We can avoid a false dependency by converting back into the XMM reg we came from, so scalar round trip can avoid needing an extra instruction for xor-zeroing the destination. (But apparently we missed that optimization for float in the testcases without this patch).

RKSimon accepted this revision. Apr 11 2020, 4:08 AM

LGTM, cheers - I'm not sure what's stopping us from handling the float/double->int->double/float conversions as well in this patch, though.

llvm/lib/Target/X86/X86ISelLowering.cpp
19157

What's stopping us getting this done in this patch as well?

spatel marked 2 inline comments as done. Apr 11 2020, 5:36 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
19157

useVectorCast() needs to be updated (or we just account for instruction availability directly here).

Not a big deal, but I figured it was better to go piecemeal with the legacy of broken software from the related fptrunc patches. :)

I'll add more tests and make the enhancements soon.

This revision was automatically updated to reflect the committed changes.
spatel marked an inline comment as done.
Herald added a project: Restricted Project. Apr 12 2020, 7:28 AM