
[x86] vectorize more cast ops in lowering to avoid register file transfers
ClosedPublic

Authored by spatel on Feb 13 2019, 11:15 AM.

Details

Summary

This is a follow-up to D56864.

If we're extracting from a non-zero index before casting to FP, then shuffle the source element into lane 0 and optionally narrow the vector before doing the cast:

cast (extelt V, C) --> extelt (cast (extract_subv (shuffle V, [C...]))), 0

This might be enough to close PR39974:
https://bugs.llvm.org/show_bug.cgi?id=39974

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Feb 13 2019, 11:15 AM
Herald added a project: Restricted Project.Feb 13 2019, 11:15 AM
RKSimon added inline comments.Feb 13 2019, 12:49 PM
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Any idea why this still fails?

spatel marked an inline comment as done.Feb 13 2019, 1:39 PM
spatel added inline comments.
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

This is almost the same example as in:
https://bugs.llvm.org/show_bug.cgi?id=39975

On x86-64 only (because the 64-bit shift isn't legal on i686), we scalarize the shift. That leaves the shift sitting between the extract and the cast, so the pattern doesn't match.

RKSimon added inline comments.Feb 14 2019, 4:17 AM
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Hmm - is it worth us investigating a trunc(lshr(extract(v2i64 x, i), 32)) -> trunc(extract(v2i64 x, i+1)) combine? (and variants)

spatel marked an inline comment as done.Feb 14 2019, 6:02 AM
spatel added inline comments.
llvm/test/CodeGen/X86/known-signbits-vector.ll
79 ↗(On Diff #186701)

Yeah, I thought we already had that transform, but that case isn't matched.

Note that this example is an ashr, not lshr, and the example in PR39975 is probably tougher because it's a shift-by-33. We might want to carve out a general exception to the scalarization transform for these kinds of cases.

RKSimon accepted this revision.Feb 20 2019, 12:18 PM

LGTM, cheers

This revision is now accepted and ready to land.Feb 20 2019, 12:18 PM
This revision was automatically updated to reflect the committed changes.
efriedma added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874

Why not "vcvtudq2pd %xmm0, %xmm0"?

spatel marked an inline comment as done.Feb 22 2019, 6:36 AM
spatel added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874

We're only matching the generic UINT_TO_FP node, so we go from <4 x i32> to <4 x double>. That's also why the SSE targets don't narrow the similar SINT_TO_FP test above. I can look into how the SINT_TO_FP example gets narrowed and try to make that happen here too.

There's no documentation that the generic nodes can change the number of elements in the vector, so I'm assuming they don't have that ability. Currently, we use the X86ISD::CVTSI2P for those patterns, so I think we need to extend the matching logic to handle that case to solve this more completely.

spatel marked an inline comment as done.Feb 22 2019, 7:55 AM
spatel added inline comments.
llvm/trunk/test/CodeGen/X86/vec_int_to_fp.ll
5874