This is an archive of the discontinued LLVM Phabricator instance.

Optimized instruction sequence for sitofp operation on X86-32
ClosedPublic

Authored by delena on Jan 7 2016, 2:48 AM.

Details

Reviewers

Commits

rG542dfcf44c22: Optimized instruction sequence for sitofp operation on X86-32
rL257285: Optimized instruction sequence for sitofp operation on X86-32

Summary

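Optimizes the code generated for
sitofp i64 %x to double
on 32-bit X86 when the result lives in an SSE register. The current sequence

movl %ecx, 8(%esp)
movl %edx, 12(%esp)
fildll 8(%esp)

is replaced with

movd %ecx, %xmm0
movd %edx, %xmm1
punpckldq %xmm1, %xmm0
movq %xmm0, 8(%esp)
fildll 8(%esp)

so the 64-bit value is written with a single store, and the fildll load no longer has to be forwarded from two separate 32-bit stores.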
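The data flow of the new sequence can be modelled in portable C++. This is only a sketch: the function names are hypothetical, `memcpy` stands in for the bitcast and for the stack store, and a plain cast stands in for `fildll`. It shows that packing the two halves, treating the bits as an f64, and storing them once feeds the same integer to the conversion as before.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models movd + movd + punpckldq: the two 32-bit halves are combined into
// one 64-bit pattern held in a single register.
inline std::uint64_t pack_halves(std::uint32_t lo, std::uint32_t hi) {
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Models the i64 -> f64 bitcast: same bits, different type.
inline double bitcast_to_f64(std::uint64_t bits) {
  static_assert(sizeof(double) == sizeof(std::uint64_t), "f64 must be 64 bits");
  double d;
  std::memcpy(&d, &bits, sizeof d);
  return d;
}

// Models one 64-bit store (movq) followed by a signed 64-bit integer load
// and conversion (fildll).
inline double store_and_fild(double value_to_store) {
  unsigned char stack_slot[8];
  std::memcpy(stack_slot, &value_to_store, sizeof stack_slot);
  std::int64_t loaded;
  std::memcpy(&loaded, stack_slot, sizeof loaded);
  return static_cast<double>(loaded);
}

// Hypothetical end-to-end model of the new lowering for sitofp i64 -> double.
inline double sitofp_i64_model(std::int64_t x) {
  const auto bits = static_cast<std::uint64_t>(x);
  const auto lo = static_cast<std::uint32_t>(bits);
  const auto hi = static_cast<std::uint32_t>(bits >> 32);
  return store_and_fild(bitcast_to_f64(pack_halves(lo, hi)));
}

// Small self-check; call it from main() to exercise the model.
inline void check_sitofp_i64_model() {
  assert(sitofp_i64_model(5) == 5.0);
  assert(sitofp_i64_model(-1) == -1.0);
}
```

The f64 value is never used arithmetically; it only carries the integer's bit pattern to memory in one piece.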

Diff Detail

Repository: rL LLVM

Event Timeline

delena updated this revision to Diff 44200. · Jan 7 2016, 2:48 AM

delena retitled this revision to Optimized instruction sequence for sitofp operation on X86-32.

delena updated this object.

delena added a reviewer: mbodart.

delena set the repository for this revision to rL LLVM.

delena added a subscriber: llvm-commits.

Hi Elena,

Just a few minor comments, otherwise LGTM!

mitch

../lib/Target/X86/X86ISelLowering.cpp
Line 12658: It would be useful to have a source comment here to the effect: "Bitcasting to f64 here allows us to do a single 64-bit store from an SSE register, avoiding the store forwarding penalty that would come with two 32-bit stores."
../test/CodeGen/X86/scalar-int-to-fp.ll
Lines 76–78: For both test functions, u64_to_f and s64_to_f, we should add the following additional checks before the fildll:
AVX512_32: punpckldq
SSE_32: punpckldq
Lines 145–155: Rather than creating a new function, it would be simpler to just add a check for punpckldq, for both SSE2_32 and AVX512_32, in the existing s64_to_d function.

delena marked 2 inline comments as done. · Jan 10 2016, 1:37 AM

delena added inline comments.

../test/CodeGen/X86/scalar-int-to-fp.ll
Lines 145–155: The existing s64_to_d function does not exercise the new sequence, because its input parameter is already on the stack. This is the code generated for s64_to_d:

pushl %ebp
movl %esp, %ebp
andl $-8, %esp
subl $8, %esp
fildll 8(%ebp)
fstpl (%esp)
fldl (%esp)
movl %ebp, %esp
popl %ebp
retl
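To make the distinction concrete, here are hypothetical C++ counterparts of the two test functions (the names mirror the .ll tests; the C++ itself is not part of the patch). In the first, the i64 argument can be converted straight from its incoming stack slot. In the second, the addition first produces a new value, so on i386 that value presumably has to be stored before `fildll` can load it, which is where the single 64-bit SSE store applies.

```cpp
#include <cassert>

// Counterpart of s64_to_d: the argument is converted as passed.
inline double s64_to_d(long long a) { return static_cast<double>(a); }

// Counterpart of s64_to_d_2: the add yields a fresh 64-bit value that is
// not yet in memory. Signed overflow is undefined in C++, unlike the
// wrapping IR add, so callers should keep a + 5 in range.
inline double s64_to_d_2(long long a) { return static_cast<double>(a + 5); }

// Small self-check; call it from main() to exercise both functions.
inline void check_s64_to_d() {
  assert(s64_to_d(7) == 7.0);
  assert(s64_to_d_2(7) == 12.0);
}
```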

Added the same bitcast for UINT_TO_FP.
Added more tests.
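For context, the UINT_TO_FP path stores the unsigned value, loads it as a signed integer with fildll, and then adds a power of two if the input looked negative. A rough C++ model follows. The function name is hypothetical, and `long double` stands in for x87 extended precision, which is an assumption: on platforms where `long double` is no wider than `double`, some inputs may be rounded twice.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of uitofp i64 -> double via a signed 64-bit load.
inline double uitofp_i64_model(std::uint64_t u) {
  std::int64_t s;
  std::memcpy(&s, &u, sizeof s);             // same bits, read as signed
  long double x = static_cast<long double>(s);  // models fildll
  if (s < 0)
    x += 18446744073709551616.0L;            // add 2^64 to undo the sign wrap
  return static_cast<double>(x);             // round once to the result type
}

// Small self-check; call it from main() to exercise the model.
inline void check_uitofp_i64_model() {
  assert(uitofp_i64_model(5) == 5.0);
  assert(uitofp_i64_model(1ULL << 63) == 9223372036854775808.0);
}
```

Only the store feeding the signed load changes in this revision; the correction step is existing behaviour.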

Closed by commit rL257285: Optimized instruction sequence for sitofp operation on X86-32 (authored by delena). · Jan 10 2016, 1:45 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                        Size
../lib/Target/X86/X86ISelLowering.cpp       58 lines
../test/CodeGen/X86/dagcombine-cse.ll       2 lines
../test/CodeGen/X86/scalar-int-to-fp.ll     43 lines

Diff 44421

../lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
   if (!X86ScalarSSEf64) {
     setOperationAction(ISD::BITCAST , MVT::f32 , Expand);
     setOperationAction(ISD::BITCAST , MVT::i32 , Expand);
     if (Subtarget->is64Bit()) {
       setOperationAction(ISD::BITCAST , MVT::f64 , Expand);
       // Without SSE, i64->f64 goes through memory.
       setOperationAction(ISD::BITCAST , MVT::i64 , Expand);
     }
-  }
+  } else if (!Subtarget->is64Bit())
+    setOperationAction(ISD::BITCAST , MVT::i64 , Custom);
 
   // Scalar integer divide and remainder are lowered to use operations that
   // produce two results, to match the available instructions. This exposes
   // the two-result form to trivial CSE, which is able to combine x/y and x%y
   // into a single instruction.
   //
   // Scalar integer multiply-high is also lowered to use two-result
   // operations, to match the available instructions. However, plain multiply
@@ ... @@ SDValue X86TargetLowering::LowerSINT_TO_FP(SDValue Op,
   // Legal.
   if (SrcVT == MVT::i32 && isScalarFPTypeInSSEReg(Op.getValueType()))
     return Op;
   if (SrcVT == MVT::i64 && isScalarFPTypeInSSEReg(Op.getValueType()) &&
       Subtarget->is64Bit()) {
     return Op;
   }
 
+  SDValue ValueToStore = Op.getOperand(0);
+  if (SrcVT == MVT::i64 && isScalarFPTypeInSSEReg(Op.getValueType()) &&
+      !Subtarget->is64Bit())
+    // Bitcasting to f64 here allows us to do a single 64-bit store from
+    // an SSE register, avoiding the store forwarding penalty that would come
+    // with two 32-bit stores.
+    ValueToStore = DAG.getBitcast(MVT::f64, ValueToStore);
+
   unsigned Size = SrcVT.getSizeInBits()/8;
   MachineFunction &MF = DAG.getMachineFunction();
   auto PtrVT = getPointerTy(MF.getDataLayout());
   int SSFI = MF.getFrameInfo()->CreateStackObject(Size, Size, false);
   SDValue StackSlot = DAG.getFrameIndex(SSFI, PtrVT);
   SDValue Chain = DAG.getStore(
-      DAG.getEntryNode(), dl, Op.getOperand(0), StackSlot,
+      DAG.getEntryNode(), dl, ValueToStore, StackSlot,
       MachinePointerInfo::getFixedStack(DAG.getMachineFunction(), SSFI), false,
       false, 0);
   return BuildFILD(Op, SrcVT, Chain, StackSlot, DAG);
 }
 
 SDValue X86TargetLowering::BuildFILD(SDValue Op, EVT SrcVT, SDValue Chain,
                                      SDValue StackSlot,
                                      SelectionDAG &DAG) const {
@@ ... @@ if (SrcVT == MVT::i32) {
     SDValue Store2 = DAG.getStore(Store1, dl, DAG.getConstant(0, dl, MVT::i32),
                                   OffsetSlot, MachinePointerInfo(),
                                   false, false, 0);
     SDValue Fild = BuildFILD(Op, MVT::i64, Store2, StackSlot, DAG);
     return Fild;
   }
 
   assert(SrcVT == MVT::i64 && "Unexpected type in UINT_TO_FP");
-  SDValue Store = DAG.getStore(DAG.getEntryNode(), dl, Op.getOperand(0),
+  SDValue ValueToStore = Op.getOperand(0);
+  if (isScalarFPTypeInSSEReg(Op.getValueType()) && !Subtarget->is64Bit())
+    // Bitcasting to f64 here allows us to do a single 64-bit store from
+    // an SSE register, avoiding the store forwarding penalty that would come
+    // with two 32-bit stores.
+    ValueToStore = DAG.getBitcast(MVT::f64, ValueToStore);
+  SDValue Store = DAG.getStore(DAG.getEntryNode(), dl, ValueToStore,
                                StackSlot, MachinePointerInfo(),
                                false, false, 0);
   // For i64 source, we need to add the appropriate power of 2 if the input
   // was negative. This is the same as the optimization in
   // DAGTypeLegalizer::ExpandIntOp_UNIT_TO_FP, and for it to be safe here,
   // we must be careful to do the computation in x87 extended precision, not
   // in SSE. (The generic code can't know it's OK to do this, or how to.)
   int SSFI = cast<FrameIndexSDNode>(StackSlot)->getIndex();
@@ ... @@ static SDValue LowerCMP_SWAP(SDValue Op, const X86Subtarget *Subtarget,
   return SDValue();
 }
 
 static SDValue LowerBITCAST(SDValue Op, const X86Subtarget *Subtarget,
                             SelectionDAG &DAG) {
   MVT SrcVT = Op.getOperand(0).getSimpleValueType();
   MVT DstVT = Op.getSimpleValueType();
 
-  if (SrcVT == MVT::v2i32 || SrcVT == MVT::v4i16 || SrcVT == MVT::v8i8) {
+  if (SrcVT == MVT::v2i32 || SrcVT == MVT::v4i16 || SrcVT == MVT::v8i8 ||
+      SrcVT == MVT::i64) {
     assert(Subtarget->hasSSE2() && "Requires at least SSE2!");
     if (DstVT != MVT::f64)
       // This conversion needs to be expanded.
       return SDValue();
 
-    SDValue InVec = Op->getOperand(0);
+    SDValue Op0 = Op->getOperand(0);
+    SmallVector<SDValue, 16> Elts;
     SDLoc dl(Op);
-    unsigned NumElts = SrcVT.getVectorNumElements();
-    MVT SVT = SrcVT.getVectorElementType();
+    unsigned NumElts;
+    MVT SVT;
+    if (SrcVT.isVector()) {
+      NumElts = SrcVT.getVectorNumElements();
+      SVT = SrcVT.getVectorElementType();
 
       // Widen the vector in input in the case of MVT::v2i32.
       // Example: from MVT::v2i32 to MVT::v4i32.
-      SmallVector<SDValue, 16> Elts;
       for (unsigned i = 0, e = NumElts; i != e; ++i)
-        Elts.push_back(DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, SVT, InVec,
+        Elts.push_back(DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, SVT, Op0,
                                    DAG.getIntPtrConstant(i, dl)));
+    } else {
+      assert(SrcVT == MVT::i64 && !Subtarget->is64Bit() &&
+             "Unexpected source type in LowerBITCAST");
+      Elts.push_back(DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op0,
+                                 DAG.getIntPtrConstant(0, dl)));
+      Elts.push_back(DAG.getNode(ISD::EXTRACT_ELEMENT, dl, MVT::i32, Op0,
+                                 DAG.getIntPtrConstant(1, dl)));
+      NumElts = 2;
+      SVT = MVT::i32;
+    }
     // Explicitly mark the extra elements as Undef.
     Elts.append(NumElts, DAG.getUNDEF(SVT));
 
     EVT NewVT = EVT::getVectorVT(*DAG.getContext(), SVT, NumElts * 2);
     SDValue BV = DAG.getNode(ISD::BUILD_VECTOR, dl, NewVT, Elts);
     SDValue ToV2F64 = DAG.getBitcast(MVT::v2f64, BV);
     return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::f64, ToV2F64,
                        DAG.getIntPtrConstant(0, dl));
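The new i64 branch of LowerBITCAST splits the scalar into two i32 elements, pads to four elements, reinterprets the result as v2f64, and extracts element 0. A plain C++ model of that shape is sketched below. The function name is hypothetical, zeros stand in for the undef elements, and the array layout assumes a little-endian host such as x86.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of the i64 -> f64 bitcast lowering on 32-bit X86.
inline double bitcast_i64_via_v4i32(std::uint64_t x) {
  std::uint32_t elts[4] = {
      static_cast<std::uint32_t>(x),        // EXTRACT_ELEMENT 0: low half
      static_cast<std::uint32_t>(x >> 32),  // EXTRACT_ELEMENT 1: high half
      0, 0};                                // stand-ins for the undef elements
  double v2f64[2];
  static_assert(sizeof elts == sizeof v2f64, "v4i32 and v2f64 are both 128 bits");
  std::memcpy(v2f64, elts, sizeof v2f64);   // bitcast v4i32 -> v2f64
  return v2f64[0];                          // EXTRACT_VECTOR_ELT, index 0
}

// Small self-check; 0x4014000000000000 is the IEEE-754 encoding of 5.0.
inline void check_bitcast_i64_via_v4i32() {
  assert(bitcast_i64_via_v4i32(0x4014000000000000ULL) == 5.0);
}
```

Only the low 64 bits of the vector matter, which is why the upper two elements can be left undefined in the real lowering.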

../test/CodeGen/X86/dagcombine-cse.ll

 ; REQUIRES: asserts
-; RUN: llc < %s -march=x86 -mattr=+sse2 -mtriple=i386-apple-darwin -stats 2>&1 | grep asm-printer | grep 14
+; RUN: llc < %s -march=x86 -mattr=+sse2 -mtriple=i386-apple-darwin -stats 2>&1 | grep asm-printer | grep 13
 
 define i32 @t(i8* %ref_frame_ptr, i32 %ref_frame_stride, i32 %idxX, i32 %idxY) nounwind {
 entry:
   %tmp7 = mul i32 %idxY, %ref_frame_stride ; <i32> [#uses=2]
   %tmp9 = add i32 %tmp7, %idxX ; <i32> [#uses=1]
   %tmp11 = getelementptr i8, i8* %ref_frame_ptr, i32 %tmp9 ; <i8*> [#uses=1]
   %tmp1112 = bitcast i8* %tmp11 to i32* ; <i32*> [#uses=1]
   %tmp13 = load i32, i32* %tmp1112, align 4 ; <i32> [#uses=1]

../test/CodeGen/X86/scalar-int-to-fp.ll

@@ ... @@
 
 ; CHECK-LABEL: s32_to_x
 ; CHECK: fildl
 define x86_fp80 @s32_to_x(i32 %a) nounwind {
   %r = sitofp i32 %a to x86_fp80
   ret x86_fp80 %r
 }
 
 ; CHECK-LABEL: u64_to_f
+; AVX512_32: vmovq {{.*#+}} xmm0 = mem[0],zero
+; AVX512_32: vmovlpd %xmm0, {{[0-9]+}}(%esp)
 ; AVX512_32: fildll
+
 ; AVX512_64: vcvtusi2ssq
+
+; SSE2_32: movq {{.*#+}} xmm0 = mem[0],zero
+; SSE2_32: movq %xmm0, {{[0-9]+}}(%esp)
 ; SSE2_32: fildll
+
 ; SSE2_64: cvtsi2ssq
 ; X87: fildll
 define float @u64_to_f(i64 %a) nounwind {
   %r = uitofp i64 %a to float
   ret float %r
 }
 
 ; CHECK-LABEL: s64_to_f
 ; AVX512_32: fildll
 ; AVX512_64: vcvtsi2ssq
 ; SSE2_32: fildll
 ; SSE2_64: cvtsi2ssq
 ; X87: fildll
 define float @s64_to_f(i64 %a) nounwind {
   %r = sitofp i64 %a to float
   ret float %r
 }
 
+; CHECK-LABEL: s64_to_f_2
+; SSE2_32: movd %ecx, %xmm0
+; SSE2_32: movd %eax, %xmm1
+; SSE2_32: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSE2_32: movq %xmm1, {{[0-9]+}}(%esp)
+; SSE2_32: fildll {{[0-9]+}}(%esp)
+
+; AVX512_32: vmovd %eax, %xmm0
+; AVX512_32: vpinsrd $1, %ecx, %xmm0, %xmm0
+; AVX512_32: vmovlpd %xmm0, {{[0-9]+}}(%esp)
+; AVX512_32: fildll {{[0-9]+}}(%esp)
+
+define float @s64_to_f_2(i64 %a) nounwind {
+  %a1 = add i64 %a, 5
+  %r = sitofp i64 %a1 to float
+  ret float %r
+}
+
 ; CHECK-LABEL: u64_to_d
 ; AVX512_32: vpunpckldq
 ; AVX512_64: vcvtusi2sdq
 ; SSE2_32: punpckldq
 ; SSE2_64: punpckldq
 ; X87: fildll
 define double @u64_to_d(i64 %a) nounwind {
   %r = uitofp i64 %a to double
   ret double %r
 }
 
 ; CHECK-LABEL: s64_to_d
 ; AVX512_32: fildll
 ; AVX512_64: vcvtsi2sdq
 ; SSE2_32: fildll
 ; SSE2_64: cvtsi2sdq
 ; X87: fildll
 define double @s64_to_d(i64 %a) nounwind {
   %r = sitofp i64 %a to double
   ret double %r
 }
 
+; CHECK-LABEL: s64_to_d_2
+; SSE2_32: movd %ecx, %xmm0
+; SSE2_32: movd %eax, %xmm1
+; SSE2_32: punpckldq %xmm0, %xmm1
+; SSE2_32: movq %xmm1, {{[0-9]+}}(%esp)
+; SSE2_32: fildll
+
+; AVX512_32: vmovd %eax, %xmm0
+; AVX512_32: vpinsrd $1, %ecx, %xmm0, %xmm0
+; AVX512_32: vmovlpd %xmm0, {{[0-9]+}}(%esp)
+; AVX512_32: fildll
+
+define double @s64_to_d_2(i64 %a) nounwind {
+  %b = add i64 %a, 5
+  %f = sitofp i64 %b to double
+  ret double %f
+}
+
 ; CHECK-LABEL: u64_to_x
 ; CHECK: fildll
 define x86_fp80 @u64_to_x(i64 %a) nounwind {
   %r = uitofp i64 %a to x86_fp80
   ret x86_fp80 %r
 }
 
 ; CHECK-LABEL: s64_to_x
 ; CHECK: fildll
 define x86_fp80 @s64_to_x(i64 %a) nounwind {
   %r = sitofp i64 %a to x86_fp80
   ret x86_fp80 %r
 }