This is an archive of the discontinued LLVM Phabricator instance.

Prefer blendps over insertps codegen for one special case [X86]
ClosedPublic

Authored by spatel on Mar 13 2015, 2:49 PM.

Details

Reviewers

qcolombet
chandlerc
ab
mkuper

Commits

rGc88f724fedef: [X86] Prefer blendps over insertps codegen for one special case
rL232850: [X86] Prefer blendps over insertps codegen for one special case

Summary

I had originally made this a FIXME in D7866, but we're attacking the problem from different angles now. If we don't have a target-specific combine on insertps, we need to generate the right code in the first place.

With this patch, for this one exact case, we'll generate:

blendps %xmm0, %xmm1, $1

instead of:

insertps %xmm0, %xmm1, $0

If there's a memory operand available for load folding and we're optimizing for size, we'll still generate the insertps.

The detailed performance data motivating this change is in D7866; in summary, blendps has 2-3x the throughput of insertps on widely used chips.
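As a rough illustration of why the two instructions are interchangeable in this one case, the following is a scalar C++ model of the immediate operands for the register-source forms. The names Vec4, modelInsertps, and modelBlendps are invented for this sketch; it only mirrors the documented bit layout of the immediates and is not how LLVM or the hardware implements anything.

```cpp
#include <array>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Scalar model of SSE4.1 INSERTPS with a register source (illustrative only).
// imm[7:6] selects the source lane, imm[5:4] selects the destination lane,
// and imm[3:0] is a mask of result lanes that are forced to 0.0f.
Vec4 modelInsertps(Vec4 dst, const Vec4 &src, std::uint8_t imm) {
  const unsigned srcLane = (imm >> 6) & 3;
  const unsigned dstLane = (imm >> 4) & 3;
  dst[dstLane] = src[srcLane];
  for (unsigned i = 0; i < 4; ++i)
    if (imm & (1u << i))
      dst[i] = 0.0f;
  return dst;
}

// Scalar model of SSE4.1 BLENDPS (illustrative only).
// If bit i of imm[3:0] is set, lane i is taken from src; otherwise from dst.
Vec4 modelBlendps(Vec4 dst, const Vec4 &src, std::uint8_t imm) {
  for (unsigned i = 0; i < 4; ++i)
    if (imm & (1u << i))
      dst[i] = src[i];
  return dst;
}
```

Under this model, modelInsertps(a, b, 0) and modelBlendps(a, b, 1) both copy lane 0 of b into lane 0 of a and leave the other three lanes alone, which is the "one exact case" the patch targets. Any other destination lane or a nonzero zero mask has no single-bit blend equivalent of this form.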

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 21953. Mar 13 2015, 2:49 PM

spatel retitled this revision to Prefer blendps over insertps codegen for one special case [X86].

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: chandlerc, qcolombet, mkuper, ab.

spatel added a subscriber: Unknown Object (MLST).

spatel mentioned this in D7866: [X86] tranform insertps to blendps when possible for better performance. Mar 13 2015, 2:55 PM

Ping.

qcolombet added inline comments. Mar 20 2015, 11:23 AM

lib/Target/X86/X86ISelLowering.cpp
10520 (On Diff #21953): Instead of checking for OptimizeForSize, I would check for MinSize or both.
10521 (On Diff #21953): As soon as there is a folding opportunity, shouldn’t it be better to use it? Could you check that with IACA?

spatel added inline comments. Mar 20 2015, 12:25 PM

lib/Target/X86/X86ISelLowering.cpp
10520 (On Diff #21953): Hi Quentin - Thanks for looking at the patch. I had not seen MinSize used before. That corresponds to -Oz?
10521 (On Diff #21953): I checked this with real code running on SandyBridge, Haswell, and Jaguar. Load folding does not improve performance here. The usage of insertps is the limiting factor because it can only execute on one port. Here's the SB result from the earlier patch for a microbenchmark including loads:
blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).

LGTM with the MinSize fix.

lib/Target/X86/X86ISelLowering.cpp
10520 (On Diff #21953): Yes, it is Oz.
10521 (On Diff #21953): Thanks for checking.

Patch updated based on feedback from Quentin:
Changed function attribute check from 'optsize' to 'minsize' (-Os vs. -Oz)
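The selection rule that results from this change can be summarized in a small standalone sketch. The enum and the function chooseF32InsertLowering are hypothetical names used only for illustration and are not LLVM APIs; the boolean condition is the one from the updated patch, with the DAG queries replaced by plain parameters.

```cpp
// Which instruction the lowering would pick for an f32 insertelement
// (hypothetical type, for illustration only).
enum class InsertLowering { Blendps, Insertps };

// idx:         constant element index being inserted into
// minSize:     the function carries the 'minsize' attribute (-Oz)
// mayFoldLoad: the scalar operand could be folded as a memory operand
InsertLowering chooseF32InsertLowering(unsigned idx, bool minSize,
                                       bool mayFoldLoad) {
  // Prefer the blend for an insert into lane 0, unless we are minimizing
  // size and insertps could fold the load (blendps has no 32-bit memory
  // operand form, so it would need a separate movss).
  if (idx == 0 && (!minSize || !mayFoldLoad))
    return InsertLowering::Blendps;
  return InsertLowering::Insertps;
}
```

This matches the updated tests: a minsize function whose scalar comes from memory (the i386 case) keeps insertps, while the same function with the scalar already in a register (the x86-64 case) gets blendps.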

Thanks Sanjay!

This revision is now accepted and ready to land. Mar 20 2015, 1:51 PM

Closed by commit rL232850: [X86] Prefer blendps over insertps codegen for one special case (authored by spatel). Mar 20 2015, 2:22 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	31 lines
llvm/trunk/test/CodeGen/X86/sse41.ll	41 lines

Diff 22377

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

 if (EltVT.getSizeInBits() == 8 || EltVT.getSizeInBits() == 16) {
   if (N1.getValueType() != MVT::i32)
     N1 = DAG.getNode(ISD::ANY_EXTEND, dl, MVT::i32, N1);
   if (N2.getValueType() != MVT::i32)
     N2 = DAG.getIntPtrConstant(IdxVal);
   return DAG.getNode(Opc, dl, VT, N0, N1, N2);
 }

 if (EltVT == MVT::f32) {
   // Bits [7:6] of the constant are the source select. This will always be
   // zero here. The DAG Combiner may combine an extract_elt index into
-  // these
-  // bits. For example (insert (extract, 3), 2) could be matched by
-  // putting
-  // the '3' into bits [7:6] of X86ISD::INSERTPS.
+  // these bits. For example (insert (extract, 3), 2) could be matched by
+  // putting the '3' into bits [7:6] of X86ISD::INSERTPS.
   // Bits [5:4] of the constant are the destination select. This is the
   // value of the incoming immediate.
   // Bits [3:0] of the constant are the zero mask. The DAG Combiner may
   // combine either bitwise AND or insert of float 0.0 to set these bits.
+
+  const Function *F = DAG.getMachineFunction().getFunction();
+  bool MinSize = F->hasFnAttribute(Attribute::MinSize);
+  if (IdxVal == 0 && (!MinSize || !MayFoldLoad(N1))) {
+    // If this is an insertion of 32-bits into the low 32-bits of
+    // a vector, we prefer to generate a blend with immediate rather
+    // than an insertps. Blends are simpler operations in hardware and so
+    // will always have equal or better performance than insertps.
+    // But if optimizing for size and there's a load folding opportunity,
+    // generate insertps because blendps does not have a 32-bit memory
+    // operand form.
+    N2 = DAG.getIntPtrConstant(1);
+    N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);
+    return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1, N2);
+  }
   N2 = DAG.getIntPtrConstant(IdxVal << 4);
   // Create this as a scalar to vector..
   N1 = DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, N1);
   return DAG.getNode(X86ISD::INSERTPS, dl, VT, N0, N1, N2);
 }

 if (EltVT == MVT::i32 || EltVT == MVT::i64) {
   // PINSR* works with constant index.

llvm/trunk/test/CodeGen/X86/sse41.ll

 ; X64-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[1,2,3]
 ; X64-NEXT: retq
   %tmp1 = call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %t1, <4 x float> %t2, i32 1) nounwind readnone
   ret <4 x float> %tmp1
 }

 declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i32) nounwind readnone

-define <4 x float> @insertps_2(<4 x float> %t1, float %t2) nounwind {
-; X32-LABEL: insertps_2:
+; When optimizing for speed, prefer blendps over insertps even if it means we have to
+; generate a separate movss to load the scalar operand.
+define <4 x float> @blendps_not_insertps_1(<4 x float> %t1, float %t2) nounwind {
+; X32-LABEL: blendps_not_insertps_1:
+; X32: ## BB#0:
+; X32-NEXT: movss {{.*#+}} xmm1
+; X32-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; X32-NEXT: retl
+;
+; X64-LABEL: blendps_not_insertps_1:
+; X64: ## BB#0:
+; X64-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; X64-NEXT: retq
+  %tmp1 = insertelement <4 x float> %t1, float %t2, i32 0
+  ret <4 x float> %tmp1
+}
+
+; When optimizing for size, generate an insertps if there's a load fold opportunity.
+; The difference between i386 and x86-64 ABIs for the float operand means we should
+; generate an insertps for X32 but not for X64!
+define <4 x float> @insertps_or_blendps(<4 x float> %t1, float %t2) minsize nounwind {
+; X32-LABEL: insertps_or_blendps:
 ; X32: ## BB#0:
 ; X32-NEXT: insertps {{.*#+}} xmm0 = mem[0],xmm0[1,2,3]
 ; X32-NEXT: retl
 ;
-; X64-LABEL: insertps_2:
+; X64-LABEL: insertps_or_blendps:
 ; X64: ## BB#0:
-; X64-NEXT: insertps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; X64-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
 ; X64-NEXT: retq
   %tmp1 = insertelement <4 x float> %t1, float %t2, i32 0
   ret <4 x float> %tmp1
 }

-define <4 x float> @insertps_3(<4 x float> %t1, <4 x float> %t2) nounwind {
-; X32-LABEL: insertps_3:
+; An insert into the low 32-bits of a vector from the low 32-bits of another vector
+; is always just a blendps because blendps is never more expensive than insertps.
+define <4 x float> @blendps_not_insertps_2(<4 x float> %t1, <4 x float> %t2) nounwind {
+; X32-LABEL: blendps_not_insertps_2:
 ; X32: ## BB#0:
-; X32-NEXT: insertps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; X32-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
 ; X32-NEXT: retl
 ;
-; X64-LABEL: insertps_3:
+; X64-LABEL: blendps_not_insertps_2:
 ; X64: ## BB#0:
-; X64-NEXT: insertps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; X64-NEXT: blendps {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
 ; X64-NEXT: retq
   %tmp2 = extractelement <4 x float> %t2, i32 0
   %tmp1 = insertelement <4 x float> %t1, float %tmp2, i32 0
   ret <4 x float> %tmp1
 }

 define i32 @ptestz_1(<2 x i64> %t1, <2 x i64> %t2) nounwind {
 ; X32-LABEL: ptestz_1: