This is an archive of the discontinued LLVM Phabricator instance.

[x86, SSE] change patterns for CMPP to float types to allow matching with SSE1 (PR28044)
Closed · Public

Authored by spatel on Jun 10 2016, 11:57 AM.


Details

Reviewers

RKSimon
ab
craig.topper

Commits

rG977530a8c9e3: [x86, SSE] change patterns for CMPP to float types to allow matching with SSE1…
rL272511: [x86, SSE] change patterns for CMPP to float types to allow matching with…

Summary

This patch is intended to solve:
https://llvm.org/bugs/show_bug.cgi?id=28044

By changing the definition of X86ISD::CMPP to use float types, we allow it to be created and pass legalization for an SSE1-only target where v4i32 is not legal.
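
A minimal C++ sketch of why a float-typed compare result loses nothing, assuming a simple 4-lane model of the node: the compare writes an all-ones or all-zeros bit pattern into each float lane, and a bitcast recovers the integer mask with the bits unchanged. The names `V4F`, `V4I`, `cmpEqFloatTyped`, and `bitcastToInt` are hypothetical and are not LLVM APIs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using V4F = std::array<float, 4>;
using V4I = std::array<std::int32_t, 4>;

// Hypothetical model of a packed "ordered equal" compare whose result is
// typed as four float lanes. Each lane holds a raw bit mask, not a number.
V4F cmpEqFloatTyped(const V4F &a, const V4F &b) {
  V4F out{};
  for (std::size_t i = 0; i < 4; ++i) {
    // operator== on floats is false when either side is NaN (ordered equal).
    const std::uint32_t bits = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
    std::memcpy(&out[i], &bits, sizeof bits);
  }
  return out;
}

// Model of a bitcast: reinterpret the same 128 bits as four i32 lanes.
V4I bitcastToInt(const V4F &v) {
  static_assert(sizeof(V4F) == sizeof(V4I), "lane layouts must match");
  V4I out{};
  std::memcpy(out.data(), v.data(), sizeof out);
  return out;
}
```

In this model the integer view of a true lane is -1 and of a false lane is 0, which is the same mask an integer-typed compare result would have carried.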

The motivational trail for this change includes:
https://llvm.org/bugs/show_bug.cgi?id=28001

and eventually makes this trigger:
http://reviews.llvm.org/D21190

I.e., after this step, we should be free to have Clang generate FP compare IR instead of x86 intrinsics for the C-level SSE packed compare intrinsics. (We can auto-upgrade and remove the LLVM sse.cmp intrinsics as a follow-up step.) Once we're generating vector IR instead of x86 intrinsics, a big pile of generic optimizations can trigger.
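
As an illustration of source code that expresses a packed FP compare without any x86 intrinsic, the sketch below uses the GCC/Clang `vector_size` extension. This is an assumption for illustration: the extension is compiler-specific, not standard C++, and the type and function names are hypothetical. An element-wise `==` on such vectors yields an integer mask vector, which is the source-level counterpart of a vector compare followed by a sign extension.

```cpp
// GCC/Clang vector extension types (compiler-specific, assumes 32-bit int).
typedef float v4f __attribute__((vector_size(16)));
typedef int v4i __attribute__((vector_size(16)));

// Generic element-wise equality: each result lane is -1 (all bits set) when
// the lanes compare equal and 0 otherwise. No target intrinsic is involved.
v4i cmpEqGeneric(v4f a, v4f b) {
  return a == b;
}
```

Because nothing here names a target instruction, the choice of how to lower the compare is left to the compiler.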

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 60380. · Jun 10 2016, 11:57 AM

spatel retitled this revision to [x86, SSE] change patterns for CMPP to float types to allow matching with SSE1 (PR28044).

spatel updated this object.

spatel added reviewers: ab, RKSimon, craig.topper.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. · Jun 10 2016, 11:57 AM

Are there already SSE1/SSE2+ test cases covering all the float comparisons? All I know of is vector-compare-results.ll

lib/Target/X86/X86ISelLowering.cpp
Line 15146 (on Diff #60380): Please can you place comments near each hardcoded SSECC constant explaining what they are?

In D21235#455044, @RKSimon wrote:

Are there already SSE1/SSE2+ test cases covering all the float comparisons? All I know of is vector-compare-results.ll

I don't think there's an exhaustive test for each possible fcmp predicate in IR, but most of them are covered in commute-fcmp.ll after:
http://reviews.llvm.org/rL272397

Also, I have a patch in progress for the intrinsic reduction that gets enabled by this, and that will change:
sse-intrinsics-fast-isel.ll
sse2-intrinsics-fast-isel.ll
to IR fcmp tests rather than llvm.x86.sse[2].cmp.[ps/pd], so we should have complete coverage after that.

Patch updated:

Add comments to explain magic constant values.
Add TODO comment for potential Intel AVX optimization (not sure how software is supposed to detect this difference between AMD and Intel).
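
For reference, the constants in question are the immediate encodings of the SSE compare predicates: 0 EQ, 1 LT, 2 LE, 3 UNORD, 4 NEQ, 5 NLT, 6 NLE, 7 ORD. The scalar C++ sketch below, with hypothetical names that are not part of LLVM, shows how the two predicates missing from that set can be formed from two compares: SETUEQ as UNORD or EQ, and SETONE as ORD and NEQ.

```cpp
#include <cassert>
#include <cmath>

// Immediate encodings of the eight SSE packed-compare predicates.
enum SSEPredicate : unsigned {
  CC_EQ = 0, CC_LT = 1, CC_LE = 2, CC_UNORD = 3,
  CC_NEQ = 4, CC_NLT = 5, CC_NLE = 6, CC_ORD = 7
};

// Scalar model of one lane of a packed compare with the given immediate.
bool evalPredicate(unsigned imm, float a, float b) {
  assert(imm < 8 && "only the eight SSE predicates are modeled");
  const bool unordered = std::isnan(a) || std::isnan(b);
  switch (imm) {
  case CC_EQ:    return a == b;
  case CC_LT:    return a < b;
  case CC_LE:    return a <= b;
  case CC_UNORD: return unordered;
  case CC_NEQ:   return !(a == b);  // true for unordered inputs
  case CC_NLT:   return !(a < b);   // true for unordered inputs
  case CC_NLE:   return !(a <= b);  // true for unordered inputs
  default:       return !unordered; // CC_ORD
  }
}

// SETUEQ (unordered or equal): UNORD combined with EQ by a logical OR.
bool cmpUEQ(float a, float b) {
  return evalPredicate(CC_UNORD, a, b) || evalPredicate(CC_EQ, a, b);
}

// SETONE (ordered and not equal): ORD combined with NEQ by a logical AND.
bool cmpONE(float a, float b) {
  return evalPredicate(CC_ORD, a, b) && evalPredicate(CC_NEQ, a, b);
}
```

The pairs (3, 0) and (7, 4) are the same CC0/CC1 values that the patch annotates in LowerVSETCC.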

LGTM

This revision is now accepted and ready to land. · Jun 11 2016, 5:39 AM

Closed by commit rL272511: [x86, SSE] change patterns for CMPP to float types to allow matching with… (authored by spatel). · Jun 12 2016, 8:10 AM

This revision was automatically updated to reflect the committed changes.

hokein mentioned this in D21278: Fix an enumeral mismatch warning. · Jun 13 2016, 1:49 AM

hokein mentioned this in rL272539: Fix an enumeral mismatch warning. · Jun 13 2016, 2:10 AM

Revision Contents

Path                                                   Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp          48 lines
llvm/trunk/lib/Target/X86/X86InstrFragmentsSIMD.td     2 lines
llvm/trunk/lib/Target/X86/X86InstrSSE.td               24 lines
llvm/trunk/test/CodeGen/X86/sse1.ll                    51 lines

Diff 60469

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


@@ static SDValue LowerVSETCC(SDValue Op, const X86Subtarget &Subtarget, @@
   SDLoc dl(Op);

   if (isFP) {
 #ifndef NDEBUG
     MVT EltVT = Op0.getSimpleValueType().getVectorElementType();
     assert(EltVT == MVT::f32 || EltVT == MVT::f64);
 #endif

-    unsigned SSECC = translateX86FSETCC(SetCCOpcode, Op0, Op1);
-    unsigned Opc = X86ISD::CMPP;
+    unsigned Opc;
     if (Subtarget.hasAVX512() && VT.getVectorElementType() == MVT::i1) {
       assert(VT.getVectorNumElements() <= 16);
       Opc = X86ISD::CMPM;
+    } else {
+      Opc = X86ISD::CMPP;
+      // The SSE/AVX packed FP comparison nodes are defined with a
+      // floating-point vector result that matches the operand type. This allows
+      // them to work with an SSE1 target (integer vector types are not legal).
+      VT = Op0.getSimpleValueType();
     }
-    // In the two special cases we can't handle, emit two comparisons.
+    // In the two cases not handled by SSE compare predicates (SETUEQ/SETONE),
+    // emit two comparisons and a logic op to tie them together.
+    // TODO: This can be avoided if Intel (and only Intel as of 2016) AVX is
+    // available.
+    SDValue Cmp;
+    unsigned SSECC = translateX86FSETCC(SetCCOpcode, Op0, Op1);
     if (SSECC == 8) {
+      // LLVM predicate is SETUEQ or SETONE.
       unsigned CC0, CC1;
       unsigned CombineOpc;
       if (SetCCOpcode == ISD::SETUEQ) {
-        CC0 = 3; CC1 = 0; CombineOpc = ISD::OR;
+        CC0 = 3; // UNORD
+        CC1 = 0; // EQ
+        CombineOpc = Opc == X86ISD::CMPP ? X86ISD::FOR : ISD::OR;
       } else {
         assert(SetCCOpcode == ISD::SETONE);
-        CC0 = 7; CC1 = 4; CombineOpc = ISD::AND;
+        CC0 = 7; // ORD
+        CC1 = 4; // NEQ
+        CombineOpc = Opc == X86ISD::CMPP ? X86ISD::FAND : ISD::AND;
       }

       SDValue Cmp0 = DAG.getNode(Opc, dl, VT, Op0, Op1,
                                  DAG.getConstant(CC0, dl, MVT::i8));
       SDValue Cmp1 = DAG.getNode(Opc, dl, VT, Op0, Op1,
                                  DAG.getConstant(CC1, dl, MVT::i8));
-      return DAG.getNode(CombineOpc, dl, VT, Cmp0, Cmp1);
-    }
-    // Handle all other FP comparisons here.
-    return DAG.getNode(Opc, dl, VT, Op0, Op1,
-                       DAG.getConstant(SSECC, dl, MVT::i8));
+      Cmp = DAG.getNode(CombineOpc, dl, VT, Cmp0, Cmp1);
+    } else {
+      // Handle all other FP comparisons here.
+      Cmp = DAG.getNode(Opc, dl, VT, Op0, Op1,
+                        DAG.getConstant(SSECC, dl, MVT::i8));
+    }
+
+    // If this is SSE/AVX CMPP, bitcast the result back to integer to match the
+    // result type of SETCC. The bitcast is expected to be optimized away
+    // during combining/isel.
+    if (Opc == X86ISD::CMPP)
+      Cmp = DAG.getBitcast(Op.getSimpleValueType(), Cmp);
+
+    return Cmp;
   }

   MVT VTOp0 = Op0.getSimpleValueType();
   assert(VTOp0 == Op1.getSimpleValueType() &&
          "Expected operands with same type!");
   assert(VT.getVectorNumElements() == VTOp0.getVectorNumElements() &&
          "Invalid number of packed elements for source and destination!");

   if (VT.is128BitVector() && VTOp0.is256BitVector()) {
     // On non-AVX512 targets, a vector of MVT::i1 is promoted by the type

@@ if (IsSEXT0 && IsVZero1) { @@
         return DAG.getNOT(DL, LHS.getOperand(0), VT);

       assert((CC == ISD::SETNE || CC == ISD::SETLT) &&
              "Unexpected condition code!");
       return LHS.getOperand(0);
     }
   }

+  // For an SSE1-only target, lower to X86ISD::CMPP early to avoid scalarization
+  // via legalization because v4i32 is not a legal type.
+  if (Subtarget.hasSSE1() && !Subtarget.hasSSE2() && VT == MVT::v4i32)
+    return LowerVSETCC(SDValue(N, 0), Subtarget, DAG);
+
   return SDValue();
 }

 static SDValue combineGatherScatter(SDNode *N, SelectionDAG &DAG) {
   SDLoc DL(N);
   // Gather and Scatter instructions use k-registers for masks. The type of
   // the masks is v*i1. So the mask will be truncated anyway.
   // The SIGN_EXTEND_INREG my be dropped.

llvm/trunk/lib/Target/X86/X86InstrFragmentsSIMD.td

 def load_mvmmx : PatFrag<(ops node:$ptr),
                          (x86mmx (MMX_X86movw2d (load node:$ptr)))>;
 def bc_mmx : PatFrag<(ops node:$in), (x86mmx (bitconvert node:$in))>;

 //===----------------------------------------------------------------------===//
 // SSE specific DAG Nodes.
 //===----------------------------------------------------------------------===//

-def SDTX86VFCMP : SDTypeProfile<1, 3, [SDTCisInt<0>, SDTCisSameAs<1, 2>,
+def SDTX86VFCMP : SDTypeProfile<1, 3, [SDTCisFP<0>, SDTCisSameAs<1, 2>,
                                        SDTCisFP<1>, SDTCisVT<3, i8>,
                                        SDTCisVec<1>]>;
 def SDTX86CmpTestSae : SDTypeProfile<1, 3, [SDTCisVT<0, i32>,
                                             SDTCisSameAs<1, 2>, SDTCisInt<3>]>;

 def X86fmin : SDNode<"X86ISD::FMIN", SDTFPBinOp>;
 def X86fmax : SDNode<"X86ISD::FMAX", SDTFPBinOp>;

llvm/trunk/lib/Target/X86/X86InstrSSE.td


@@ defm CMPPS : sse12_cmp_packed<VR128, f128mem, SSECC, int_x86_sse_cmp_ps, @@
                  SSEPackedSingle, i8immZExt5, memopv4f32, SSE_ALU_F32P>, PS;
   defm CMPPD : sse12_cmp_packed<VR128, f128mem, SSECC, int_x86_sse2_cmp_pd,
                  "cmp${cc}pd\t{$src2, $dst|$dst, $src2}",
                  "cmppd\t{$cc, $src2, $dst|$dst, $src2, $cc}",
                  SSEPackedDouble, i8immZExt5, memopv2f64, SSE_ALU_F64P>, PD;
 }

 let Predicates = [HasAVX] in {
-def : Pat<(v4i32 (X86cmpp (v4f32 VR128:$src1), VR128:$src2, imm:$cc)),
+def : Pat<(v4f32 (X86cmpp (v4f32 VR128:$src1), VR128:$src2, imm:$cc)),
           (VCMPPSrri (v4f32 VR128:$src1), (v4f32 VR128:$src2), imm:$cc)>;
-def : Pat<(v4i32 (X86cmpp (v4f32 VR128:$src1), (loadv4f32 addr:$src2), imm:$cc)),
+def : Pat<(v4f32 (X86cmpp (v4f32 VR128:$src1), (loadv4f32 addr:$src2), imm:$cc)),
           (VCMPPSrmi (v4f32 VR128:$src1), addr:$src2, imm:$cc)>;
-def : Pat<(v2i64 (X86cmpp (v2f64 VR128:$src1), VR128:$src2, imm:$cc)),
+def : Pat<(v2f64 (X86cmpp (v2f64 VR128:$src1), VR128:$src2, imm:$cc)),
           (VCMPPDrri VR128:$src1, VR128:$src2, imm:$cc)>;
-def : Pat<(v2i64 (X86cmpp (v2f64 VR128:$src1), (loadv2f64 addr:$src2), imm:$cc)),
+def : Pat<(v2f64 (X86cmpp (v2f64 VR128:$src1), (loadv2f64 addr:$src2), imm:$cc)),
           (VCMPPDrmi VR128:$src1, addr:$src2, imm:$cc)>;

-def : Pat<(v8i32 (X86cmpp (v8f32 VR256:$src1), VR256:$src2, imm:$cc)),
+def : Pat<(v8f32 (X86cmpp (v8f32 VR256:$src1), VR256:$src2, imm:$cc)),
           (VCMPPSYrri (v8f32 VR256:$src1), (v8f32 VR256:$src2), imm:$cc)>;
-def : Pat<(v8i32 (X86cmpp (v8f32 VR256:$src1), (loadv8f32 addr:$src2), imm:$cc)),
+def : Pat<(v8f32 (X86cmpp (v8f32 VR256:$src1), (loadv8f32 addr:$src2), imm:$cc)),
           (VCMPPSYrmi (v8f32 VR256:$src1), addr:$src2, imm:$cc)>;
-def : Pat<(v4i64 (X86cmpp (v4f64 VR256:$src1), VR256:$src2, imm:$cc)),
+def : Pat<(v4f64 (X86cmpp (v4f64 VR256:$src1), VR256:$src2, imm:$cc)),
           (VCMPPDYrri VR256:$src1, VR256:$src2, imm:$cc)>;
-def : Pat<(v4i64 (X86cmpp (v4f64 VR256:$src1), (loadv4f64 addr:$src2), imm:$cc)),
+def : Pat<(v4f64 (X86cmpp (v4f64 VR256:$src1), (loadv4f64 addr:$src2), imm:$cc)),
           (VCMPPDYrmi VR256:$src1, addr:$src2, imm:$cc)>;
 }

 let Predicates = [UseSSE1] in {
-def : Pat<(v4i32 (X86cmpp (v4f32 VR128:$src1), VR128:$src2, imm:$cc)),
+def : Pat<(v4f32 (X86cmpp (v4f32 VR128:$src1), VR128:$src2, imm:$cc)),
           (CMPPSrri (v4f32 VR128:$src1), (v4f32 VR128:$src2), imm:$cc)>;
-def : Pat<(v4i32 (X86cmpp (v4f32 VR128:$src1), (memopv4f32 addr:$src2), imm:$cc)),
+def : Pat<(v4f32 (X86cmpp (v4f32 VR128:$src1), (memopv4f32 addr:$src2), imm:$cc)),
           (CMPPSrmi (v4f32 VR128:$src1), addr:$src2, imm:$cc)>;
 }

 let Predicates = [UseSSE2] in {
-def : Pat<(v2i64 (X86cmpp (v2f64 VR128:$src1), VR128:$src2, imm:$cc)),
+def : Pat<(v2f64 (X86cmpp (v2f64 VR128:$src1), VR128:$src2, imm:$cc)),
           (CMPPDrri VR128:$src1, VR128:$src2, imm:$cc)>;
-def : Pat<(v2i64 (X86cmpp (v2f64 VR128:$src1), (memopv2f64 addr:$src2), imm:$cc)),
+def : Pat<(v2f64 (X86cmpp (v2f64 VR128:$src1), (memopv2f64 addr:$src2), imm:$cc)),
           (CMPPDrmi VR128:$src1, addr:$src2, imm:$cc)>;
 }

 //===----------------------------------------------------------------------===//
 // SSE 1 & 2 - Shuffle Instructions
 //===----------------------------------------------------------------------===//

 /// sse12_shuffle - sse 1 & 2 fp shuffle instructions

llvm/trunk/test/CodeGen/X86/sse1.ll

@@ entry: @@
   ret <4 x float> %a14
 }

 ; v4i32 isn't legal for SSE1, but this should be cmpps.

 define <4 x float> @PR28044(<4 x float> %a0, <4 x float> %a1) nounwind {
 ; CHECK-LABEL: PR28044:
 ; CHECK: # BB#0:
-; CHECK: movaps %xmm1, %xmm2
-; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[3,1,2,3]
-; CHECK-NEXT: movaps %xmm0, %xmm3
-; CHECK-NEXT: shufps {{.*#+}} xmm3 = xmm3[3,1,2,3]
-; CHECK-NEXT: ucomiss %xmm2, %xmm3
-; CHECK-NEXT: setnp %al
-; CHECK-NEXT: sete %cl
-; CHECK-NEXT: andb %al, %cl
-; CHECK-NEXT: movzbl %cl, %eax
-; CHECK-NEXT: shll $31, %eax
-; CHECK-NEXT: sarl $31, %eax
-; CHECK-NEXT: movl %eax,
-; CHECK-NEXT: movaps %xmm1, %xmm2
-; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[1,1,2,3]
-; CHECK-NEXT: movaps %xmm0, %xmm3
-; CHECK-NEXT: shufps {{.*#+}} xmm3 = xmm3[1,1,2,3]
-; CHECK-NEXT: ucomiss %xmm2, %xmm3
-; CHECK-NEXT: setnp %al
-; CHECK-NEXT: sete %cl
-; CHECK-NEXT: andb %al, %cl
-; CHECK-NEXT: movzbl %cl, %eax
-; CHECK-NEXT: shll $31, %eax
-; CHECK-NEXT: sarl $31, %eax
-; CHECK-NEXT: movl %eax,
-; CHECK-NEXT: ucomiss %xmm1, %xmm0
-; CHECK-NEXT: setnp %al
-; CHECK-NEXT: sete %cl
-; CHECK-NEXT: andb %al, %cl
-; CHECK-NEXT: movzbl %cl, %eax
-; CHECK-NEXT: shll $31, %eax
-; CHECK-NEXT: sarl $31, %eax
-; CHECK-NEXT: movl %eax,
-; CHECK-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,1,2,3]
-; CHECK-NEXT: shufps {{.*#+}} xmm0 = xmm0[2,1,2,3]
-; CHECK-NEXT: ucomiss %xmm1, %xmm0
-; CHECK-NEXT: setnp %al
-; CHECK-NEXT: sete %cl
-; CHECK-NEXT: andb %al, %cl
-; CHECK-NEXT: movzbl %cl, %eax
-; CHECK-NEXT: shll $31, %eax
-; CHECK-NEXT: sarl $31, %eax
-; CHECK-NEXT: movl %eax,
-; CHECK-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
-; CHECK-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
-; CHECK-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
-; CHECK-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
-; CHECK-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
-; CHECK-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
-; CHECK-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; CHECK-NEXT: cmpeqps %xmm1, %xmm0
+; CHECK-NEXT: ret
 ;
   %cmp = fcmp oeq <4 x float> %a0, %a1
   %sext = sext <4 x i1> %cmp to <4 x i32>
   %res = bitcast <4 x i32> %sext to <4 x float>
   ret <4 x float> %res
 }