This is an archive of the discontinued LLVM Phabricator instance.

[x86] invert a vector select IR canonicalization with a binop identity constant
ClosedPublic

Authored by spatel on Jan 31 2022, 12:50 PM.

Details

Summary

This is an intentionally limited/different form of D90113. That patch bravely tries to generalize folds where we pull a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...across types and targets. This is the inverse of the IR canonicalization discussed in D113442.
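The fold can be modeled with a minimal scalar sketch (hypothetical helper names; the real transform operates on SelectionDAG nodes, and for FP the identity constant for fadd is -0.0 rather than 0):

```python
def select(cond, tval, fval):
    """Lane-wise select, modeling one lane of ISD::VSELECT."""
    return tval if cond else fval

# Original (canonical IR) form: the binop executes unconditionally,
# with the identity constant (0 for add) in the "true" arm.
def before(cond, n0, fval):
    return n0 + select(cond, 0, fval)

# Folded form: the binop is pulled into the select arms, so the
# "true" path returns N0 unchanged.
def after(cond, n0, fval):
    return select(cond, n0, n0 + fval)

# Both forms agree for any inputs (integer case shown here).
for cond in (True, False):
    assert before(cond, 7, 5) == after(cond, 7, 5)
```

The payoff on AVX512 is that the folded form maps onto a masked operation instead of a separate blend.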

I'm not sure if this is even profitable within x86, so I'm only proposing to handle x86 vector fadd/fsub as a first step. The intent is to prevent AVX512 regressions as mentioned in D113442. Please look closely at the test diffs to confirm whether this is correct and better.

Diff Detail

Event Timeline

spatel created this revision.Jan 31 2022, 12:50 PM
spatel requested review of this revision.Jan 31 2022, 12:50 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 31 2022, 12:50 PM
xbolva00 added inline comments.Jan 31 2022, 12:57 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
48969

Target independent location with target hook?

spatel marked an inline comment as done.Jan 31 2022, 4:10 PM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
48969

Yes, that is the most likely outcome. I think it would be fine to start with just generic opcodes, and then we could extend it with target-specific opcodes as needed. That's similar to what we have for TLI.isCommutativeBinop() (used below here).

pengfei added inline comments.Feb 1 2022, 2:23 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
48974–48977

Do we need to consider the nsz case?
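Some context on why signed zero matters for this question: +0.0 is not a strict identity for fadd, because adding it flips the sign of negative zero, so only -0.0 can serve as the identity constant unless nsz is set. A small demonstration (assuming IEEE-754 round-to-nearest, which Python floats follow):

```python
import math

# -0.0 is the additive identity for fadd: x + (-0.0) == x for every x,
# including x == -0.0, whose sign is preserved.
assert math.copysign(1.0, -0.0 + -0.0) == -1.0  # result is still -0.0

# +0.0 is NOT a strict identity: -0.0 + 0.0 rounds to +0.0,
# so the sign of negative zero is lost.
assert math.copysign(1.0, -0.0 + 0.0) == 1.0

# Under nsz (no-signed-zeros), the two zeros are interchangeable,
# so the fold could accept either constant as the identity.
```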

llvm/test/CodeGen/X86/vector-bo-select.ll
696–697

Is the left code better?

704–705

The left seems better.

If we're targeting the D113442 regressions for 14.x I'd probably suggest we just limit this to AVX512 targets as a quick fix and iterate on it for 15.x.

llvm/test/CodeGen/X86/vector-bo-select.ll
697

This is definitely a regression - VBLENDVPS is a lot slower than VPAND - but the codegen is awful anyway - combineToExtendBoolVectorInReg is supposed to deal with this, but it obviously fails :(

705

Not a notable regression, but non-AVX512VL targets are going to see this change - do we care?

spatel marked an inline comment as done.Feb 1 2022, 5:12 AM

If we're targeting the D113442 regressions for 14.x I'd probably suggest we just limit this to AVX512 targets as a quick fix and iterate on it for 15.x.

I agree. I didn't restrict it in this first draft just to show that there are likely going to be several subtle opportunities/regressions that need to be dealt with. The blendv codegen with AVX2 is one.

For AVX512, I'm not sure exactly which flavors we want to include/exclude - should I add more/different RUN lines to the test file? Do we want more tests with 512-bit types?

For AVX512, I'm not sure exactly which flavors we want to include/exclude - should I add more/different RUN lines to the test file? Do we want more tests with 512-bit types?

I'd probably say float/double 512-bit vectors for AVX512 - and 128/256-bit vectors with AVX512VL

spatel marked an inline comment as done.Feb 1 2022, 8:39 AM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
48974–48977

In IR, it looks like we still create the -0.0 constant even with nsz, but yes, I suspect we will want to handle both forms of zero if we have NSZ.

We'll need more test coverage. I'll add another TODO for now.

spatel updated this revision to Diff 404948.Feb 1 2022, 8:43 AM
spatel marked an inline comment as done.

Patch updated:

  1. Added checks for AVX512 and AVX512VL.
  2. Added tests to show more diffs.
  3. Added TODO comments (there are many!).
spatel updated this revision to Diff 404950.Feb 1 2022, 8:46 AM

Patch updated:
The previous upload didn't have all of the TODO comments updated.

LuoYuanke accepted this revision.Feb 1 2022, 10:52 PM

LGTM. Thanks, Sanjay. I think we can combine more operators based on this patch.

This revision is now accepted and ready to land.Feb 1 2022, 10:52 PM
pengfei accepted this revision.Feb 2 2022, 4:54 AM

LGTM.

This revision was landed with ongoing or failed builds.Feb 2 2022, 5:18 AM
This revision was automatically updated to reflect the committed changes.

Please also backport this to the llvm 14 branch.