This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG][AArch64][X86] Move legalization of vector MULHS/MULHU from LegalizeDAG to LegalizeVectorOps
ClosedPublic

Authored by craig.topper on Nov 8 2018, 2:10 PM.


Details

Reviewers

RKSimon
efriedma
zatrazz
SjoerdMeijer
fhahn
t.p.northover

Commits

rG129d529ab3e9: [SelectionDAG][AArch64][X86] Move legalization of vector MULHS/MULHU from…
rL347902: [SelectionDAG][AArch64][X86] Move legalization of vector MULHS/MULHU from…

Summary

I believe we should be legalizing these with the rest of the vector binary operations. If any custom lowering is required for these nodes, this will give the DAG combine between LegalizeVectorOps and LegalizeDAG a chance to run on the custom code before constant build_vectors are lowered in LegalizeDAG.

Unfortunately, this regressed some AArch64 tests because that DAG combine is now combining (extract_subvector (build_vector)) into a narrower build_vector. This caused it to stop matching an isel pattern that was looking for a multiply and extract_subvector to generate u/smull2. Perhaps the MULHS/MULHU lowering should pick some new AArch64ISD node that can be matched directly to u/smull2?

X86 is showing some weird code too due to aggressive constant broadcast formation using GPR even when it doesn't remove the BUILD_VECTOR completely.
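For readers less familiar with these nodes: MULHS/MULHU return the high half of the double-width product of each lane, and they typically show up when a division by a constant is rewritten as a multiply and shift. The sketch below is a scalar C++ model of one i8 lane, not LLVM code; `mulhu8`, `udivBy192` and `udivBy192MatchesDivision` are hypothetical names. The multiplier 171 and the shift by 7 are chosen to be consistent with the `movl $171` and `movl $249` (which appears to encode -7 as an i8 shift amount) in the combine_vec_udiv_nonuniform4 checks, where lane 0 divides by i8 -64 (192 unsigned). How LLVM derives those constants is not shown in this review.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of ISD::MULHU on one i8 lane: the high 8 bits of the
// 16-bit unsigned product.
static uint8_t mulhu8(uint8_t a, uint8_t b) {
  unsigned product = static_cast<unsigned>(a) * static_cast<unsigned>(b);
  return static_cast<uint8_t>(product >> 8);
}

// Assumed illustration: unsigned divide by 192 as "multiply high by 171,
// then shift right by 7", i.e. (x * 171) >> 15.
static uint8_t udivBy192(uint8_t x) {
  return static_cast<uint8_t>(mulhu8(x, 171) >> 7);
}

// Exhaustively compare the multiply-high form against a real division.
static bool udivBy192MatchesDivision() {
  for (unsigned x = 0; x <= 255; ++x) {
    if (udivBy192(static_cast<uint8_t>(x)) != x / 192)
      return false;
  }
  return true;
}
```

Because the input is only 8 bits wide, the check can cover every possible value, so the equivalence here is verified rather than assumed.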

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision.Nov 8 2018, 2:10 PM

Herald added subscribers: kristof.beyls, javed.absar.Nov 8 2018, 2:10 PM

If at all possible the X86 BROADCAST issue needs fixing first.

RKSimon added inline comments.Nov 9 2018, 11:21 AM

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
669 ↗	(On Diff #173224)	On AVX512 shouldn't these be using the broadcast load fold?

Handle AArch64 expansion with isel patterns instead of custom lowering. This prevents DAG combine from seeing the extract+build_vector opportunity.
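The expansion that the new isel patterns encode can be sketched in scalar C++ for the v8i16 MULHS case. This is a model under stated assumptions, not the AArch64 backend: all type and function names are hypothetical, and little-endian lane order is assumed, so that after the bitcast from v4i32 to v8i16 each odd i16 lane holds the high half of an i32 product. Under that assumption, UZP2 (which gathers the odd lanes of both inputs) collects exactly the high halves.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using V8i16 = std::array<int16_t, 8>;
using V4i32 = std::array<int32_t, 4>;

// SMULL: widening signed multiply of the low four lanes.
static V4i32 smull(const V8i16 &a, const V8i16 &b) {
  V4i32 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  return r;
}

// SMULL2: widening signed multiply of the high four lanes.
static V4i32 smull2(const V8i16 &a, const V8i16 &b) {
  V4i32 r{};
  for (int i = 0; i < 4; ++i)
    r[i] = static_cast<int32_t>(a[i + 4]) * static_cast<int32_t>(b[i + 4]);
  return r;
}

// High 16 bits of a 32-bit value, reinterpreted as a signed lane.
static int16_t high16(int32_t v) {
  uint16_t bits = static_cast<uint16_t>(static_cast<uint32_t>(v) >> 16);
  int16_t lane;
  std::memcpy(&lane, &bits, sizeof(lane));
  return lane;
}

// UZP2 on the two products viewed as v8i16: odd lanes of the first input,
// then odd lanes of the second (high halves, given little-endian lanes).
static V8i16 uzp2(const V4i32 &lo, const V4i32 &hi) {
  V8i16 r{};
  for (int i = 0; i < 4; ++i) {
    r[i] = high16(lo[i]);
    r[i + 4] = high16(hi[i]);
  }
  return r;
}

static V8i16 mulhsViaPatterns(const V8i16 &a, const V8i16 &b) {
  return uzp2(smull(a, b), smull2(a, b));
}

// Independent reference: floor(a * b / 2^16) per lane.
static int16_t mulhsReference(int16_t a, int16_t b) {
  int32_t p = static_cast<int32_t>(a) * static_cast<int32_t>(b);
  int32_t q = (p >= 0) ? p / 65536 : -((-p + 65535) / 65536);
  return static_cast<int16_t>(q);
}

static bool matchesReference(const V8i16 &a, const V8i16 &b) {
  V8i16 got = mulhsViaPatterns(a, b);
  for (int i = 0; i < 8; ++i)
    if (got[i] != mulhsReference(a[i], b[i]))
      return false;
  return true;
}
```

The unsigned case would follow the same shape with UMULL/UMULL2 and unsigned lane types.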

Add a one-use check to lowerShuffleVectorAsBroadcast in X86. This seems to rein in the aggressive broadcast formation. We now seem to be sharing constant pool entries a little better.
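As a toy restatement of that condition (not LLVM's SelectionDAG API; `ToyNode`, `Opcode` and `canReuseScalarForBroadcast` are hypothetical names), the check only lets a BUILD_VECTOR feed a scalar broadcast when the broadcast would be its sole user. The assumed rationale, taken from the summary, is that with other users the full vector still has to be materialized, so broadcasting from a scalar adds work without removing the BUILD_VECTOR.

```cpp
// Minimal stand-ins for the node kinds involved in the condition.
enum class Opcode { BuildVector, ScalarToVector, Other };

struct ToyNode {
  Opcode opcode;
  unsigned numUses; // how many other nodes consume this value
};

// Mirrors the shape of the updated condition in the X86 diff:
//   (BUILD_VECTOR && hasOneUse) || (SCALAR_TO_VECTOR && BroadcastIdx == 0)
static bool canReuseScalarForBroadcast(const ToyNode &v, int broadcastIdx) {
  bool soleUseBuildVector =
      v.opcode == Opcode::BuildVector && v.numUses == 1;
  bool scalarToVectorLane0 =
      v.opcode == Opcode::ScalarToVector && broadcastIdx == 0;
  return soleUseBuildVector || scalarToVectorLane0;
}
```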

Harbormaster completed remote builds in B24847: Diff 173525.Nov 10 2018, 12:29 PM

RKSimon added inline comments.Nov 12 2018, 3:33 AM

test/CodeGen/X86/combine-udiv.ll
726 ↗	(On Diff #173525)	It's a little odd that this hasn't been simplified to a VPSHUFB (or at least had the xmm2 zero register removed; VPPERM can generate its own zero elements). At that point this wouldn't be XOP specific at all.

craig.topper added inline comments.Nov 12 2018, 10:35 AM

test/CodeGen/X86/combine-udiv.ll
726 ↗	(On Diff #173525)	It looks like a packus was turned into a vpperm before some zero_extend_vector_inreg and mul constant folding kicked in later in the same DAG combine round. So at the time the VPPERM was created, it wasn't known that its input was zero. This all occurred in the DAG combine between vector op legalization and DAG legalization. The vpperm itself didn't get revisited after the input became 0 due to a bitcast. But the last DAG combine after legalize DAG should have revisited it. Should we have been able to recombine the VPPERM to a VPPERM with zero controls?

Rebase after r346697

craig.topper added inline comments.Nov 12 2018, 12:00 PM

test/CodeGen/X86/combine-udiv.ll
709 ↗	(On Diff #173726)	This is zero extending a truncate implemented with packuswb which means the upper bits before the truncate were already zero. So that's pretty silly. I'll file a bug.
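A scalar sketch of why that zero extend is redundant, assuming the standard PACKUSWB behavior of saturating signed 16-bit lanes to unsigned 8 bits (function names are hypothetical): PACKUSWB only acts as a plain truncate when the upper byte of each lane is already zero, and in that case zero extending the packed byte gives back the original lane.

```cpp
#include <cstdint>

// One lane of PACKUSWB: signed 16-bit input, unsigned 8-bit output,
// with saturation to the range [0, 255].
static uint8_t packusLane(int16_t v) {
  if (v < 0)
    return 0;
  if (v > 255)
    return 255;
  return static_cast<uint8_t>(v);
}

// A plain truncate of one 16-bit lane to 8 bits.
static uint8_t truncLane(uint16_t v) {
  return static_cast<uint8_t>(v & 0xFF);
}

// For every lane value whose upper byte is zero, the pack equals a
// truncate and zext(pack(v)) reproduces v, so the extend adds nothing.
static bool zextOfPackIsIdentity() {
  for (int v = 0; v <= 255; ++v) {
    uint8_t packed = packusLane(static_cast<int16_t>(v));
    if (packed != truncLane(static_cast<uint16_t>(v)))
      return false;
    if (static_cast<uint16_t>(packed) != static_cast<uint16_t>(v))
      return false;
  }
  return true;
}
```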

RKSimon added inline comments.Nov 13 2018, 3:21 AM

test/CodeGen/X86/combine-udiv.ll
726 ↗	(On Diff #173525)	Yes, either VPPERM with zero or just a VPSHUFB (it'd be unary by then). The current codegen avoids the variable shuffles completely, which is better still.

craig.topper mentioned this in D54278: [SelectionDAG] Teach getNode to constant fold SIGN/ZERO/ANY_EXTEND_VECTOR_INREG.Nov 19 2018, 2:27 PM

Ping

LGTM

This revision is now accepted and ready to land.Nov 29 2018, 3:39 AM

Closed by commit rL347902: [SelectionDAG][AArch64][X86] Move legalization of vector MULHS/MULHU from… (authored by ctopper).Nov 29 2018, 11:39 AM

This revision was automatically updated to reflect the committed changes.

efriedma mentioned this in D69831: [AArch64] Re-add patterns for (s/u)mull2..Nov 4 2019, 3:58 PM

efriedma mentioned this in rG35cf9a1fc5d2: [AArch64] Re-add patterns for (s/u)mull2..Nov 6 2019, 12:30 PM

Revision Contents

Path	Size

llvm/
  trunk/
    lib/
      CodeGen/
        SelectionDAG/
          LegalizeVectorOps.cpp	2 lines
      Target/
        AArch64/
          AArch64ISelLowering.cpp	67 lines
          AArch64InstrInfo.td	54 lines
        X86/
          X86ISelLowering.cpp	2 lines
    test/
      CodeGen/
        X86/
          combine-sdiv.ll	3 lines
          combine-udiv.ll	48 lines
          urem-seteq-vec-nonsplat.ll	12 lines

Diff 175914

llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp

Show First 20 Lines • Show All 324 Lines • ▼ Show 20 Lines	case ISD::STRICT_FTRUNC:
// is also legal, but if ISD::FSQRT requires expansion then so does		// is also legal, but if ISD::FSQRT requires expansion then so does
// ISD::STRICT_FSQRT.		// ISD::STRICT_FSQRT.
Action = TLI.getStrictFPOperationAction(Node->getOpcode(),		Action = TLI.getStrictFPOperationAction(Node->getOpcode(),
Node->getValueType(0));		Node->getValueType(0));
break;		break;
case ISD::ADD:		case ISD::ADD:
case ISD::SUB:		case ISD::SUB:
case ISD::MUL:		case ISD::MUL:
		case ISD::MULHS:
		case ISD::MULHU:
case ISD::SDIV:		case ISD::SDIV:
case ISD::UDIV:		case ISD::UDIV:
case ISD::SREM:		case ISD::SREM:
case ISD::UREM:		case ISD::UREM:
case ISD::SDIVREM:		case ISD::SDIVREM:
case ISD::UDIVREM:		case ISD::UDIVREM:
case ISD::FADD:		case ISD::FADD:
case ISD::FSUB:		case ISD::FSUB:
▲ Show 20 Lines • Show All 898 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp


Show First 20 Lines • Show All 708 Lines • ▼ Show 20 Lines	if (Subtarget->hasNEON()) {
setOperationAction(ISD::ANY_EXTEND, MVT::v4i32, Legal);		setOperationAction(ISD::ANY_EXTEND, MVT::v4i32, Legal);
setTruncStoreAction(MVT::v2i32, MVT::v2i16, Expand);		setTruncStoreAction(MVT::v2i32, MVT::v2i16, Expand);
// Likewise, narrowing and extending vector loads/stores aren't handled		// Likewise, narrowing and extending vector loads/stores aren't handled
// directly.		// directly.
for (MVT VT : MVT::vector_valuetypes()) {		for (MVT VT : MVT::vector_valuetypes()) {
setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Expand);		setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Expand);

if (VT == MVT::v16i8 \|\| VT == MVT::v8i16 \|\| VT == MVT::v4i32) {		if (VT == MVT::v16i8 \|\| VT == MVT::v8i16 \|\| VT == MVT::v4i32) {
setOperationAction(ISD::MULHS, VT, Custom);		setOperationAction(ISD::MULHS, VT, Legal);
setOperationAction(ISD::MULHU, VT, Custom);		setOperationAction(ISD::MULHU, VT, Legal);
} else {		} else {
setOperationAction(ISD::MULHS, VT, Expand);		setOperationAction(ISD::MULHS, VT, Expand);
setOperationAction(ISD::MULHU, VT, Expand);		setOperationAction(ISD::MULHU, VT, Expand);
}		}
setOperationAction(ISD::SMUL_LOHI, VT, Expand);		setOperationAction(ISD::SMUL_LOHI, VT, Expand);
setOperationAction(ISD::UMUL_LOHI, VT, Expand);		setOperationAction(ISD::UMUL_LOHI, VT, Expand);

setOperationAction(ISD::BSWAP, VT, Expand);		setOperationAction(ISD::BSWAP, VT, Expand);
▲ Show 20 Lines • Show All 1,938 Lines • ▼ Show 20 Lines	static SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) {
EVT Op1VT = Op1.getValueType();		EVT Op1VT = Op1.getValueType();
return DAG.getNode(N0->getOpcode(), DL, VT,		return DAG.getNode(N0->getOpcode(), DL, VT,
DAG.getNode(NewOpc, DL, VT,		DAG.getNode(NewOpc, DL, VT,
DAG.getNode(ISD::BITCAST, DL, Op1VT, N00), Op1),		DAG.getNode(ISD::BITCAST, DL, Op1VT, N00), Op1),
DAG.getNode(NewOpc, DL, VT,		DAG.getNode(NewOpc, DL, VT,
DAG.getNode(ISD::BITCAST, DL, Op1VT, N01), Op1));		DAG.getNode(ISD::BITCAST, DL, Op1VT, N01), Op1));
}		}

// Lower vector multiply high (ISD::MULHS and ISD::MULHU).
static SDValue LowerMULH(SDValue Op, SelectionDAG &DAG) {
// Multiplications are only custom-lowered for 128-bit vectors so that
// {S,U}MULL{2} can be detected. Otherwise v2i64 multiplications are not
// legal.
EVT VT = Op.getValueType();
assert(VT.is128BitVector() && VT.isInteger() &&
"unexpected type for custom-lowering ISD::MULH{U,S}");

SDValue V0 = Op.getOperand(0);
SDValue V1 = Op.getOperand(1);

SDLoc DL(Op);

EVT ExtractVT = VT.getHalfNumVectorElementsVT(*DAG.getContext());

// We turn (V0 mulhs/mulhu V1) to:
//
// (uzp2 (smull (extract_subvector (ExtractVT V128:V0, (i64 0)),
// (extract_subvector (ExtractVT V128:V1, (i64 0))))),
// (smull (extract_subvector (ExtractVT V128:V0, (i64 VMull2Idx)),
// (extract_subvector (ExtractVT V128:V1, (i64 VMull2Idx))))))
//
// Where ExtractVT is a subvector with half the number of elements, and
// VMull2Idx is the index of the middle element (the high part).
//
// The vector high part extract and multiply will be matched against
// {S,U}MULL{v16i8_v8i16,v8i16_v4i32,v4i32_v2i64} which in turn will
// issue a {s,u}mull2 instruction.
//
// This basically multiplies the lower subvector with '{s,u}mull', the high
// subvector with '{s,u}mull2', and shuffles the high part of both results
// into the resulting vector.
unsigned Mull2VectorIdx = VT.getVectorNumElements () / 2;
SDValue VMullIdx = DAG.getConstant(0, DL, MVT::i64);
SDValue VMull2Idx = DAG.getConstant(Mull2VectorIdx, DL, MVT::i64);

SDValue VMullV0 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V0, VMullIdx);
SDValue VMullV1 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V1, VMullIdx);

SDValue VMull2V0 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V0, VMull2Idx);
SDValue VMull2V1 =
DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V1, VMull2Idx);

unsigned MullOpc = Op.getOpcode() == ISD::MULHS ? AArch64ISD::SMULL
: AArch64ISD::UMULL;

EVT MullVT = ExtractVT.widenIntegerVectorElementType(*DAG.getContext());
SDValue Mull = DAG.getNode(MullOpc, DL, MullVT, VMullV0, VMullV1);
SDValue Mull2 = DAG.getNode(MullOpc, DL, MullVT, VMull2V0, VMull2V1);

Mull = DAG.getNode(ISD::BITCAST, DL, VT, Mull);
Mull2 = DAG.getNode(ISD::BITCAST, DL, VT, Mull2);

return DAG.getNode(AArch64ISD::UZP2, DL, VT, Mull, Mull2);
}

SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,		SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();		unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
SDLoc dl(Op);		SDLoc dl(Op);
switch (IntNo) {		switch (IntNo) {
default: return SDValue(); // Don't custom lower most intrinsics.		default: return SDValue(); // Don't custom lower most intrinsics.
case Intrinsic::thread_pointer: {		case Intrinsic::thread_pointer: {
EVT PtrVT = getPointerTy(DAG.getDataLayout());		EVT PtrVT = getPointerTy(DAG.getDataLayout());
▲ Show 20 Lines • Show All 186 Lines • ▼ Show 20 Lines	SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
case ISD::FP_TO_UINT:		case ISD::FP_TO_UINT:
return LowerFP_TO_INT(Op, DAG);		return LowerFP_TO_INT(Op, DAG);
case ISD::FSINCOS:		case ISD::FSINCOS:
return LowerFSINCOS(Op, DAG);		return LowerFSINCOS(Op, DAG);
case ISD::FLT_ROUNDS_:		case ISD::FLT_ROUNDS_:
return LowerFLT_ROUNDS_(Op, DAG);		return LowerFLT_ROUNDS_(Op, DAG);
case ISD::MUL:		case ISD::MUL:
return LowerMUL(Op, DAG);		return LowerMUL(Op, DAG);
case ISD::MULHS:
case ISD::MULHU:
return LowerMULH(Op, DAG);
case ISD::INTRINSIC_WO_CHAIN:		case ISD::INTRINSIC_WO_CHAIN:
return LowerINTRINSIC_WO_CHAIN(Op, DAG);		return LowerINTRINSIC_WO_CHAIN(Op, DAG);
case ISD::STORE:		case ISD::STORE:
return LowerSTORE(Op, DAG);		return LowerSTORE(Op, DAG);
case ISD::VECREDUCE_ADD:		case ISD::VECREDUCE_ADD:
case ISD::VECREDUCE_SMAX:		case ISD::VECREDUCE_SMAX:
case ISD::VECREDUCE_SMIN:		case ISD::VECREDUCE_SMIN:
case ISD::VECREDUCE_UMAX:		case ISD::VECREDUCE_UMAX:
▲ Show 20 Lines • Show All 8,893 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.td


Show First 20 Lines • Show All 4,102 Lines • ▼ Show 20 Lines	def : Pat<(v2i64 (opnode (v2i32 V64:$Rn), (v2i32 V64:$Rm))),
(INST2S V64:$Rn, V64:$Rm)>;		(INST2S V64:$Rn, V64:$Rm)>;
}		}

defm : Neon_mul_widen_patterns<AArch64smull, SMULLv8i8_v8i16,		defm : Neon_mul_widen_patterns<AArch64smull, SMULLv8i8_v8i16,
SMULLv4i16_v4i32, SMULLv2i32_v2i64>;		SMULLv4i16_v4i32, SMULLv2i32_v2i64>;
defm : Neon_mul_widen_patterns<AArch64umull, UMULLv8i8_v8i16,		defm : Neon_mul_widen_patterns<AArch64umull, UMULLv8i8_v8i16,
UMULLv4i16_v4i32, UMULLv2i32_v2i64>;		UMULLv4i16_v4i32, UMULLv2i32_v2i64>;

// Patterns for smull2/umull2.
multiclass Neon_mul_high_patterns<SDPatternOperator opnode,
Instruction INST8B, Instruction INST4H, Instruction INST2S> {
def : Pat<(v8i16 (opnode (extract_high_v16i8 V128:$Rn),
(extract_high_v16i8 V128:$Rm))),
(INST8B V128:$Rn, V128:$Rm)>;
def : Pat<(v4i32 (opnode (extract_high_v8i16 V128:$Rn),
(extract_high_v8i16 V128:$Rm))),
(INST4H V128:$Rn, V128:$Rm)>;
def : Pat<(v2i64 (opnode (extract_high_v4i32 V128:$Rn),
(extract_high_v4i32 V128:$Rm))),
(INST2S V128:$Rn, V128:$Rm)>;
}

defm : Neon_mul_high_patterns<AArch64smull, SMULLv16i8_v8i16,
SMULLv8i16_v4i32, SMULLv4i32_v2i64>;
defm : Neon_mul_high_patterns<AArch64umull, UMULLv16i8_v8i16,
UMULLv8i16_v4i32, UMULLv4i32_v2i64>;

// Additional patterns for SMLAL/SMLSL and UMLAL/UMLSL		// Additional patterns for SMLAL/SMLSL and UMLAL/UMLSL
multiclass Neon_mulacc_widen_patterns<SDPatternOperator opnode,		multiclass Neon_mulacc_widen_patterns<SDPatternOperator opnode,
Instruction INST8B, Instruction INST4H, Instruction INST2S> {		Instruction INST8B, Instruction INST4H, Instruction INST2S> {
def : Pat<(v8i16 (opnode (v8i16 V128:$Rd), (v8i8 V64:$Rn), (v8i8 V64:$Rm))),		def : Pat<(v8i16 (opnode (v8i16 V128:$Rd), (v8i8 V64:$Rn), (v8i8 V64:$Rm))),
(INST8B V128:$Rd, V64:$Rn, V64:$Rm)>;		(INST8B V128:$Rd, V64:$Rn, V64:$Rm)>;
def : Pat<(v4i32 (opnode (v4i32 V128:$Rd), (v4i16 V64:$Rn), (v4i16 V64:$Rm))),		def : Pat<(v4i32 (opnode (v4i32 V128:$Rd), (v4i16 V64:$Rn), (v4i16 V64:$Rm))),
(INST4H V128:$Rd, V64:$Rn, V64:$Rm)>;		(INST4H V128:$Rd, V64:$Rn, V64:$Rm)>;
def : Pat<(v2i64 (opnode (v2i64 V128:$Rd), (v2i32 V64:$Rn), (v2i32 V64:$Rm))),		def : Pat<(v2i64 (opnode (v2i64 V128:$Rd), (v2i32 V64:$Rn), (v2i32 V64:$Rm))),
▲ Show 20 Lines • Show All 1,847 Lines • ▼ Show 20 Lines

// To truncate, we can simply extract from a subregister.		// To truncate, we can simply extract from a subregister.
def : Pat<(i32 (trunc GPR64sp:$src)),		def : Pat<(i32 (trunc GPR64sp:$src)),
(i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;		(i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;

// __builtin_trap() uses the BRK instruction on AArch64.		// __builtin_trap() uses the BRK instruction on AArch64.
def : Pat<(trap), (BRK 1)>;		def : Pat<(trap), (BRK 1)>;

		// Multiply high patterns which multiply the lower subvector using smull/umull
		// and the upper subvector with smull2/umull2. Then shuffle the high part of
		// both results together.
		def : Pat<(v16i8 (mulhs V128:$Rn, V128:$Rm)),
		(UZP2v16i8
		(SMULLv8i8_v8i16 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(SMULLv16i8_v8i16 V128:$Rn, V128:$Rm))>;
		def : Pat<(v8i16 (mulhs V128:$Rn, V128:$Rm)),
		(UZP2v8i16
		(SMULLv4i16_v4i32 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(SMULLv8i16_v4i32 V128:$Rn, V128:$Rm))>;
		def : Pat<(v4i32 (mulhs V128:$Rn, V128:$Rm)),
		(UZP2v4i32
		(SMULLv2i32_v2i64 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(SMULLv4i32_v2i64 V128:$Rn, V128:$Rm))>;

		def : Pat<(v16i8 (mulhu V128:$Rn, V128:$Rm)),
		(UZP2v16i8
		(UMULLv8i8_v8i16 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(UMULLv16i8_v8i16 V128:$Rn, V128:$Rm))>;
		def : Pat<(v8i16 (mulhu V128:$Rn, V128:$Rm)),
		(UZP2v8i16
		(UMULLv4i16_v4i32 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(UMULLv8i16_v4i32 V128:$Rn, V128:$Rm))>;
		def : Pat<(v4i32 (mulhu V128:$Rn, V128:$Rm)),
		(UZP2v4i32
		(UMULLv2i32_v2i64 (EXTRACT_SUBREG V128:$Rn, dsub),
		(EXTRACT_SUBREG V128:$Rm, dsub)),
		(UMULLv4i32_v2i64 V128:$Rn, V128:$Rm))>;

// Conversions within AdvSIMD types in the same register size are free.		// Conversions within AdvSIMD types in the same register size are free.
// But because we need a consistent lane ordering, in big endian many		// But because we need a consistent lane ordering, in big endian many
// conversions require one or more REV instructions.		// conversions require one or more REV instructions.
//		//
// Consider a simple memory load followed by a bitconvert then a store.		// Consider a simple memory load followed by a bitconvert then a store.
// v0 = load v2i32		// v0 = load v2i32
// v1 = BITCAST v2i32 v0 to v4i16		// v1 = BITCAST v2i32 v0 to v4i16
// store v4i16 v2		// store v4i16 v2
▲ Show 20 Lines • Show All 728 Lines • Show Last 20 Lines

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


Show First 20 Lines • Show All 11,523 Lines • ▼ Show 20 Lines	if (SDValue TruncBroadcast = lowerVectorShuffleAsTruncBroadcast(
return TruncBroadcast;		return TruncBroadcast;

MVT BroadcastVT = VT;		MVT BroadcastVT = VT;

// Peek through any bitcast (only useful for loads).		// Peek through any bitcast (only useful for loads).
SDValue BC = peekThroughBitcasts(V);		SDValue BC = peekThroughBitcasts(V);

// Also check the simpler case, where we can directly reuse the scalar.		// Also check the simpler case, where we can directly reuse the scalar.
if (V.getOpcode() == ISD::BUILD_VECTOR \|\|		if ((V.getOpcode() == ISD::BUILD_VECTOR && V.hasOneUse()) \|\|
(V.getOpcode() == ISD::SCALAR_TO_VECTOR && BroadcastIdx == 0)) {		(V.getOpcode() == ISD::SCALAR_TO_VECTOR && BroadcastIdx == 0)) {
V = V.getOperand(BroadcastIdx);		V = V.getOperand(BroadcastIdx);

// If we can't broadcast from a register, check that the input is a load.		// If we can't broadcast from a register, check that the input is a load.
if (!BroadcastFromReg && !isShuffleFoldableLoad(V))		if (!BroadcastFromReg && !isShuffleFoldableLoad(V))
return SDValue();		return SDValue();
} else if (MayFoldLoad(BC) && !cast<LoadSDNode>(BC)->isVolatile()) {		} else if (MayFoldLoad(BC) && !cast<LoadSDNode>(BC)->isVolatile()) {
// 32-bit targets need to load i64 as a f64 and then bitcast the result.		// 32-bit targets need to load i64 as a f64 and then bitcast the result.
▲ Show 20 Lines • Show All 30,692 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/combine-sdiv.ll

	Show First 20 Lines • Show All 3,140 Lines • ▼ Show 20 Lines
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	;			;
	; XOP-LABEL: pr38658:			; XOP-LABEL: pr38658:
	; XOP: # %bb.0:			; XOP: # %bb.0:
	; XOP-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; XOP-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; XOP-NEXT: vpmovsxbw %xmm1, %xmm1			; XOP-NEXT: vpmovsxbw %xmm1, %xmm1
	; XOP-NEXT: vpmullw {{.*}}(%rip), %xmm1, %xmm1			; XOP-NEXT: vpmullw {{.*}}(%rip), %xmm1, %xmm1
				; XOP-NEXT: vpsrlw $8, %xmm1, %xmm1
	; XOP-NEXT: vpxor %xmm2, %xmm2, %xmm2			; XOP-NEXT: vpxor %xmm2, %xmm2, %xmm2
	; XOP-NEXT: vpperm {{.*#+}} xmm1 = xmm2[1,3,5,7,9,11,13,15],xmm1[1,3,5,7,9,11,13,15]			; XOP-NEXT: vpackuswb %xmm1, %xmm2, %xmm1
	; XOP-NEXT: vpaddb %xmm0, %xmm1, %xmm0			; XOP-NEXT: vpaddb %xmm0, %xmm1, %xmm0
	; XOP-NEXT: vpshab {{.*}}(%rip), %xmm0, %xmm1			; XOP-NEXT: vpshab {{.*}}(%rip), %xmm0, %xmm1
	; XOP-NEXT: vpshlb {{.*}}(%rip), %xmm0, %xmm0			; XOP-NEXT: vpshlb {{.*}}(%rip), %xmm0, %xmm0
	; XOP-NEXT: vpand {{.*}}(%rip), %xmm0, %xmm0			; XOP-NEXT: vpand {{.*}}(%rip), %xmm0, %xmm0
	; XOP-NEXT: vpaddb %xmm0, %xmm1, %xmm0			; XOP-NEXT: vpaddb %xmm0, %xmm1, %xmm0
	; XOP-NEXT: retq			; XOP-NEXT: retq
	%1 = sdiv <16 x i8> %x, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 7>			%1 = sdiv <16 x i8> %x, <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 7>
	ret <16 x i8> %1			ret <16 x i8> %1
	Show All 19 Lines

llvm/trunk/test/CodeGen/X86/combine-udiv.ll

	Show First 20 Lines • Show All 641 Lines • ▼ Show 20 Lines
	}			}

	define <16 x i8> @combine_vec_udiv_nonuniform4(<16 x i8> %x) {			define <16 x i8> @combine_vec_udiv_nonuniform4(<16 x i8> %x) {
	; SSE2-LABEL: combine_vec_udiv_nonuniform4:			; SSE2-LABEL: combine_vec_udiv_nonuniform4:
	; SSE2: # %bb.0:			; SSE2: # %bb.0:
	; SSE2-NEXT: pxor %xmm1, %xmm1			; SSE2-NEXT: pxor %xmm1, %xmm1
	; SSE2-NEXT: movdqa %xmm0, %xmm2			; SSE2-NEXT: movdqa %xmm0, %xmm2
	; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]			; SSE2-NEXT: punpcklbw {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1],xmm2[2],xmm1[2],xmm2[3],xmm1[3],xmm2[4],xmm1[4],xmm2[5],xmm1[5],xmm2[6],xmm1[6],xmm2[7],xmm1[7]
	; SSE2-NEXT: movl $255, %eax
	; SSE2-NEXT: movd %eax, %xmm1
	; SSE2-NEXT: movl $171, %eax			; SSE2-NEXT: movl $171, %eax
	; SSE2-NEXT: movd %eax, %xmm3			; SSE2-NEXT: movd %eax, %xmm1
	; SSE2-NEXT: pand %xmm1, %xmm3			; SSE2-NEXT: pmullw %xmm2, %xmm1
	; SSE2-NEXT: pmullw %xmm2, %xmm3			; SSE2-NEXT: psrlw $8, %xmm1
	; SSE2-NEXT: psrlw $8, %xmm3			; SSE2-NEXT: pmullw {{.*}}(%rip), %xmm1
	; SSE2-NEXT: pmullw {{.*}}(%rip), %xmm3			; SSE2-NEXT: psrlw $8, %xmm1
	; SSE2-NEXT: psrlw $8, %xmm3			; SSE2-NEXT: movl $255, %eax
	; SSE2-NEXT: pand %xmm1, %xmm3			; SSE2-NEXT: movd %eax, %xmm2
				; SSE2-NEXT: pand %xmm1, %xmm2
	; SSE2-NEXT: pand {{.*}}(%rip), %xmm0			; SSE2-NEXT: pand {{.*}}(%rip), %xmm0
	; SSE2-NEXT: por %xmm3, %xmm0			; SSE2-NEXT: por %xmm2, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE41-LABEL: combine_vec_udiv_nonuniform4:			; SSE41-LABEL: combine_vec_udiv_nonuniform4:
	; SSE41: # %bb.0:			; SSE41: # %bb.0:
	; SSE41-NEXT: movdqa %xmm0, %xmm1			; SSE41-NEXT: movdqa %xmm0, %xmm1
	; SSE41-NEXT: movl $171, %eax			; SSE41-NEXT: movl $171, %eax
	; SSE41-NEXT: movd %eax, %xmm0			; SSE41-NEXT: movd %eax, %xmm0
	; SSE41-NEXT: pmovzxbw {{.*#+}} xmm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero			; SSE41-NEXT: pmovzxbw {{.*#+}} xmm2 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
	; SSE41-NEXT: pmullw %xmm0, %xmm2			; SSE41-NEXT: pmullw %xmm0, %xmm2
	; SSE41-NEXT: psrlw $8, %xmm2			; SSE41-NEXT: psrlw $8, %xmm2
	; SSE41-NEXT: movdqa %xmm2, %xmm0			; SSE41-NEXT: movdqa %xmm2, %xmm0
	; SSE41-NEXT: psllw $8, %xmm0			; SSE41-NEXT: psllw $1, %xmm0
	; SSE41-NEXT: pxor %xmm3, %xmm3			; SSE41-NEXT: psllw $8, %xmm2
	; SSE41-NEXT: packuswb %xmm3, %xmm2			; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm0[0],xmm2[1,2,3,4,5,6,7]
	; SSE41-NEXT: pmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
	; SSE41-NEXT: psllw $1, %xmm2
	; SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm2[0],xmm0[1,2,3,4,5,6,7]
	; SSE41-NEXT: psrlw $8, %xmm2			; SSE41-NEXT: psrlw $8, %xmm2
	; SSE41-NEXT: packuswb %xmm0, %xmm2			; SSE41-NEXT: packuswb %xmm0, %xmm2
	; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
	; SSE41-NEXT: pblendvb %xmm0, %xmm1, %xmm2			; SSE41-NEXT: pblendvb %xmm0, %xmm1, %xmm2
	; SSE41-NEXT: movdqa %xmm2, %xmm0			; SSE41-NEXT: movdqa %xmm2, %xmm0
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX1-LABEL: combine_vec_udiv_nonuniform4:			; AVX1-LABEL: combine_vec_udiv_nonuniform4:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: movl $171, %eax			; AVX1-NEXT: movl $171, %eax
	; AVX1-NEXT: vmovd %eax, %xmm1			; AVX1-NEXT: vmovd %eax, %xmm1
	; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero			; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
	; AVX1-NEXT: vpmullw %xmm1, %xmm2, %xmm1			; AVX1-NEXT: vpmullw %xmm1, %xmm2, %xmm1
	; AVX1-NEXT: vpsrlw $8, %xmm1, %xmm1			; AVX1-NEXT: vpsrlw $8, %xmm1, %xmm1
	; AVX1-NEXT: vpsllw $8, %xmm1, %xmm2			; AVX1-NEXT: vpsllw $1, %xmm1, %xmm2
	; AVX1-NEXT: vpxor %xmm3, %xmm3, %xmm3			; AVX1-NEXT: vpsllw $8, %xmm1, %xmm1
	; AVX1-NEXT: vpackuswb %xmm3, %xmm1, %xmm1			; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm2[0],xmm1[1,2,3,4,5,6,7]
	; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
	; AVX1-NEXT: vpsllw $1, %xmm1, %xmm1
	; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm2[1,2,3,4,5,6,7]
	; AVX1-NEXT: vpsrlw $8, %xmm1, %xmm1			; AVX1-NEXT: vpsrlw $8, %xmm1, %xmm1
	; AVX1-NEXT: vpackuswb %xmm0, %xmm1, %xmm1			; AVX1-NEXT: vpackuswb %xmm0, %xmm1, %xmm1
	; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
	; AVX1-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm0			; AVX1-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: combine_vec_udiv_nonuniform4:			; AVX2-LABEL: combine_vec_udiv_nonuniform4:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	Show All 13 Lines
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; XOP-LABEL: combine_vec_udiv_nonuniform4:			; XOP-LABEL: combine_vec_udiv_nonuniform4:
	; XOP: # %bb.0:			; XOP: # %bb.0:
	; XOP-NEXT: movl $171, %eax			; XOP-NEXT: movl $171, %eax
	; XOP-NEXT: vmovd %eax, %xmm1			; XOP-NEXT: vmovd %eax, %xmm1
	; XOP-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero			; XOP-NEXT: vpmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
	; XOP-NEXT: vpmullw %xmm1, %xmm2, %xmm2			; XOP-NEXT: vpmullw %xmm1, %xmm2, %xmm1
	; XOP-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[2,3,0,1]			; XOP-NEXT: vpsrlw $8, %xmm1, %xmm1
	; XOP-NEXT: vpmovzxbw {{.*#+}} xmm3 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero			; XOP-NEXT: vpxor %xmm2, %xmm2, %xmm2
	; XOP-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]			; XOP-NEXT: vpackuswb %xmm2, %xmm1, %xmm1
	; XOP-NEXT: vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
	; XOP-NEXT: vpmullw %xmm1, %xmm3, %xmm1
	; XOP-NEXT: vpperm {{.*#+}} xmm1 = xmm2[1,3,5,7,9,11,13,15],xmm1[1,3,5,7,9,11,13,15]
	; XOP-NEXT: movl $249, %eax			; XOP-NEXT: movl $249, %eax
	; XOP-NEXT: vmovd %eax, %xmm2			; XOP-NEXT: vmovd %eax, %xmm2
	; XOP-NEXT: vpshlb %xmm2, %xmm1, %xmm1			; XOP-NEXT: vpshlb %xmm2, %xmm1, %xmm1
	; XOP-NEXT: vmovdqa {{.*#+}} xmm2 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; XOP-NEXT: vmovdqa {{.*#+}} xmm2 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
	; XOP-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm0			; XOP-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm0
	; XOP-NEXT: retq			; XOP-NEXT: retq
	%div = udiv <16 x i8> %x, <i8 -64, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>			%div = udiv <16 x i8> %x, <i8 -64, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
	ret <16 x i8> %div			ret <16 x i8> %div
	▲ Show 20 Lines • Show All 80 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/urem-seteq-vec-nonsplat.ll

	Show First 20 Lines • Show All 623 Lines • ▼ Show 20 Lines
	; CHECK-SSE41-NEXT: pmulld {{.*}}(%rip), %xmm1			; CHECK-SSE41-NEXT: pmulld {{.*}}(%rip), %xmm1
	; CHECK-SSE41-NEXT: psubd %xmm1, %xmm0			; CHECK-SSE41-NEXT: psubd %xmm1, %xmm0
	; CHECK-SSE41-NEXT: pcmpeqd {{.*}}(%rip), %xmm0			; CHECK-SSE41-NEXT: pcmpeqd {{.*}}(%rip), %xmm0
	; CHECK-SSE41-NEXT: psrld $31, %xmm0			; CHECK-SSE41-NEXT: psrld $31, %xmm0
	; CHECK-SSE41-NEXT: retq			; CHECK-SSE41-NEXT: retq
	;			;
	; CHECK-AVX1-LABEL: test_urem_both:			; CHECK-AVX1-LABEL: test_urem_both:
	; CHECK-AVX1: # %bb.0:			; CHECK-AVX1: # %bb.0:
	; CHECK-AVX1-NEXT: vmovddup {{.*#+}} xmm1 = [-9.255967385052751E+61,-9.255967385052751E+61]			; CHECK-AVX1-NEXT: vmovdqa {{.*#+}} xmm1 = [2863311531,3435973837,2863311531,3435973837]
	; CHECK-AVX1-NEXT: # xmm1 = mem[0,0]			; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
				; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]
				; CHECK-AVX1-NEXT: vpmuludq %xmm2, %xmm3, %xmm2
	; CHECK-AVX1-NEXT: vpmuludq %xmm1, %xmm0, %xmm1			; CHECK-AVX1-NEXT: vpmuludq %xmm1, %xmm0, %xmm1
	; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]			; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
	; CHECK-AVX1-NEXT: vpmuludq {{.*}}(%rip), %xmm2, %xmm2
	; CHECK-AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]			; CHECK-AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm2[2,3],xmm1[4,5],xmm2[6,7]
	; CHECK-AVX1-NEXT: vpsrld $2, %xmm1, %xmm1			; CHECK-AVX1-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1			; CHECK-AVX1-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpsubd %xmm1, %xmm0, %xmm0			; CHECK-AVX1-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX1-NEXT: vpcmpeqd {{.*}}(%rip), %xmm0, %xmm0			; CHECK-AVX1-NEXT: vpcmpeqd {{.*}}(%rip), %xmm0, %xmm0
	; CHECK-AVX1-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX1-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX1-NEXT: retq			; CHECK-AVX1-NEXT: retq
	;			;
	; CHECK-AVX2-LABEL: test_urem_both:			; CHECK-AVX2-LABEL: test_urem_both:
	; CHECK-AVX2: # %bb.0:			; CHECK-AVX2: # %bb.0:
	; CHECK-AVX2-NEXT: vpbroadcastq {{.*#+}} xmm1 = [14757395262689946283,14757395262689946283]			; CHECK-AVX2-NEXT: vmovdqa {{.*#+}} xmm1 = [2863311531,3435973837,2863311531,3435973837]
	; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]			; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
	; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]			; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]
	; CHECK-AVX2-NEXT: vpmuludq %xmm2, %xmm3, %xmm2			; CHECK-AVX2-NEXT: vpmuludq %xmm2, %xmm3, %xmm2
	; CHECK-AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm1			; CHECK-AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm1
	; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]			; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; CHECK-AVX2-NEXT: vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]			; CHECK-AVX2-NEXT: vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
	; CHECK-AVX2-NEXT: vpsrld $2, %xmm1, %xmm1			; CHECK-AVX2-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1			; CHECK-AVX2-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpsubd %xmm1, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpcmpeqd {{.*}}(%rip), %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpcmpeqd {{.*}}(%rip), %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: retq			; CHECK-AVX2-NEXT: retq
	;			;
	; CHECK-AVX512VL-LABEL: test_urem_both:			; CHECK-AVX512VL-LABEL: test_urem_both:
	; CHECK-AVX512VL: # %bb.0:			; CHECK-AVX512VL: # %bb.0:
	; CHECK-AVX512VL-NEXT: vpbroadcastq {{.*#+}} xmm1 = [14757395262689946283,14757395262689946283]			; CHECK-AVX512VL-NEXT: vmovdqa {{.*#+}} xmm1 = [2863311531,3435973837,2863311531,3435973837]
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]			; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]			; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm2, %xmm3, %xmm2			; CHECK-AVX512VL-NEXT: vpmuludq %xmm2, %xmm3, %xmm2
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm1, %xmm0, %xmm1			; CHECK-AVX512VL-NEXT: vpmuludq %xmm1, %xmm0, %xmm1
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]			; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; CHECK-AVX512VL-NEXT: vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]			; CHECK-AVX512VL-NEXT: vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
	; CHECK-AVX512VL-NEXT: vpsrld $2, %xmm1, %xmm1			; CHECK-AVX512VL-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1			; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1
	▲ Show 20 Lines • Show All 240 Lines • Show Last 20 Lines