This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil
ClosedPublic

Authored by mike.dvoretsky on Jun 12 2018, 2:59 AM.

Details

Reviewers

craig.topper
spatel
RKSimon

Commits

rG8393f907176c: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil
rL335039: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil

Summary

Currently, X86 floor and ceil intrinsics for vectors are implemented as target-specific intrinsics that use the generic rounding instruction of the corresponding vector processing feature (ROUND* or VRNDSCALE*). This patch replaces those specific cases with calls to target-independent @llvm.floor.* and @llvm.ceil.* intrinsics. This doesn't affect the resulting machine code, as those intrinsics are lowered to the same instructions, but exposes these specific rounding cases to generic optimizations.

This patch relies on D45203 to fold the resulting IR patterns and serves as an alternative to D45202.
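
The cases the patch targets can be recognized from the immediate operands alone. Below is a minimal standalone sketch of that classification, assuming (as the check in the diff does) that a rounding-control immediate of 1 means floor, 2 means ceil, and that 4 is the only accepted SAE value. `RoundKind` and `classifyRound` are hypothetical names used for illustration; they are not LLVM APIs.

```cpp
// Hypothetical names for illustration only; not part of LLVM.
enum class RoundKind { None, Floor, Ceil };

// Mirrors the condition used in the patch: the call qualifies only when the
// SAE operand is 4 and the rounding-control immediate is exactly 1 (floor)
// or 2 (ceil). Every other combination is left untouched.
constexpr RoundKind classifyRound(unsigned RoundControl, unsigned SAE) {
  return SAE != 4            ? RoundKind::None
         : RoundControl == 1 ? RoundKind::Floor
         : RoundControl == 2 ? RoundKind::Ceil
                             : RoundKind::None;
}

// Compile-time checks of the mapping described above.
static_assert(classifyRound(1, 4) == RoundKind::Floor, "1 selects floor");
static_assert(classifyRound(2, 4) == RoundKind::Ceil, "2 selects ceil");
static_assert(classifyRound(10, 4) == RoundKind::None, "other modes are kept");
```

The last check corresponds to the existing `i32 10` tests in x86-sse41.ll, where the target-specific call is expected to stay in place.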

Diff Detail

Repository: rL LLVM

Event Timeline

mike.dvoretsky created this revision. Jun 12 2018, 2:59 AM

Herald added subscribers: llvm-commits, hiraditya. Jun 12 2018, 2:59 AM

mike.dvoretsky mentioned this in D45202: [X86] Replacing X86-specific floor and ceil vector intrinsics with generic LLVM intrinsics. Jun 12 2018, 3:00 AM

craig.topper added inline comments. Jun 12 2018, 10:27 AM

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
582 ↗	(On Diff #150911)	Why int and getSExtValue? Feels like it should be unsigned and getZExtValue.
586 ↗	(On Diff #150911)	Not sure if we should assume the rounding mode or SAE is a constant int. The clang frontend guarantees it, but handcrafted IR tests could break it.
619 ↗	(On Diff #150911)	Why would MaskTy vary here? It's fixed by the intrinsic isn't it?
621 ↗	(On Diff #150911)	There's an overload of CreateAnd that accepts a uint64_t for RHS, so you can probably just pass 1 here without creating the ConstantInt yourself.
652 ↗	(On Diff #150911)	For less than 8, you should cast i8 to v8i1 first then extract the subvector. We don't want i4 and i2 types floating around.
2575 ↗	(On Diff #150911)	Why are we calling simplifyX86Round nested under SimplifyDemandedVectorElts? Simplifying should be completely orthogonal. If a simplification happened and the round intrinsic still exists, InstCombine will revisit here and SimplifyDemandedVectorElts will return nullptr. No need to try to handle every case in one shot.
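
The mask handling asked for in the comment on line 652 can be sketched outside LLVM as follows. This is only a model of the resulting IR pattern (bitcast the i8 mask to eight one-bit lanes, keep the low lanes, select per lane); `maskedFloor` is a hypothetical helper, and the bit-to-lane order assumes the little-endian layout used on x86.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; models the IR pattern, not an LLVM API.
// The 8-bit mask is viewed as 8 one-bit lanes (the <8 x i1> bitcast). Only the
// low Src.size() lanes are used (the subvector extract); the rest are ignored.
std::vector<double> maskedFloor(const std::vector<double> &Src,
                                const std::vector<double> &Dst,
                                std::uint8_t Mask) {
  assert(Src.size() == Dst.size() && Src.size() <= 8);
  std::vector<double> Out(Src.size());
  for (std::size_t I = 0; I != Src.size(); ++I) {
    bool Lane = ((Mask >> I) & 1u) != 0;         // lane I of the <8 x i1> view
    Out[I] = Lane ? std::floor(Src[I]) : Dst[I]; // per-lane select
  }
  return Out;
}
```

With a four-element source, bits 4 to 7 of the mask have no effect, which is why no i4 or i2 integer type is needed at any point.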

I think where we ultimately want to end up is to remove the masking from the packed intrinsics and replace the scalar intrinsics with versions that use f32/f64 as their types. The IR would then look similar to where we're trying to end up for things like sqrt, but instead of a target-independent intrinsic we would have a target-specific intrinsic. All the masking and insert/extract would be completely separate from the operation itself. That would greatly simplify the InstCombine code here, because you would just need to trade out the target-specific intrinsic for the floor/ceil intrinsic without having to worry about anything else. Thoughts?
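
To illustrate what keeping the masking and insert/extract separate from the operation looks like, here is a small standalone model of the masked scalar case. It follows the operand roles used in the x86-avx512.ll tests of this revision (src0 supplies the upper lanes, src1 the element to round, dst the passthrough, k the mask). `V4` and `maskedScalarFloor` are hypothetical names, and `std::floor` stands in for whichever scalar operation would be used.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

using V4 = std::array<float, 4>; // stand-in for <4 x float>

// Hypothetical model for illustration; each step is independent of the others.
V4 maskedScalarFloor(const V4 &Src0, const V4 &Src1, const V4 &Dst,
                     std::uint8_t K) {
  float Elt = Src1[0];                // extract the low element
  float Rounded = std::floor(Elt);    // the operation itself, on a plain float
  bool Bit = (K & 1u) != 0;           // only bit 0 of the mask matters
  float Low = Bit ? Rounded : Dst[0]; // select between result and passthrough
  V4 Out = Src0;                      // upper lanes come from the first operand
  Out[0] = Low;                       // insert the low element
  return Out;
}
```

Only the second step depends on which rounding operation is used, so swapping the operation would leave the surrounding extract, select, and insert untouched.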

Changes made per comments. Note that zext IR instructions have been fully excluded from all patterns, which will require altering vec_floor.ll tests in D45203 if this revision is accepted.

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
586 ↗	(On Diff #150911)	Rounding and SAE are immediate. Not using constants here leads to isel failure. Not sure this code should accept such explicitly incorrect cases.

lebedev.ri added a subscriber: lebedev.ri. Jun 14 2018, 4:34 AM

lebedev.ri added inline comments.

llvm/test/Transforms/InstCombine/X86/x86-avx.ll
34 ↗	(On Diff #151327)	ceil, not deil
llvm/test/Transforms/InstCombine/X86/x86-sse41.ll
7 ↗	(On Diff #151327)	Can you regenerate this file right now in trunk (and commit), so this noise is gone?

Yes, a non-constant value will fail isel and print a readable error message. But if you assume a constant in InstCombine by just using “cast” and the operand isn't one, we'll get a segmentation fault in release builds before we get to isel. This is a much worse experience for users. So I suggest just ignoring any intrinsics with non-constant operands here and letting isel throw the error.
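
The difference between an unchecked and a checked downcast can be modelled without LLVM. In the sketch below, `Value`, `ConstantInt`, and `Argument` are stand-in types rather than the LLVM classes, and `getImmediate` is a hypothetical helper. C++ `dynamic_cast` plays the role of `dyn_cast`: it yields a null pointer on a type mismatch, so the caller can bail out instead of dereferencing an object of the wrong type.

```cpp
#include <cstdint>

// Stand-in types for illustration only; these are not the LLVM classes.
struct Value {
  virtual ~Value() = default;
};
struct ConstantInt : Value {
  std::uint64_t Val;
  explicit ConstantInt(std::uint64_t V) : Val(V) {}
};
struct Argument : Value {}; // models a non-constant operand

// Checked downcast: reports failure for a non-constant operand so the
// transform can simply be skipped, leaving the error to a later stage.
bool getImmediate(const Value *Op, std::uint64_t &Out) {
  if (const auto *C = dynamic_cast<const ConstantInt *>(Op)) {
    Out = C->Val;
    return true;
  }
  return false; // caller should return without transforming
}
```

This is the shape of the `dyn_cast<ConstantInt>` followed by `if (!Arg) return nullptr;` that appears in the committed version of `simplifyX86round`.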

Fixed the typo in the test name and added checks to make the transform stop if the rounding mode immediate and/or SAE are not constant.

LGTM

This revision is now accepted and ready to land. Jun 14 2018, 12:31 PM

mike.dvoretsky mentioned this in D45203: [X86] VRNDSCALE* folding from masked and scalar ffloor and fceil patterns. Jun 15 2018, 1:26 AM

Closed by commit rL335039: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil (authored by mike.dvoretsky). Jun 19 2018, 3:54 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp	130 lines
llvm/trunk/test/Transforms/InstCombine/X86/x86-avx.ll	41 lines
llvm/trunk/test/Transforms/InstCombine/X86/x86-avx512.ll	207 lines
llvm/trunk/test/Transforms/InstCombine/X86/x86-sse41.ll	44 lines

Diff 151891

llvm/trunk/lib/Transforms/InstCombine/InstCombineCalls.cpp

... 570 lines not shown ...	for (unsigned Elt = 0; Elt != NumDstEltsPerLane; ++Elt) {

Vals.push_back(ConstantInt::get(ResTy->getScalarType(), Val));		Vals.push_back(ConstantInt::get(ResTy->getScalarType(), Val));
}		}
}		}

return ConstantVector::get(Vals);		return ConstantVector::get(Vals);
}		}

		// Replace X86-specific intrinsics with generic floor-ceil where applicable.
		static Value *simplifyX86round(IntrinsicInst &II,
		                               InstCombiner::BuilderTy &Builder) {
		  ConstantInt *Arg = nullptr;
		  Intrinsic::ID IntrinsicID = II.getIntrinsicID();

		  if (IntrinsicID == Intrinsic::x86_sse41_round_ss ||
		      IntrinsicID == Intrinsic::x86_sse41_round_sd)
		    Arg = dyn_cast<ConstantInt>(II.getArgOperand(2));
		  else if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		           IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd)
		    Arg = dyn_cast<ConstantInt>(II.getArgOperand(4));
		  else
		    Arg = dyn_cast<ConstantInt>(II.getArgOperand(1));
		  if (!Arg)
		    return nullptr;
		  unsigned RoundControl = Arg->getZExtValue();

		  Arg = nullptr;
		  unsigned SAE = 0;
		  if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_512 ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_512)
		    Arg = dyn_cast<ConstantInt>(II.getArgOperand(4));
		  else if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		           IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd)
		    Arg = dyn_cast<ConstantInt>(II.getArgOperand(5));
		  else
		    SAE = 4;
		  if (!SAE) {
		    if (!Arg)
		      return nullptr;
		    SAE = Arg->getZExtValue();
		  }

		  if (SAE != 4 || (RoundControl != 2 /*ceil*/ && RoundControl != 1 /*floor*/))
		    return nullptr;

		  Value *Src, *Dst, *Mask;
		  bool IsScalar = false;
		  if (IntrinsicID == Intrinsic::x86_sse41_round_ss ||
		      IntrinsicID == Intrinsic::x86_sse41_round_sd ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		    IsScalar = true;
		    if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		      Mask = II.getArgOperand(3);
		      Value *Zero = Constant::getNullValue(Mask->getType());
		      Mask = Builder.CreateAnd(Mask, 1);
		      Mask = Builder.CreateICmp(ICmpInst::ICMP_NE, Mask, Zero);
		      Dst = II.getArgOperand(2);
		    } else
		      Dst = II.getArgOperand(0);
		    Src = Builder.CreateExtractElement(II.getArgOperand(1), (uint64_t)0);
		  } else {
		    Src = II.getArgOperand(0);
		    if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_128 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_256 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_512 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_128 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_256 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_512) {
		      Dst = II.getArgOperand(2);
		      Mask = II.getArgOperand(3);
		    } else {
		      Dst = Src;
		      Mask = ConstantInt::getAllOnesValue(
		          Builder.getIntNTy(Src->getType()->getVectorNumElements()));
		    }
		  }

		  Intrinsic::ID ID = (RoundControl == 2) ? Intrinsic::ceil : Intrinsic::floor;
		  Value *Res = Builder.CreateIntrinsic(ID, {Src}, &II);
		  if (!IsScalar) {
		    if (auto *C = dyn_cast<Constant>(Mask))
		      if (C->isAllOnesValue())
		        return Res;
		    auto *MaskTy = VectorType::get(
		        Builder.getInt1Ty(), cast<IntegerType>(Mask->getType())->getBitWidth());
		    Mask = Builder.CreateBitCast(Mask, MaskTy);
		    unsigned Width = Src->getType()->getVectorNumElements();
		    if (MaskTy->getVectorNumElements() > Width) {
		      uint32_t Indices[4];
		      for (unsigned i = 0; i != Width; ++i)
		        Indices[i] = i;
		      Mask = Builder.CreateShuffleVector(Mask, Mask,
		                                         makeArrayRef(Indices, Width));
		    }
		    return Builder.CreateSelect(Mask, Res, Dst);
		  }
		  if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		    Dst = Builder.CreateExtractElement(Dst, (uint64_t)0);
		    Res = Builder.CreateSelect(Mask, Res, Dst);
		    Dst = II.getArgOperand(0);
		  }
		  return Builder.CreateInsertElement(Dst, Res, (uint64_t)0);
		}

static Value *simplifyX86movmsk(const IntrinsicInst &II) {		static Value *simplifyX86movmsk(const IntrinsicInst &II) {
Value *Arg = II.getArgOperand(0);		Value *Arg = II.getArgOperand(0);
Type *ResTy = II.getType();		Type *ResTy = II.getType();
Type *ArgTy = Arg->getType();		Type *ArgTy = Arg->getType();

// movmsk(undef) -> zero as we must ensure the upper bits are zero.		// movmsk(undef) -> zero as we must ensure the upper bits are zero.
if (isa<UndefValue>(Arg))		if (isa<UndefValue>(Arg))
return Constant::getNullValue(ResTy);		return Constant::getNullValue(ResTy);
... 1,630 lines not shown ...	case Intrinsic::x86_avx512_cvttsd2usi64: {
unsigned VWidth = Arg->getType()->getVectorNumElements();		unsigned VWidth = Arg->getType()->getVectorNumElements();
if (Value *V = SimplifyDemandedVectorEltsLow(Arg, VWidth, 1)) {		if (Value *V = SimplifyDemandedVectorEltsLow(Arg, VWidth, 1)) {
II->setArgOperand(0, V);		II->setArgOperand(0, V);
return II;		return II;
}		}
break;		break;
}		}

		case Intrinsic::x86_sse41_round_ps:
		case Intrinsic::x86_sse41_round_pd:
		case Intrinsic::x86_avx_round_ps_256:
		case Intrinsic::x86_avx_round_pd_256:
		case Intrinsic::x86_avx512_mask_rndscale_ps_128:
		case Intrinsic::x86_avx512_mask_rndscale_ps_256:
		case Intrinsic::x86_avx512_mask_rndscale_ps_512:
		case Intrinsic::x86_avx512_mask_rndscale_pd_128:
		case Intrinsic::x86_avx512_mask_rndscale_pd_256:
		case Intrinsic::x86_avx512_mask_rndscale_pd_512:
		case Intrinsic::x86_avx512_mask_rndscale_ss:
		case Intrinsic::x86_avx512_mask_rndscale_sd:
		if (Value *V = simplifyX86round(*II, Builder))
		return replaceInstUsesWith(*II, V);
		break;

case Intrinsic::x86_mmx_pmovmskb:		case Intrinsic::x86_mmx_pmovmskb:
case Intrinsic::x86_sse_movmsk_ps:		case Intrinsic::x86_sse_movmsk_ps:
case Intrinsic::x86_sse2_movmsk_pd:		case Intrinsic::x86_sse2_movmsk_pd:
case Intrinsic::x86_sse2_pmovmskb_128:		case Intrinsic::x86_sse2_pmovmskb_128:
case Intrinsic::x86_avx_movmsk_pd_256:		case Intrinsic::x86_avx_movmsk_pd_256:
case Intrinsic::x86_avx_movmsk_ps_256:		case Intrinsic::x86_avx_movmsk_ps_256:
case Intrinsic::x86_avx2_pmovmskb:		case Intrinsic::x86_avx2_pmovmskb:
if (Value *V = simplifyX86movmsk(*II))		if (Value *V = simplifyX86movmsk(*II))
... 200 lines not shown ...	Instruction *InstCombiner::visitCallInst(CallInst &CI) {
case Intrinsic::x86_fma_vfmadd_ss:		case Intrinsic::x86_fma_vfmadd_ss:
case Intrinsic::x86_fma_vfmadd_sd:		case Intrinsic::x86_fma_vfmadd_sd:
case Intrinsic::x86_sse_cmp_ss:		case Intrinsic::x86_sse_cmp_ss:
case Intrinsic::x86_sse_min_ss:		case Intrinsic::x86_sse_min_ss:
case Intrinsic::x86_sse_max_ss:		case Intrinsic::x86_sse_max_ss:
case Intrinsic::x86_sse2_cmp_sd:		case Intrinsic::x86_sse2_cmp_sd:
case Intrinsic::x86_sse2_min_sd:		case Intrinsic::x86_sse2_min_sd:
case Intrinsic::x86_sse2_max_sd:		case Intrinsic::x86_sse2_max_sd:
case Intrinsic::x86_sse41_round_ss:
case Intrinsic::x86_sse41_round_sd:
case Intrinsic::x86_xop_vfrcz_ss:		case Intrinsic::x86_xop_vfrcz_ss:
case Intrinsic::x86_xop_vfrcz_sd: {		case Intrinsic::x86_xop_vfrcz_sd: {
unsigned VWidth = II->getType()->getVectorNumElements();		unsigned VWidth = II->getType()->getVectorNumElements();
APInt UndefElts(VWidth, 0);		APInt UndefElts(VWidth, 0);
APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));		APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {		if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
if (V != II)		if (V != II)
return replaceInstUsesWith(*II, V);		return replaceInstUsesWith(*II, V);
return II;		return II;
}		}
break;		break;
}		}
		case Intrinsic::x86_sse41_round_ss:
		case Intrinsic::x86_sse41_round_sd: {
		unsigned VWidth = II->getType()->getVectorNumElements();
		APInt UndefElts(VWidth, 0);
		APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
		if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
		if (V != II)
		return replaceInstUsesWith(*II, V);
		return II;
		} else if (Value *V = simplifyX86round(*II, Builder))
		return replaceInstUsesWith(*II, V);
		break;
		}

// Constant fold ashr( <A x Bi>, Ci ).		// Constant fold ashr( <A x Bi>, Ci ).
// Constant fold lshr( <A x Bi>, Ci ).		// Constant fold lshr( <A x Bi>, Ci ).
// Constant fold shl( <A x Bi>, Ci ).		// Constant fold shl( <A x Bi>, Ci ).
case Intrinsic::x86_sse2_psrai_d:		case Intrinsic::x86_sse2_psrai_d:
case Intrinsic::x86_sse2_psrai_w:		case Intrinsic::x86_sse2_psrai_w:
case Intrinsic::x86_avx2_psrai_d:		case Intrinsic::x86_avx2_psrai_d:
case Intrinsic::x86_avx2_psrai_w:		case Intrinsic::x86_avx2_psrai_w:
... 1,910 lines not shown ...

llvm/trunk/test/Transforms/InstCombine/X86/x86-avx.ll

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -instcombine -S | FileCheck %s

				declare <8 x float> @llvm.x86.avx.round.ps.256(<8 x float>, i32)
				declare <4 x double> @llvm.x86.avx.round.pd.256(<4 x double>, i32)

				define <8 x float> @test_round_ps_floor(<8 x float> %a) {
				; CHECK-LABEL: @test_round_ps_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.floor.v8f32(<8 x float> [[A:%.*]])
				; CHECK-NEXT: ret <8 x float> [[TMP1]]
				;
				%1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a, i32 1)
				ret <8 x float> %1
				}

				define <8 x float> @test_round_ps_ceil(<8 x float> %a) {
				; CHECK-LABEL: @test_round_ps_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.ceil.v8f32(<8 x float> [[A:%.*]])
				; CHECK-NEXT: ret <8 x float> [[TMP1]]
				;
				%1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a, i32 2)
				ret <8 x float> %1
				}

				define <4 x double> @test_round_pd_floor(<4 x double> %a) {
				; CHECK-LABEL: @test_round_pd_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.floor.v4f64(<4 x double> [[A:%.*]])
				; CHECK-NEXT: ret <4 x double> [[TMP1]]
				;
				%1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a, i32 1)
				ret <4 x double> %1
				}

				define <4 x double> @test_round_pd_ceil(<4 x double> %a) {
				; CHECK-LABEL: @test_round_pd_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.ceil.v4f64(<4 x double> [[A:%.*]])
				; CHECK-NEXT: ret <4 x double> [[TMP1]]
				;
				%1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a, i32 2)
				ret <4 x double> %1
				}

llvm/trunk/test/Transforms/InstCombine/X86/x86-avx512.ll

	... 910 lines not shown ...
	declare i64 @llvm.x86.avx512.vcvtss2usi64(<4 x float>, i32)			declare i64 @llvm.x86.avx512.vcvtss2usi64(<4 x float>, i32)
	declare i32 @llvm.x86.avx512.cvttss2usi(<4 x float>, i32)			declare i32 @llvm.x86.avx512.cvttss2usi(<4 x float>, i32)
	declare i64 @llvm.x86.avx512.cvttss2usi64(<4 x float>, i32)			declare i64 @llvm.x86.avx512.cvttss2usi64(<4 x float>, i32)
	declare i32 @llvm.x86.avx512.vcvtsd2usi32(<2 x double>, i32)			declare i32 @llvm.x86.avx512.vcvtsd2usi32(<2 x double>, i32)
	declare i64 @llvm.x86.avx512.vcvtsd2usi64(<2 x double>, i32)			declare i64 @llvm.x86.avx512.vcvtsd2usi64(<2 x double>, i32)
	declare i32 @llvm.x86.avx512.cvttsd2usi(<2 x double>, i32)			declare i32 @llvm.x86.avx512.cvttsd2usi(<2 x double>, i32)
	declare i64 @llvm.x86.avx512.cvttsd2usi64(<2 x double>, i32)			declare i64 @llvm.x86.avx512.cvttsd2usi64(<2 x double>, i32)

				declare <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32, i32)
				declare <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double>, <2 x double>, <2 x double>, i8, i32, i32)
				declare <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float>, i32, <4 x float>, i8)
				declare <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float>, i32, <8 x float>, i8)
				declare <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float>, i32, <16 x float>, i16, i32)
				declare <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double>, i32, <2 x double>, i8)
				declare <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double>, i32, <4 x double>, i8)
				declare <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double>, i32, <8 x double>, i8, i32)

				define <4 x float> @test_rndscale_ss_floor(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ss_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call float @llvm.floor.f32(float [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x float> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], float [[TMP5]], float [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> [[SRC0:%.*]], float [[TMP6]], i64 0
				; CHECK-NEXT: ret <4 x float> [[TMP7]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k, i32 1, i32 4)
				ret <4 x float> %1
				}

				define <4 x float> @test_rndscale_ss_ceil(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ss_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call float @llvm.ceil.f32(float [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x float> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], float [[TMP5]], float [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> [[SRC0:%.*]], float [[TMP6]], i64 0
				; CHECK-NEXT: ret <4 x float> [[TMP7]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k, i32 2, i32 4)
				ret <4 x float> %1
				}

				define <2 x double> @test_rndscale_sd_floor(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_sd_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call double @llvm.floor.f64(double [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], double [[TMP5]], double [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[SRC0:%.*]], double [[TMP6]], i64 0
				; CHECK-NEXT: ret <2 x double> [[TMP7]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k, i32 1, i32 4)
				ret <2 x double> %1
				}

				define <2 x double> @test_rndscale_sd_ceil(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_sd_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call double @llvm.ceil.f64(double [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], double [[TMP5]], double [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[SRC0:%.*]], double [[TMP6]], i64 0
				; CHECK-NEXT: ret <2 x double> [[TMP7]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k, i32 2, i32 4)
				ret <2 x double> %1
				}

				define <4 x float> @test_rndscale_ps_128_floor(<4 x float> %src, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_128_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x float> @llvm.floor.v4f32(<4 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x float> [[TMP1]], <4 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x float> [[TMP4]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float> %src, i32 1, <4 x float> %dst, i8 %k)
				ret <4 x float> %1
				}

				define <4 x float> @test_rndscale_ps_128_ceil(<4 x float> %src, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_128_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x float> @llvm.ceil.v4f32(<4 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x float> [[TMP1]], <4 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x float> [[TMP4]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float> %src, i32 2, <4 x float> %dst, i8 %k)
				ret <4 x float> %1
				}

				define <8 x float> @test_rndscale_ps_256_floor(<8 x float> %src, <8 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_256_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.floor.v8f32(<8 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x float> [[TMP1]], <8 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x float> [[TMP3]]
				;
				%1 = call <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float> %src, i32 1, <8 x float> %dst, i8 %k)
				ret <8 x float> %1
				}

				define <8 x float> @test_rndscale_ps_256_ceil(<8 x float> %src, <8 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_256_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.ceil.v8f32(<8 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x float> [[TMP1]], <8 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x float> [[TMP3]]
				;
				%1 = call <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float> %src, i32 2, <8 x float> %dst, i8 %k)
				ret <8 x float> %1
				}

				define <16 x float> @test_rndscale_ps_512_floor(<16 x float> %src, <16 x float> %dst, i16 %k) {
				; CHECK-LABEL: @test_rndscale_ps_512_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <16 x float> @llvm.floor.v16f32(<16 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i16 [[K:%.*]] to <16 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[TMP2]], <16 x float> [[TMP1]], <16 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <16 x float> [[TMP3]]
				;
				%1 = call <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float> %src, i32 1, <16 x float> %dst, i16 %k, i32 4)
				ret <16 x float> %1
				}

				define <16 x float> @test_rndscale_ps_512_ceil(<16 x float> %src, <16 x float> %dst, i16 %k) {
				; CHECK-LABEL: @test_rndscale_ps_512_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <16 x float> @llvm.ceil.v16f32(<16 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i16 [[K:%.*]] to <16 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[TMP2]], <16 x float> [[TMP1]], <16 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <16 x float> [[TMP3]]
				;
				%1 = call <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float> %src, i32 2, <16 x float> %dst, i16 %k, i32 4)
				ret <16 x float> %1
				}

				define <2 x double> @test_rndscale_pd_128_floor(<2 x double> %src, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_128_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <2 x double> @llvm.floor.v2f64(<2 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <2 x i32> <i32 0, i32 1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP3]], <2 x double> [[TMP1]], <2 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <2 x double> [[TMP4]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double> %src, i32 1, <2 x double> %dst, i8 %k)
				ret <2 x double> %1
				}

				define <2 x double> @test_rndscale_pd_128_ceil(<2 x double> %src, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_128_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <2 x double> @llvm.ceil.v2f64(<2 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <2 x i32> <i32 0, i32 1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP3]], <2 x double> [[TMP1]], <2 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <2 x double> [[TMP4]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double> %src, i32 2, <2 x double> %dst, i8 %k)
				ret <2 x double> %1
				}

				define <4 x double> @test_rndscale_pd_256_floor(<4 x double> %src, <4 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_256_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.floor.v4f64(<4 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x double> [[TMP1]], <4 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x double> [[TMP4]]
				;
				%1 = call <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double> %src, i32 1, <4 x double> %dst, i8 %k)
				ret <4 x double> %1
				}

				define <4 x double> @test_rndscale_pd_256_ceil(<4 x double> %src, <4 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_256_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.ceil.v4f64(<4 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <8 x i1> [[TMP2]], <8 x i1> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x double> [[TMP1]], <4 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x double> [[TMP4]]
				;
				%1 = call <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double> %src, i32 2, <4 x double> %dst, i8 %k)
				ret <4 x double> %1
				}

				define <8 x double> @test_rndscale_pd_512_floor(<8 x double> %src, <8 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_512_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x double> @llvm.floor.v8f64(<8 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x double> [[TMP1]], <8 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x double> [[TMP3]]
				;
				%1 = call <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double> %src, i32 1, <8 x double> %dst, i8 %k, i32 4)
				ret <8 x double> %1
				}

				define <8 x double> @test_rndscale_pd_512_ceil(<8 x double> %src, <8 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_512_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x double> @llvm.ceil.v8f64(<8 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x double> [[TMP1]], <8 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x double> [[TMP3]]
				;
				%1 = call <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double> %src, i32 2, <8 x double> %dst, i8 %k, i32 4)
				ret <8 x double> %1
				}

	declare <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32)			declare <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32)

	define <4 x float> @test_mask_vfmadd_ss(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {			define <4 x float> @test_mask_vfmadd_ss(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
	; CHECK-LABEL: @test_mask_vfmadd_ss(			; CHECK-LABEL: @test_mask_vfmadd_ss(
	; CHECK-NEXT: [[RES:%.*]] = tail call <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[C:%.*]], i8 [[MASK:%.*]], i32 4)			; CHECK-NEXT: [[RES:%.*]] = tail call <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[C:%.*]], i8 [[MASK:%.*]], i32 4)
	; CHECK-NEXT: ret <4 x float> [[RES]]			; CHECK-NEXT: ret <4 x float> [[RES]]
	;			;
	%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1			%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
	... 2,035 lines not shown ...

llvm/trunk/test/Transforms/InstCombine/X86/x86-sse41.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -instcombine -S | FileCheck %s		; RUN: opt < %s -instcombine -S | FileCheck %s
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

define <2 x double> @test_round_sd(<2 x double> %a, <2 x double> %b) {		define <2 x double> @test_round_sd(<2 x double> %a, <2 x double> %b) {
; CHECK-LABEL: @test_round_sd(		; CHECK-LABEL: @test_round_sd(
; CHECK-NEXT: [[TMP1:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> [[A:%.*]], <2 x double> [[B:%.*]], i32 10)		; CHECK-NEXT: [[TMP1:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> [[A:%.*]], <2 x double> [[B:%.*]], i32 10)
; CHECK-NEXT: ret <2 x double> [[TMP1]]		; CHECK-NEXT: ret <2 x double> [[TMP1]]
;		;
%1 = insertelement <2 x double> %a, double 1.000000e+00, i32 0		%1 = insertelement <2 x double> %a, double 1.000000e+00, i32 0
%2 = insertelement <2 x double> %b, double 2.000000e+00, i32 1		%2 = insertelement <2 x double> %b, double 2.000000e+00, i32 1
%3 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %1, <2 x double> %2, i32 10)		%3 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %1, <2 x double> %2, i32 10)
ret <2 x double> %3		ret <2 x double> %3
}		}

		define <2 x double> @test_round_sd_floor(<2 x double> %a, <2 x double> %b) {
		; CHECK-LABEL: @test_round_sd_floor(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x double> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call double @llvm.floor.f64(double [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[A:%.*]], double [[TMP2]], i64 0
		; CHECK-NEXT: ret <2 x double> [[TMP3]]
		;
		%1 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a, <2 x double> %b, i32 1)
		ret <2 x double> %1
		}

		define <2 x double> @test_round_sd_ceil(<2 x double> %a, <2 x double> %b) {
		; CHECK-LABEL: @test_round_sd_ceil(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x double> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call double @llvm.ceil.f64(double [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[A:%.*]], double [[TMP2]], i64 0
		; CHECK-NEXT: ret <2 x double> [[TMP3]]
		;
		%1 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a, <2 x double> %b, i32 2)
		ret <2 x double> %1
		}

define double @test_round_sd_0(double %a, double %b) {		define double @test_round_sd_0(double %a, double %b) {
; CHECK-LABEL: @test_round_sd_0(		; CHECK-LABEL: @test_round_sd_0(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[B:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> undef, <2 x double> [[TMP1]], i32 10)		; CHECK-NEXT: [[TMP2:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> undef, <2 x double> [[TMP1]], i32 10)
; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0
; CHECK-NEXT: ret double [[TMP3]]		; CHECK-NEXT: ret double [[TMP3]]
;		;
%1 = insertelement <2 x double> undef, double %a, i32 0		%1 = insertelement <2 x double> undef, double %a, i32 0
Show All 28 Lines	;
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%4 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%5 = insertelement <4 x float> %4, float 2.000000e+00, i32 2		%5 = insertelement <4 x float> %4, float 2.000000e+00, i32 2
%6 = insertelement <4 x float> %5, float 3.000000e+00, i32 3		%6 = insertelement <4 x float> %5, float 3.000000e+00, i32 3
%7 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %3, <4 x float> %6, i32 10)		%7 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %3, <4 x float> %6, i32 10)
ret <4 x float> %7		ret <4 x float> %7
}		}

		define <4 x float> @test_round_ss_floor(<4 x float> %a, <4 x float> %b) {
		; CHECK-LABEL: @test_round_ss_floor(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call float @llvm.floor.f32(float [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x float> [[A:%.*]], float [[TMP2]], i64 0
		; CHECK-NEXT: ret <4 x float> [[TMP3]]
		;
		%1 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a, <4 x float> %b, i32 1)
		ret <4 x float> %1
		}

		define <4 x float> @test_round_ss_ceil(<4 x float> %a, <4 x float> %b) {
		; CHECK-LABEL: @test_round_ss_ceil(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call float @llvm.ceil.f32(float [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x float> [[A:%.*]], float [[TMP2]], i64 0
		; CHECK-NEXT: ret <4 x float> [[TMP3]]
		;
		%1 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a, <4 x float> %b, i32 2)
		ret <4 x float> %1
		}

define float @test_round_ss_0(float %a, float %b) {		define float @test_round_ss_0(float %a, float %b) {
; CHECK-LABEL: @test_round_ss_0(		; CHECK-LABEL: @test_round_ss_0(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x float> undef, float [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x float> undef, float [[B:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> undef, <4 x float> [[TMP1]], i32 10)		; CHECK-NEXT: [[TMP2:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> undef, <4 x float> [[TMP1]], i32 10)
; CHECK-NEXT: [[R:%.*]] = extractelement <4 x float> [[TMP2]], i32 0		; CHECK-NEXT: [[R:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
; CHECK-NEXT: ret float [[R]]		; CHECK-NEXT: ret float [[R]]
;		;
%1 = insertelement <4 x float> undef, float %a, i32 0		%1 = insertelement <4 x float> undef, float %a, i32 0
Show All 31 Lines