This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil
ClosedPublic

Authored by mike.dvoretsky on Jun 12 2018, 2:59 AM.

Details

Reviewers

craig.topper
spatel
RKSimon

Commits

rG8393f907176c: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil
rL335039: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil

Summary

Currently, X86 floor and ceil intrinsics for vectors are implemented as target-specific intrinsics that use the generic rounding instruction of the corresponding vector processing feature (ROUND* or VRNDSCALE*). This patch replaces those specific cases with calls to target-independent @llvm.floor.* and @llvm.ceil.* intrinsics. This doesn't affect the resulting machine code, as those intrinsics are lowered to the same instructions, but exposes these specific rounding cases to generic optimizations.

This patch relies on D45203 to fold the resulting IR patterns and serves as an alternative to D45202.
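As an illustration of the selection rule described in the summary, here is a minimal standalone C++ sketch. It is not LLVM code: the enum and function names are hypothetical. The constants (rounding control 1 for floor, 2 for ceil, and an SAE operand of 4) are taken from the condition in simplifyX86round in this diff.

```cpp
#include <cassert>

// Hypothetical stand-in for the generic intrinsic chosen by the transform.
enum class GenericRound { None, Floor, Ceil };

// Values taken from the condition in simplifyX86round in this diff.
constexpr unsigned kFloorImm = 1;    // rounding control that means floor
constexpr unsigned kCeilImm = 2;     // rounding control that means ceil
constexpr unsigned kAcceptedSAE = 4; // only SAE value the transform accepts

// Returns the generic operation for a (rounding control, SAE) pair, or
// GenericRound::None when the target-specific intrinsic must be kept.
GenericRound classifyRound(unsigned roundControl, unsigned sae) {
  if (sae != kAcceptedSAE)
    return GenericRound::None;
  if (roundControl == kFloorImm)
    return GenericRound::Floor;
  if (roundControl == kCeilImm)
    return GenericRound::Ceil;
  return GenericRound::None;
}
```

Under this model, an immediate such as 10 (used by the existing test_round_sd test) is left alone, which matches the unchanged CHECK line for that test.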

Diff Detail

Event Timeline

mike.dvoretsky created this revision. Jun 12 2018, 2:59 AM

Herald added subscribers: llvm-commits, hiraditya. Jun 12 2018, 2:59 AM

mike.dvoretsky mentioned this in D45202: [X86] Replacing X86-specific floor and ceil vector intrinsics with generic LLVM intrinsics. Jun 12 2018, 3:00 AM

craig.topper added inline comments. Jun 12 2018, 10:27 AM

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
582	Why int and getSExtValue? Feels like it should be unsigned and getZExtValue.
586	Not sure if we should assume the rounding mode or SAE is a constant int. The clang frontend guarantees it, but handcrafted IR tests could break it.
619	Why would MaskTy vary here? It's fixed by the intrinsic isn't it?
621	There's an overload of CreateAnd that accepts a uint64_t for RHS, so you can probably just pass 1 here without creating the ConstantInt yourself.
652	For less than 8, you should cast i8 to v8i1 first then extract the subvector. We don't want i4 and i2 types floating around.
2575	Why are we calling simplifyX86Round nested under SimplifyDemandedVectorElts? Simplifying should be completely orthogonal. If a simplification happened and the round intrinsic still exists, InstCombine will revisit here and SimplifyDemandedVectorElts will return nullptr. No need to try to handle every case in one shot.
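The suggestion on line 652 (cast the i8 mask to v8i1 first, then take the low lanes) can be modeled outside LLVM. The sketch below is plain C++ with hypothetical helper names. It assumes the x86 little-endian convention that lane i of the bit vector corresponds to bit i of the integer mask.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models "bitcast i8 %k to <8 x i1>": lane i holds bit i of the mask
// (assumption: little-endian lane order, as on x86).
std::array<bool, 8> maskToLanes(std::uint8_t k) {
  std::array<bool, 8> lanes{};
  for (std::size_t i = 0; i < lanes.size(); ++i)
    lanes[i] = ((k >> i) & 1u) != 0;
  return lanes;
}

// Models extracting the low `width` lanes of the 8-lane mask, so no i4 or i2
// integer type is ever needed.
std::vector<bool> lowLanes(std::uint8_t k, std::size_t width) {
  assert(width <= 8 && "an i8 mask covers at most 8 lanes");
  const std::array<bool, 8> all = maskToLanes(k);
  return std::vector<bool>(all.begin(),
                           all.begin() + static_cast<std::ptrdiff_t>(width));
}
```

For a 4-lane vector, lowLanes(k, 4) yields the same lane values as truncating k to 4 bits and reinterpreting the result, without introducing the narrow integer type.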

I think where we ultimately want to end up is to remove the masking from the packed intrinsics and replace the scalar intrinsics with versions that use f32/f64 as their types. The IR would then look similar to where we're trying to end up for things like sqrt. But instead of a target independent intrinsic we would have a target specific intrinsic. All the masking and insert/extract would be completely separate from the operation itself. That would greatly simplify the InstCombine code here because you would just need to trade out the target specific intrinsic for the floor/ceil intrinsic without having to worry about anything else. Thoughts?
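To make the separation of operation, masking, and insert/extract concrete, here is a small model of the masked scalar floor case in plain C++. It is a sketch with hypothetical names, not LLVM code. It follows the shape of the expected IR in the x86-avx512.ll tests of this revision: extract lane 0, round, select on bit 0 of the mask, insert into the pass-through vector.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

using V4f = std::array<float, 4>;

// Model of the masked scalar floor: lane 0 is floor(src1[0]) when bit 0 of k
// is set and dst[0] otherwise; lanes 1..3 are copied from src0.
V4f maskedScalarFloor(const V4f &src0, const V4f &src1, const V4f &dst,
                      std::uint8_t k) {
  const float rounded = std::floor(src1[0]);           // the operation itself
  const float low = (k & 1u) != 0 ? rounded : dst[0];  // masking, kept separate
  V4f result = src0;                                   // upper lanes pass through
  result[0] = low;                                     // insert into lane 0
  return result;
}
```

Only the std::floor call corresponds to the rounding operation; everything else is generic select and insert/extract logic, which is the part the comment above proposes to keep out of the intrinsic.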

Changes made per comments. Note that zext IR instructions have been fully excluded from all patterns, which will require altering vec_floor.ll tests in D45203 if this revision is accepted.

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
586	Rounding and SAE are immediate. Not using constants here leads to isel failure. Not sure this code should accept such explicitly incorrect cases.

lebedev.ri added a subscriber: lebedev.ri. Jun 14 2018, 4:34 AM

lebedev.ri added inline comments.

llvm/test/Transforms/InstCombine/X86/x86-avx.ll
35	ceil, not deil
llvm/test/Transforms/InstCombine/X86/x86-sse41.ll
7	Can you regenerate this file right now in trunk (and commit), so this noise is gone?

Yes, a non-constant value will fail isel and print a readable error message. But if you assume a constant in InstCombine by just using "cast" and the operand is not one, we'll get a segmentation fault in release builds before we get to isel. This is a much worse experience for users. So I suggest just ignoring any intrinsics with a non-constant operand here and letting isel throw the error.
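The difference between an unchecked and a checked downcast can be shown with a toy class hierarchy. The sketch below is standalone C++ whose types only mimic the names of LLVM's classes; it uses dynamic_cast as a stand-in for dyn_cast, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Toy hierarchy; these are NOT LLVM's classes, only similarly named models.
struct Value {
  virtual ~Value() = default;
};
struct ConstantInt : Value {
  explicit ConstantInt(std::uint64_t v) : value(v) {}
  std::uint64_t value;
};
struct Argument : Value {}; // models a non-constant operand

// Checked downcast in the spirit of dyn_cast: returns nullptr instead of
// assuming the operand is a constant.
const ConstantInt *asConstant(const Value &operand) {
  return dynamic_cast<const ConstantInt *>(&operand);
}

// Bails out (returns false) when the rounding-control operand is not a
// constant, leaving the diagnostic to a later stage.
bool canReplaceWithGeneric(const Value &roundControl) {
  const ConstantInt *imm = asConstant(roundControl);
  if (!imm)
    return false;
  return imm->value == 1 || imm->value == 2; // 1 = floor, 2 = ceil
}
```

With the checked form, a non-constant operand simply means "no transform", which is the behavior suggested above.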

Fixed the typo in the test name and added checks to make the transform stop if the rounding mode immediate and/or SAE are not constant.

LGTM

This revision is now accepted and ready to land. Jun 14 2018, 12:31 PM

mike.dvoretsky mentioned this in D45203: [X86] VRNDSCALE* folding from masked and scalar ffloor and fceil patterns. Jun 15 2018, 1:26 AM

Closed by commit rL335039: [InstCombine] Replacing X86-specific rounding intrinsics with generic floor-ceil (authored by mike.dvoretsky). Jun 19 2018, 3:54 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp	132 lines
llvm/test/Transforms/InstCombine/X86/x86-avx.ll	41 lines
llvm/test/Transforms/InstCombine/X86/x86-avx512.ll	207 lines
llvm/test/Transforms/InstCombine/X86/x86-sse41.ll	52 lines

Diff 150911

llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp

[lines omitted; context: for (unsigned Elt = 0; Elt != NumDstEltsPerLane; ++Elt) {]

Vals.push_back(ConstantInt::get(ResTy->getScalarType(), Val));		Vals.push_back(ConstantInt::get(ResTy->getScalarType(), Val));
}		}
}		}

return ConstantVector::get(Vals);		return ConstantVector::get(Vals);
}		}

		// Replace X86-specific intrinsics with generic floor-ceil where applicable.
		static Value *simplifyX86round(IntrinsicInst &II,
		                               InstCombiner::BuilderTy &Builder) {
		  int RoundControl;
		  Intrinsic::ID IntrinsicID = II.getIntrinsicID();
		  if (IntrinsicID == Intrinsic::x86_sse41_round_ss ||
		      IntrinsicID == Intrinsic::x86_sse41_round_sd)
		    RoundControl = cast<ConstantInt>(II.getArgOperand(2))->getSExtValue();
		  else if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		           IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd)
		    RoundControl = cast<ConstantInt>(II.getArgOperand(4))->getSExtValue();
		  else
		    RoundControl = cast<ConstantInt>(II.getArgOperand(1))->getSExtValue();

		  int SAE;
		  if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_512 ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_512)
		    SAE = cast<ConstantInt>(II.getArgOperand(4))->getSExtValue();
		  else if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		           IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd)
		    SAE = cast<ConstantInt>(II.getArgOperand(5))->getSExtValue();
		  else
		    SAE = 4;

		  if (SAE != 4 || (RoundControl != 2 /*ceil*/ && RoundControl != 1 /*floor*/))
		    return nullptr;

		  Value *Src, *Dst, *Mask;
		  bool IsScalar = false;
		  if (IntrinsicID == Intrinsic::x86_sse41_round_ss ||
		      IntrinsicID == Intrinsic::x86_sse41_round_sd ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		    IsScalar = true;
		    if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		      Type *MaskTy = II.getArgOperand(3)->getType();
		      Type *I32Ty = Builder.getInt32Ty();
		      Value *One = ConstantInt::get(I32Ty, 1);
		      Value *Zero = Constant::getNullValue(I32Ty);
		      Mask = (MaskTy == I32Ty) ? II.getArgOperand(3)
		                               : Builder.CreateZExt(II.getArgOperand(3), I32Ty);
		      Mask = Builder.CreateAnd(Mask, One);
		      Mask = Builder.CreateICmp(ICmpInst::ICMP_NE, Mask, Zero);
		      Dst = II.getArgOperand(2);
		    } else
		      Dst = II.getArgOperand(0);
		    Src = Builder.CreateExtractElement(II.getArgOperand(1), (uint64_t)0);
		  } else {
		    Src = II.getArgOperand(0);
		    if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_128 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_256 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ps_512 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_128 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_256 ||
		        IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_pd_512) {
		      Dst = II.getArgOperand(2);
		      Mask = II.getArgOperand(3);
		    } else {
		      Dst = Src;
		      Mask = ConstantInt::getAllOnesValue(
		          Builder.getIntNTy(Src->getType()->getVectorNumElements()));
		    }
		  }

		  Intrinsic::ID ID = (RoundControl == 2) ? Intrinsic::ceil : Intrinsic::floor;
		  Value *Res = Builder.CreateIntrinsic(ID, {Src}, &II);
		  if (!IsScalar) {
		    if (auto *C = dyn_cast<Constant>(Mask))
		      if (C->isAllOnesValue())
		        return Res;
		    int Width = Src->getType()->getVectorNumElements();
		    if (Width < 8) {
		      auto *MaskIntTy = Builder.getIntNTy(Width);
		      Mask = Builder.CreateTrunc(Mask, MaskIntTy);
		    }
		    auto *MaskTy = VectorType::get(
		        Builder.getInt1Ty(), cast<IntegerType>(Mask->getType())->getBitWidth());
		    Mask = Builder.CreateBitCast(Mask, MaskTy);
		    return Builder.CreateSelect(Mask, Res, Dst);
		  }
		  if (IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_ss ||
		      IntrinsicID == Intrinsic::x86_avx512_mask_rndscale_sd) {
		    Dst = Builder.CreateExtractElement(Dst, (uint64_t)0);
		    Res = Builder.CreateSelect(Mask, Res, Dst);
		    Dst = II.getArgOperand(0);
		  }
		  return Builder.CreateInsertElement(Dst, Res, (uint64_t)0);
		}

static Value *simplifyX86movmsk(const IntrinsicInst &II) {		static Value *simplifyX86movmsk(const IntrinsicInst &II) {
Value *Arg = II.getArgOperand(0);		Value *Arg = II.getArgOperand(0);
Type *ResTy = II.getType();		Type *ResTy = II.getType();
Type *ArgTy = Arg->getType();		Type *ArgTy = Arg->getType();

// movmsk(undef) -> zero as we must ensure the upper bits are zero.		// movmsk(undef) -> zero as we must ensure the upper bits are zero.
if (isa<UndefValue>(Arg))		if (isa<UndefValue>(Arg))
return Constant::getNullValue(ResTy);		return Constant::getNullValue(ResTy);
[lines omitted; context: case Intrinsic::x86_avx512_cvttsd2usi64: {]
unsigned VWidth = Arg->getType()->getVectorNumElements();		unsigned VWidth = Arg->getType()->getVectorNumElements();
if (Value *V = SimplifyDemandedVectorEltsLow(Arg, VWidth, 1)) {		if (Value *V = SimplifyDemandedVectorEltsLow(Arg, VWidth, 1)) {
II->setArgOperand(0, V);		II->setArgOperand(0, V);
return II;		return II;
}		}
break;		break;
}		}

		case Intrinsic::x86_sse41_round_ps:
		case Intrinsic::x86_sse41_round_pd:
		case Intrinsic::x86_avx_round_ps_256:
		case Intrinsic::x86_avx_round_pd_256:
		case Intrinsic::x86_avx512_mask_rndscale_ps_128:
		case Intrinsic::x86_avx512_mask_rndscale_ps_256:
		case Intrinsic::x86_avx512_mask_rndscale_ps_512:
		case Intrinsic::x86_avx512_mask_rndscale_pd_128:
		case Intrinsic::x86_avx512_mask_rndscale_pd_256:
		case Intrinsic::x86_avx512_mask_rndscale_pd_512:
		case Intrinsic::x86_avx512_mask_rndscale_ss:
		case Intrinsic::x86_avx512_mask_rndscale_sd:
		if (Value *V = simplifyX86round(*II, Builder))
		return replaceInstUsesWith(*II, V);
		break;

case Intrinsic::x86_mmx_pmovmskb:		case Intrinsic::x86_mmx_pmovmskb:
case Intrinsic::x86_sse_movmsk_ps:		case Intrinsic::x86_sse_movmsk_ps:
case Intrinsic::x86_sse2_movmsk_pd:		case Intrinsic::x86_sse2_movmsk_pd:
case Intrinsic::x86_sse2_pmovmskb_128:		case Intrinsic::x86_sse2_pmovmskb_128:
case Intrinsic::x86_avx_movmsk_pd_256:		case Intrinsic::x86_avx_movmsk_pd_256:
case Intrinsic::x86_avx_movmsk_ps_256:		case Intrinsic::x86_avx_movmsk_ps_256:
case Intrinsic::x86_avx2_pmovmskb:		case Intrinsic::x86_avx2_pmovmskb:
if (Value *V = simplifyX86movmsk(*II))		if (Value *V = simplifyX86movmsk(*II))
[lines omitted; context: Instruction *InstCombiner::visitCallInst(CallInst &CI) {]
case Intrinsic::x86_fma_vfmadd_ss:		case Intrinsic::x86_fma_vfmadd_ss:
case Intrinsic::x86_fma_vfmadd_sd:		case Intrinsic::x86_fma_vfmadd_sd:
case Intrinsic::x86_sse_cmp_ss:		case Intrinsic::x86_sse_cmp_ss:
case Intrinsic::x86_sse_min_ss:		case Intrinsic::x86_sse_min_ss:
case Intrinsic::x86_sse_max_ss:		case Intrinsic::x86_sse_max_ss:
case Intrinsic::x86_sse2_cmp_sd:		case Intrinsic::x86_sse2_cmp_sd:
case Intrinsic::x86_sse2_min_sd:		case Intrinsic::x86_sse2_min_sd:
case Intrinsic::x86_sse2_max_sd:		case Intrinsic::x86_sse2_max_sd:
case Intrinsic::x86_sse41_round_ss:
case Intrinsic::x86_sse41_round_sd:
case Intrinsic::x86_xop_vfrcz_ss:		case Intrinsic::x86_xop_vfrcz_ss:
case Intrinsic::x86_xop_vfrcz_sd: {		case Intrinsic::x86_xop_vfrcz_sd: {
unsigned VWidth = II->getType()->getVectorNumElements();		unsigned VWidth = II->getType()->getVectorNumElements();
APInt UndefElts(VWidth, 0);		APInt UndefElts(VWidth, 0);
APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));		APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {		if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
if (V != II)		if (V != II)
return replaceInstUsesWith(*II, V);		return replaceInstUsesWith(*II, V);
return II;		return II;
}		}
break;		break;
}		}
		case Intrinsic::x86_sse41_round_ss:
		case Intrinsic::x86_sse41_round_sd: {
		unsigned VWidth = II->getType()->getVectorNumElements();
		APInt UndefElts(VWidth, 0);
		APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
		if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
		if (V != II) {
		if (auto NewII = dyn_cast<IntrinsicInst>(V))
		if (NewII->getIntrinsicID() == Intrinsic::x86_sse41_round_ss ||
		NewII->getIntrinsicID() == Intrinsic::x86_sse41_round_sd) {
		Value *NewV = simplifyX86round(*NewII, Builder);
		if (NewV)
		V = NewV;
		}
		return replaceInstUsesWith(*II, V);
		}
		V = simplifyX86round(*II, Builder);
		if (V)
		return replaceInstUsesWith(*II, V);
		return II;
		} else if (Value *V = simplifyX86round(*II, Builder))
		return replaceInstUsesWith(*II, V);
		break;
		}

// Constant fold ashr( <A x Bi>, Ci ).		// Constant fold ashr( <A x Bi>, Ci ).
// Constant fold lshr( <A x Bi>, Ci ).		// Constant fold lshr( <A x Bi>, Ci ).
// Constant fold shl( <A x Bi>, Ci ).		// Constant fold shl( <A x Bi>, Ci ).
case Intrinsic::x86_sse2_psrai_d:		case Intrinsic::x86_sse2_psrai_d:
case Intrinsic::x86_sse2_psrai_w:		case Intrinsic::x86_sse2_psrai_w:
case Intrinsic::x86_avx2_psrai_d:		case Intrinsic::x86_avx2_psrai_d:
case Intrinsic::x86_avx2_psrai_w:		case Intrinsic::x86_avx2_psrai_w:
[remaining lines omitted]

llvm/test/Transforms/InstCombine/X86/x86-avx.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt < %s -instcombine -S | FileCheck %s

				declare <8 x float> @llvm.x86.avx.round.ps.256(<8 x float>, i32)
				declare <4 x double> @llvm.x86.avx.round.pd.256(<4 x double>, i32)

				define <8 x float> @test_round_ps_floor(<8 x float> %a) {
				; CHECK-LABEL: @test_round_ps_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.floor.v8f32(<8 x float> [[A:%.*]])
				; CHECK-NEXT: ret <8 x float> [[TMP1]]
				;
				%1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a, i32 1)
				ret <8 x float> %1
				}

				define <8 x float> @test_round_ps_ceil(<8 x float> %a) {
				; CHECK-LABEL: @test_round_ps_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.ceil.v8f32(<8 x float> [[A:%.*]])
				; CHECK-NEXT: ret <8 x float> [[TMP1]]
				;
				%1 = call <8 x float> @llvm.x86.avx.round.ps.256(<8 x float> %a, i32 2)
				ret <8 x float> %1
				}

				define <4 x double> @test_round_pd_floor(<4 x double> %a) {
				; CHECK-LABEL: @test_round_pd_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.floor.v4f64(<4 x double> [[A:%.*]])
				; CHECK-NEXT: ret <4 x double> [[TMP1]]
				;
				%1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a, i32 1)
				ret <4 x double> %1
				}

				define <4 x double> @test_round_pd_deil(<4 x double> %a) {
				; CHECK-LABEL: @test_round_pd_deil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.ceil.v4f64(<4 x double> [[A:%.*]])
				; CHECK-NEXT: ret <4 x double> [[TMP1]]
				;
				%1 = call <4 x double> @llvm.x86.avx.round.pd.256(<4 x double> %a, i32 2)
				ret <4 x double> %1
				}

llvm/test/Transforms/InstCombine/X86/x86-avx512.ll

	[lines omitted]
	declare i64 @llvm.x86.avx512.vcvtss2usi64(<4 x float>, i32)			declare i64 @llvm.x86.avx512.vcvtss2usi64(<4 x float>, i32)
	declare i32 @llvm.x86.avx512.cvttss2usi(<4 x float>, i32)			declare i32 @llvm.x86.avx512.cvttss2usi(<4 x float>, i32)
	declare i64 @llvm.x86.avx512.cvttss2usi64(<4 x float>, i32)			declare i64 @llvm.x86.avx512.cvttss2usi64(<4 x float>, i32)
	declare i32 @llvm.x86.avx512.vcvtsd2usi32(<2 x double>, i32)			declare i32 @llvm.x86.avx512.vcvtsd2usi32(<2 x double>, i32)
	declare i64 @llvm.x86.avx512.vcvtsd2usi64(<2 x double>, i32)			declare i64 @llvm.x86.avx512.vcvtsd2usi64(<2 x double>, i32)
	declare i32 @llvm.x86.avx512.cvttsd2usi(<2 x double>, i32)			declare i32 @llvm.x86.avx512.cvttsd2usi(<2 x double>, i32)
	declare i64 @llvm.x86.avx512.cvttsd2usi64(<2 x double>, i32)			declare i64 @llvm.x86.avx512.cvttsd2usi64(<2 x double>, i32)

				declare <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32, i32)
				declare <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double>, <2 x double>, <2 x double>, i8, i32, i32)
				declare <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float>, i32, <4 x float>, i8)
				declare <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float>, i32, <8 x float>, i8)
				declare <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float>, i32, <16 x float>, i16, i32)
				declare <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double>, i32, <2 x double>, i8)
				declare <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double>, i32, <4 x double>, i8)
				declare <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double>, i32, <8 x double>, i8, i32)

				define <4 x float> @test_rndscale_ss_floor(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ss_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call float @llvm.floor.f32(float [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x float> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], float [[TMP5]], float [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> [[SRC0:%.*]], float [[TMP6]], i64 0
				; CHECK-NEXT: ret <4 x float> [[TMP7]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k, i32 1, i32 4)
				ret <4 x float> %1
				}

				define <4 x float> @test_rndscale_ss_ceil(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ss_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call float @llvm.ceil.f32(float [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x float> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], float [[TMP5]], float [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> [[SRC0:%.*]], float [[TMP6]], i64 0
				; CHECK-NEXT: ret <4 x float> [[TMP7]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ss(<4 x float> %src0, <4 x float> %src1, <4 x float> %dst, i8 %k, i32 2, i32 4)
				ret <4 x float> %1
				}

				define <2 x double> @test_rndscale_sd_floor(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_sd_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call double @llvm.floor.f64(double [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], double [[TMP5]], double [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[SRC0:%.*]], double [[TMP6]], i64 0
				; CHECK-NEXT: ret <2 x double> [[TMP7]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k, i32 1, i32 4)
				ret <2 x double> %1
				}

				define <2 x double> @test_rndscale_sd_ceil(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_sd_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = and i8 [[K:%.*]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i8 [[TMP1]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[SRC1:%.*]], i64 0
				; CHECK-NEXT: [[TMP4:%.*]] = call double @llvm.ceil.f64(double [[TMP3]])
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x double> [[DST:%.*]], i64 0
				; CHECK-NEXT: [[TMP6:%.*]] = select i1 [[TMP2]], double [[TMP5]], double [[TMP4]]
				; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> [[SRC0:%.*]], double [[TMP6]], i64 0
				; CHECK-NEXT: ret <2 x double> [[TMP7]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.sd(<2 x double> %src0, <2 x double> %src1, <2 x double> %dst, i8 %k, i32 2, i32 4)
				ret <2 x double> %1
				}

				define <4 x float> @test_rndscale_ps_128_floor(<4 x float> %src, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_128_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x float> @llvm.floor.v4f32(<4 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i4
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i4 [[TMP2]] to <4 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x float> [[TMP1]], <4 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x float> [[TMP4]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float> %src, i32 1, <4 x float> %dst, i8 %k)
				ret <4 x float> %1
				}

				define <4 x float> @test_rndscale_ps_128_ceil(<4 x float> %src, <4 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_128_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x float> @llvm.ceil.v4f32(<4 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i4
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i4 [[TMP2]] to <4 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x float> [[TMP1]], <4 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x float> [[TMP4]]
				;
				%1 = call <4 x float> @llvm.x86.avx512.mask.rndscale.ps.128(<4 x float> %src, i32 2, <4 x float> %dst, i8 %k)
				ret <4 x float> %1
				}

				define <8 x float> @test_rndscale_ps_256_floor(<8 x float> %src, <8 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_256_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.floor.v8f32(<8 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x float> [[TMP1]], <8 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x float> [[TMP3]]
				;
				%1 = call <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float> %src, i32 1, <8 x float> %dst, i8 %k)
				ret <8 x float> %1
				}

				define <8 x float> @test_rndscale_ps_256_ceil(<8 x float> %src, <8 x float> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_ps_256_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x float> @llvm.ceil.v8f32(<8 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x float> [[TMP1]], <8 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x float> [[TMP3]]
				;
				%1 = call <8 x float> @llvm.x86.avx512.mask.rndscale.ps.256(<8 x float> %src, i32 2, <8 x float> %dst, i8 %k)
				ret <8 x float> %1
				}

				define <16 x float> @test_rndscale_ps_512_floor(<16 x float> %src, <16 x float> %dst, i16 %k) {
				; CHECK-LABEL: @test_rndscale_ps_512_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <16 x float> @llvm.floor.v16f32(<16 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i16 [[K:%.*]] to <16 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[TMP2]], <16 x float> [[TMP1]], <16 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <16 x float> [[TMP3]]
				;
				%1 = call <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float> %src, i32 1, <16 x float> %dst, i16 %k, i32 4)
				ret <16 x float> %1
				}

				define <16 x float> @test_rndscale_ps_512_ceil(<16 x float> %src, <16 x float> %dst, i16 %k) {
				; CHECK-LABEL: @test_rndscale_ps_512_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <16 x float> @llvm.ceil.v16f32(<16 x float> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i16 [[K:%.*]] to <16 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[TMP2]], <16 x float> [[TMP1]], <16 x float> [[DST:%.*]]
				; CHECK-NEXT: ret <16 x float> [[TMP3]]
				;
				%1 = call <16 x float> @llvm.x86.avx512.mask.rndscale.ps.512(<16 x float> %src, i32 2, <16 x float> %dst, i16 %k, i32 4)
				ret <16 x float> %1
				}

				define <2 x double> @test_rndscale_pd_128_floor(<2 x double> %src, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_128_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <2 x double> @llvm.floor.v2f64(<2 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i2
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i2 [[TMP2]] to <2 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP3]], <2 x double> [[TMP1]], <2 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <2 x double> [[TMP4]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double> %src, i32 1, <2 x double> %dst, i8 %k)
				ret <2 x double> %1
				}

				define <2 x double> @test_rndscale_pd_128_ceil(<2 x double> %src, <2 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_128_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <2 x double> @llvm.ceil.v2f64(<2 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i2
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i2 [[TMP2]] to <2 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <2 x i1> [[TMP3]], <2 x double> [[TMP1]], <2 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <2 x double> [[TMP4]]
				;
				%1 = call <2 x double> @llvm.x86.avx512.mask.rndscale.pd.128(<2 x double> %src, i32 2, <2 x double> %dst, i8 %k)
				ret <2 x double> %1
				}

				define <4 x double> @test_rndscale_pd_256_floor(<4 x double> %src, <4 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_256_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.floor.v4f64(<4 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i4
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i4 [[TMP2]] to <4 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x double> [[TMP1]], <4 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x double> [[TMP4]]
				;
				%1 = call <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double> %src, i32 1, <4 x double> %dst, i8 %k)
				ret <4 x double> %1
				}

				define <4 x double> @test_rndscale_pd_256_ceil(<4 x double> %src, <4 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_256_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <4 x double> @llvm.ceil.v4f64(<4 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = trunc i8 [[K:%.*]] to i4
				; CHECK-NEXT: [[TMP3:%.*]] = bitcast i4 [[TMP2]] to <4 x i1>
				; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP3]], <4 x double> [[TMP1]], <4 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <4 x double> [[TMP4]]
				;
				%1 = call <4 x double> @llvm.x86.avx512.mask.rndscale.pd.256(<4 x double> %src, i32 2, <4 x double> %dst, i8 %k)
				ret <4 x double> %1
				}

				define <8 x double> @test_rndscale_pd_512_floor(<8 x double> %src, <8 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_512_floor(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x double> @llvm.floor.v8f64(<8 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x double> [[TMP1]], <8 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x double> [[TMP3]]
				;
				%1 = call <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double> %src, i32 1, <8 x double> %dst, i8 %k, i32 4)
				ret <8 x double> %1
				}

				define <8 x double> @test_rndscale_pd_512_ceil(<8 x double> %src, <8 x double> %dst, i8 %k) {
				; CHECK-LABEL: @test_rndscale_pd_512_ceil(
				; CHECK-NEXT: [[TMP1:%.*]] = call <8 x double> @llvm.ceil.v8f64(<8 x double> [[SRC:%.*]])
				; CHECK-NEXT: [[TMP2:%.*]] = bitcast i8 [[K:%.*]] to <8 x i1>
				; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[TMP2]], <8 x double> [[TMP1]], <8 x double> [[DST:%.*]]
				; CHECK-NEXT: ret <8 x double> [[TMP3]]
				;
				%1 = call <8 x double> @llvm.x86.avx512.mask.rndscale.pd.512(<8 x double> %src, i32 2, <8 x double> %dst, i8 %k, i32 4)
				ret <8 x double> %1
				}

	declare <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32)			declare <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float>, <4 x float>, <4 x float>, i8, i32)

	define <4 x float> @test_mask_vfmadd_ss(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {			define <4 x float> @test_mask_vfmadd_ss(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
	; CHECK-LABEL: @test_mask_vfmadd_ss(			; CHECK-LABEL: @test_mask_vfmadd_ss(
	; CHECK-NEXT: [[RES:%.*]] = tail call <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[C:%.*]], i8 [[MASK:%.*]], i32 4)			; CHECK-NEXT: [[RES:%.*]] = tail call <4 x float> @llvm.x86.avx512.mask.vfmadd.ss(<4 x float> [[A:%.*]], <4 x float> [[B:%.*]], <4 x float> [[C:%.*]], i8 [[MASK:%.*]], i32 4)
	; CHECK-NEXT: ret <4 x float> [[RES]]			; CHECK-NEXT: ret <4 x float> [[RES]]
	;			;
	%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1			%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1

llvm/test/Transforms/InstCombine/X86/x86-sse41.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -instcombine -S | FileCheck %s		; RUN: opt < %s -instcombine -S | FileCheck %s
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

define <2 x double> @test_round_sd(<2 x double> %a, <2 x double> %b) {		define <2 x double> @test_round_sd(<2 x double> %a, <2 x double> %b) {
; CHECK-LABEL: @test_round_sd(		; CHECK-LABEL: @test_round_sd(
; CHECK-NEXT: [[TMP1:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a, <2 x double> %b, i32 10)		; CHECK-NEXT: [[TMP1:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> [[A:%.*]], <2 x double> [[B:%.*]], i32 10)
		lebedev.ri: Can you regenerate this file right now in trunk (and commit), so this noise is gone?
; CHECK-NEXT: ret <2 x double> [[TMP1]]		; CHECK-NEXT: ret <2 x double> [[TMP1]]
;		;
%1 = insertelement <2 x double> %a, double 1.000000e+00, i32 0		%1 = insertelement <2 x double> %a, double 1.000000e+00, i32 0
%2 = insertelement <2 x double> %b, double 2.000000e+00, i32 1		%2 = insertelement <2 x double> %b, double 2.000000e+00, i32 1
%3 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %1, <2 x double> %2, i32 10)		%3 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %1, <2 x double> %2, i32 10)
ret <2 x double> %3		ret <2 x double> %3
}		}

		define <2 x double> @test_round_sd_floor(<2 x double> %a, <2 x double> %b) {
		; CHECK-LABEL: @test_round_sd_floor(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x double> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call double @llvm.floor.f64(double [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[A:%.*]], double [[TMP2]], i64 0
		; CHECK-NEXT: ret <2 x double> [[TMP3]]
		;
		%1 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a, <2 x double> %b, i32 1)
		ret <2 x double> %1
		}

		define <2 x double> @test_round_sd_ceil(<2 x double> %a, <2 x double> %b) {
		; CHECK-LABEL: @test_round_sd_ceil(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <2 x double> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call double @llvm.ceil.f64(double [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[A:%.*]], double [[TMP2]], i64 0
		; CHECK-NEXT: ret <2 x double> [[TMP3]]
		;
		%1 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %a, <2 x double> %b, i32 2)
		ret <2 x double> %1
		}

define double @test_round_sd_0(double %a, double %b) {		define double @test_round_sd_0(double %a, double %b) {
; CHECK-LABEL: @test_round_sd_0(		; CHECK-LABEL: @test_round_sd_0(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double %b, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> undef, double [[B:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> undef, <2 x double> [[TMP1]], i32 10)		; CHECK-NEXT: [[TMP2:%.*]] = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> undef, <2 x double> [[TMP1]], i32 10)
; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 0
; CHECK-NEXT: ret double [[TMP3]]		; CHECK-NEXT: ret double [[TMP3]]
;		;
%1 = insertelement <2 x double> undef, double %a, i32 0		%1 = insertelement <2 x double> undef, double %a, i32 0
%2 = insertelement <2 x double> %1, double 1.000000e+00, i32 1		%2 = insertelement <2 x double> %1, double 1.000000e+00, i32 1
%3 = insertelement <2 x double> undef, double %b, i32 0		%3 = insertelement <2 x double> undef, double %b, i32 0
%4 = insertelement <2 x double> %3, double 2.000000e+00, i32 1		%4 = insertelement <2 x double> %3, double 2.000000e+00, i32 1
[... 12 lines collapsed ...]
%4 = insertelement <2 x double> %3, double 2.000000e+00, i32 1		%4 = insertelement <2 x double> %3, double 2.000000e+00, i32 1
%5 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %2, <2 x double> %4, i32 10)		%5 = tail call <2 x double> @llvm.x86.sse41.round.sd(<2 x double> %2, <2 x double> %4, i32 10)
%6 = extractelement <2 x double> %5, i32 1		%6 = extractelement <2 x double> %5, i32 1
ret double %6		ret double %6
}		}

define <4 x float> @test_round_ss(<4 x float> %a, <4 x float> %b) {		define <4 x float> @test_round_ss(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: @test_round_ss(		; CHECK-LABEL: @test_round_ss(
; CHECK-NEXT: [[TMP1:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> <float undef, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>, <4 x float> %b, i32 10)		; CHECK-NEXT: [[TMP1:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> <float undef, float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>, <4 x float> [[B:%.*]], i32 10)
; CHECK-NEXT: ret <4 x float> [[TMP1]]		; CHECK-NEXT: ret <4 x float> [[TMP1]]
;		;
%1 = insertelement <4 x float> %a, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %a, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%4 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%5 = insertelement <4 x float> %4, float 2.000000e+00, i32 2		%5 = insertelement <4 x float> %4, float 2.000000e+00, i32 2
%6 = insertelement <4 x float> %5, float 3.000000e+00, i32 3		%6 = insertelement <4 x float> %5, float 3.000000e+00, i32 3
%7 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %3, <4 x float> %6, i32 10)		%7 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %3, <4 x float> %6, i32 10)
ret <4 x float> %7		ret <4 x float> %7
}		}

		define <4 x float> @test_round_ss_floor(<4 x float> %a, <4 x float> %b) {
		; CHECK-LABEL: @test_round_ss_floor(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call float @llvm.floor.f32(float [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x float> [[A:%.*]], float [[TMP2]], i64 0
		; CHECK-NEXT: ret <4 x float> [[TMP3]]
		;
		%1 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a, <4 x float> %b, i32 1)
		ret <4 x float> %1
		}

		define <4 x float> @test_round_ss_ceil(<4 x float> %a, <4 x float> %b) {
		; CHECK-LABEL: @test_round_ss_ceil(
		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
		; CHECK-NEXT: [[TMP2:%.*]] = call float @llvm.ceil.f32(float [[TMP1]])
		; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x float> [[A:%.*]], float [[TMP2]], i64 0
		; CHECK-NEXT: ret <4 x float> [[TMP3]]
		;
		%1 = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> %a, <4 x float> %b, i32 2)
		ret <4 x float> %1
		}

define float @test_round_ss_0(float %a, float %b) {		define float @test_round_ss_0(float %a, float %b) {
; CHECK-LABEL: @test_round_ss_0(		; CHECK-LABEL: @test_round_ss_0(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x float> undef, float %b, i32 0		; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x float> undef, float [[B:%.*]], i32 0
; CHECK-NEXT: [[TMP2:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> undef, <4 x float> [[TMP1]], i32 10)		; CHECK-NEXT: [[TMP2:%.*]] = tail call <4 x float> @llvm.x86.sse41.round.ss(<4 x float> undef, <4 x float> [[TMP1]], i32 10)
; CHECK-NEXT: [[R:%.*]] = extractelement <4 x float> [[TMP2]], i32 0		; CHECK-NEXT: [[R:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
; CHECK-NEXT: ret float [[R]]		; CHECK-NEXT: ret float [[R]]
;		;
%1 = insertelement <4 x float> undef, float %a, i32 0		%1 = insertelement <4 x float> undef, float %a, i32 0
%2 = insertelement <4 x float> %1, float 1.000000e+00, i32 1		%2 = insertelement <4 x float> %1, float 1.000000e+00, i32 1
%3 = insertelement <4 x float> %2, float 2.000000e+00, i32 2		%3 = insertelement <4 x float> %2, float 2.000000e+00, i32 2
%4 = insertelement <4 x float> %3, float 3.000000e+00, i32 3		%4 = insertelement <4 x float> %3, float 3.000000e+00, i32 3