This is an archive of the discontinued LLVM Phabricator instance.

SimplifyDemandedVectorElts for all intrinsics
ClosedPublic

Authored by reames on Jan 29 2019, 10:32 AM.

Details

Reviewers

spatel
RKSimon
craig.topper

Commits

rGc71e996aed81: SimplifyDemandedVectorElts for all intrinsics
rL352653: SimplifyDemandedVectorElts for all intrinsics

Summary

The point is that this simplifies integration of new intrinsics into SimplifyDemandedVectorElts, and ensures we don't miss any existing ones.

This is intended to be NFC-ish, but as seen from the diffs, can produce slightly different output.
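
As a rough illustration of what "demanded vector elements" means, here is a toy model in plain C++. It is not LLVM's API: `LaneInsert` and `pruneUndemandedInserts` are hypothetical names invented for this sketch, and lanes are tracked in a simple 64-bit mask. It mirrors the shape of the tests in this revision, where inserts into lanes 1 to 3 of an operand disappear because the scalar intrinsic only reads lane 0.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for one insertelement: write `value` into lane `lane`.
struct LaneInsert {
  unsigned lane;
  float value;
};

// Keep only the inserts whose lane is set in `demandedMask` (bit i = lane i).
// Assumes every lane index is below 64. Deliberately conservative: an insert
// into a demanded lane is kept even if a later insert overwrites it.
std::vector<LaneInsert> pruneUndemandedInserts(
    const std::vector<LaneInsert> &chain, std::uint64_t demandedMask) {
  std::vector<LaneInsert> kept;
  for (const LaneInsert &ins : chain) {
    if (demandedMask & (std::uint64_t{1} << ins.lane))
      kept.push_back(ins);
  }
  return kept;
}
```

With a chain of inserts into lanes 1, 2 and 3 and a demanded mask of `0x1` (lane 0 only), nothing survives, so the consumer can read the base vector directly.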

Event Timeline

reames created this revision. Jan 29 2019, 10:32 AM

Herald added subscribers: bollu, mcrosier. Jan 29 2019, 10:32 AM

reames mentioned this in D57140: [WIP] Teach instcombine how to destroy vector GEPs/loads/stores. Jan 29 2019, 10:33 AM

Any insight into why those tests have changed?

In D57398#1376055, @RKSimon wrote:

Any insight into why those tests have changed?

Thank you for pushing on this. While trying to find an answer to your question, I found a bug in the patch.

The answer is that x86_avx512_mask_mul_sd_round and friends now have two transforms applied in the opposite order. Previously, we did their dedicated transform, then fell through into the code I removed. Now, we simplify via demanded elements first, then process them via dedicated transforms.

That fallthrough is where the bug is, though. We're now falling through into the wrong handler. The fallthrough needs to be replaced with a break.
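
The shape of that bug can be sketched in isolation. The following is a simplified stand-in, not the InstCombine code: the `Op` enum and both dispatch functions are hypothetical, and the trace strings only record which handlers ran. It assumes the situation described above, where deleting an intermediate case group leaves a fallthrough pointing at whatever case comes next.

```cpp
#include <string>

enum class Op { MaskedArith, Round };

// Buggy shape: the case group that used to sit between these two labels has
// been deleted, so the old fallthrough now lands in the rounding handler.
std::string dispatchBuggy(Op op) {
  std::string trace;
  switch (op) {
  case Op::MaskedArith:
    trace += "arith;";
    // falls through - unintended once the intermediate cases are gone
  case Op::Round:
    trace += "round;";
    break;
  }
  return trace;
}

// Fixed shape: the fallthrough is replaced with a break.
std::string dispatchFixed(Op op) {
  std::string trace;
  switch (op) {
  case Op::MaskedArith:
    trace += "arith;";
    break;
  case Op::Round:
    trace += "round;";
    break;
  }
  return trace;
}
```

Calling `dispatchBuggy(Op::MaskedArith)` runs both handlers, while `dispatchFixed(Op::MaskedArith)` runs only the intended one.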

Fix bug noted in last comment.

LGTM - future patches should hopefully be able to move more of the SimplifyDemandedVectorElts/SimplifyDemandedVectorEltsLow code over.

This revision is now accepted and ready to land. Jan 30 2019, 1:01 AM

In D57398#1376798, @RKSimon wrote:

future patches should hopefully be able to move more of the SimplifyDemandedVectorElts/SimplifyDemandedVectorEltsLow code over.

Not sure what you mean here. Over to where? from where?

Closed by commit rL352653: SimplifyDemandedVectorElts for all intrinsics (authored by reames). Jan 30 2019, 11:21 AM

This revision was automatically updated to reflect the committed changes.

In D57398#1377469, @reames wrote:

Not sure what you mean here. Over to where? from where?

I meant for cases like the SimplifyDemandedVectorEltsLow call in Intrinsic::x86_vcvtph2ps_128/256 which can now be removed and similar code moved into InstCombiner::SimplifyDemandedVectorElts so it is correctly handled within a recursive call. Not all intrinsics would work, but some are probably worth it.
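
For readers unfamiliar with the "Low" helper, the mask it builds can be sketched without LLVM. This is a toy analogue under stated assumptions: it uses a 64-bit integer instead of APInt, the names `lowBits` and `lowLanesMask` are made up for this sketch, and vectors wider than 64 lanes are out of scope.

```cpp
#include <cstdint>

// All-ones in the low `n` bits; handles n >= 64 without an undefined shift.
static std::uint64_t lowBits(unsigned n) {
  return n >= 64 ? ~std::uint64_t{0} : (std::uint64_t{1} << n) - 1;
}

// Demanded-lane mask for a vector of `width` lanes of which only the low
// `demandedWidth` lanes are read. Bit i set means lane i is demanded.
std::uint64_t lowLanesMask(unsigned width, unsigned demandedWidth) {
  return lowBits(demandedWidth) & lowBits(width);
}
```

For example, `lowLanesMask(8, 4)` yields `0xF`: an 8-lane operand of which only the low four lanes matter, which is the kind of situation the half-to-float conversion mentioned above appears to be in.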

In D57398#1377731, @RKSimon wrote:

I meant for cases like the SimplifyDemandedVectorEltsLow call in Intrinsic::x86_vcvtph2ps_128/256 which can now be removed and similar code moved into InstCombiner::SimplifyDemandedVectorElts so it is correctly handled within a recursive call. Not all intrinsics would work, but some are probably worth it.

Ah, yes. I don't plan to focus on the x86 specific intrinsics, but I agree that would be a worthwhile cleanup.

At a high level, do you know why we need these intrinsics at all? What are they expressing which can be well represented/pattern matched from normal IR?

In D57398#1379071, @reames wrote:

Ah, yes. I don't plan to focus on the x86 specific intrinsics, but I agree that would be a worthwhile cleanup.

I'll take a look at some point.

At a high level, do you know why we need these intrinsics at all? What are they expressing which can be well represented/pattern matched from normal IR?

The cvtph2ps intrinsics are still around because x86 is terrible at f16 lowering (scalar and vector) - see PR37554

Revision Contents

Path	Size
lib/Transforms/InstCombine/InstCombineCalls.cpp	47 lines
test/Transforms/InstCombine/X86/x86-avx512.ll	16 lines

Diff 184181

lib/Transforms/InstCombine/InstCombineCalls.cpp

[... 1,862 lines not shown ...]	if (auto *MI = dyn_cast<AnyMemIntrinsic>(II)) {
} else if (auto *MSI = dyn_cast<AnyMemSetInst>(MI)) {		} else if (auto *MSI = dyn_cast<AnyMemSetInst>(MI)) {
if (Instruction *I = SimplifyAnyMemSet(MSI))		if (Instruction *I = SimplifyAnyMemSet(MSI))
return I;		return I;
}		}

if (Changed) return II;		if (Changed) return II;
}		}

		// For vector result intrinsics, use the generic demanded vector support to
		// simplify any operands before moving on to the per-intrinsic rules.
		if (II->getType()->isVectorTy()) {
		auto VWidth = II->getType()->getVectorNumElements();
		APInt UndefElts(VWidth, 0);
		APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
		if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
		if (V != II)
		return replaceInstUsesWith(*II, V);
		return II;
		}
		}

if (Instruction *I = SimplifyNVVMIntrinsic(II, this))		if (Instruction *I = SimplifyNVVMIntrinsic(II, this))
return I;		return I;

auto SimplifyDemandedVectorEltsLow = [this](Value *Op, unsigned Width,		auto SimplifyDemandedVectorEltsLow = [this](Value *Op, unsigned Width,
unsigned DemandedWidth) {		unsigned DemandedWidth) {
APInt UndefElts(Width, 0);		APInt UndefElts(Width, 0);
APInt DemandedElts = APInt::getLowBitsSet(Width, DemandedWidth);		APInt DemandedElts = APInt::getLowBitsSet(Width, DemandedWidth);
return SimplifyDemandedVectorElts(Op, DemandedElts, UndefElts);		return SimplifyDemandedVectorElts(Op, DemandedElts, UndefElts);
[... 782 lines not shown ...]	if (auto *R = dyn_cast<ConstantInt>(II->getArgOperand(4))) {
}		}

// Insert the result back into the original argument 0.		// Insert the result back into the original argument 0.
V = Builder.CreateInsertElement(Arg0, V, (uint64_t)0);		V = Builder.CreateInsertElement(Arg0, V, (uint64_t)0);

return replaceInstUsesWith(*II, V);		return replaceInstUsesWith(*II, V);
}		}
}		}
LLVM_FALLTHROUGH;

// X86 scalar intrinsics simplified with SimplifyDemandedVectorElts.
case Intrinsic::x86_avx512_mask_max_ss_round:
case Intrinsic::x86_avx512_mask_min_ss_round:
case Intrinsic::x86_avx512_mask_max_sd_round:
case Intrinsic::x86_avx512_mask_min_sd_round:
case Intrinsic::x86_sse_cmp_ss:
case Intrinsic::x86_sse_min_ss:
case Intrinsic::x86_sse_max_ss:
case Intrinsic::x86_sse2_cmp_sd:
case Intrinsic::x86_sse2_min_sd:
case Intrinsic::x86_sse2_max_sd:
case Intrinsic::x86_xop_vfrcz_ss:
case Intrinsic::x86_xop_vfrcz_sd: {
unsigned VWidth = II->getType()->getVectorNumElements();
APInt UndefElts(VWidth, 0);
APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
if (V != II)
return replaceInstUsesWith(*II, V);
return II;
}
break;		break;
}
case Intrinsic::x86_sse41_round_ss:		case Intrinsic::x86_sse41_round_ss:
case Intrinsic::x86_sse41_round_sd: {		case Intrinsic::x86_sse41_round_sd: {
unsigned VWidth = II->getType()->getVectorNumElements();		if (Value *V = simplifyX86round(II, Builder))
APInt UndefElts(VWidth, 0);
APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
if (Value *V = SimplifyDemandedVectorElts(II, AllOnesEltMask, UndefElts)) {
if (V != II)
return replaceInstUsesWith(*II, V);
return II;
} else if (Value *V = simplifyX86round(II, Builder))
return replaceInstUsesWith(*II, V);		return replaceInstUsesWith(*II, V);
break;		break;
}		}

// Constant fold ashr( <A x Bi>, Ci ).		// Constant fold ashr( <A x Bi>, Ci ).
// Constant fold lshr( <A x Bi>, Ci ).		// Constant fold lshr( <A x Bi>, Ci ).
// Constant fold shl( <A x Bi>, Ci ).		// Constant fold shl( <A x Bi>, Ci ).
case Intrinsic::x86_sse2_psrai_d:		case Intrinsic::x86_sse2_psrai_d:
[... 1,972 lines not shown ...]

test/Transforms/InstCombine/X86/x86-avx512.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -instcombine -S | FileCheck %s		; RUN: opt < %s -instcombine -S | FileCheck %s
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"		target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

declare <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)		declare <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)

define <4 x float> @test_add_ss(<4 x float> %a, <4 x float> %b) {		define <4 x float> @test_add_ss(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: @test_add_ss(		; CHECK-LABEL: @test_add_ss(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP4]]		; CHECK-NEXT: ret <4 x float> [[TMP4]]
;		;
%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)
[... 14 lines not shown ...]

define <4 x float> @test_add_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {		define <4 x float> @test_add_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
; CHECK-LABEL: @test_add_ss_mask(		; CHECK-LABEL: @test_add_ss_mask(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i64 0
; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]		; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP8]]		; CHECK-NEXT: ret <4 x float> [[TMP8]]
;		;
%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.add.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)
[... 94 lines not shown ...]	;
ret double %6		ret double %6
}		}

declare <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)		declare <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)

define <4 x float> @test_sub_ss(<4 x float> %a, <4 x float> %b) {		define <4 x float> @test_sub_ss(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: @test_sub_ss(		; CHECK-LABEL: @test_sub_ss(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fsub float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fsub float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP4]]		; CHECK-NEXT: ret <4 x float> [[TMP4]]
;		;
%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)
[... 14 lines not shown ...]

define <4 x float> @test_sub_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {		define <4 x float> @test_sub_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
; CHECK-LABEL: @test_sub_ss_mask(		; CHECK-LABEL: @test_sub_ss_mask(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fsub float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fsub float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i64 0
; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]		; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP8]]		; CHECK-NEXT: ret <4 x float> [[TMP8]]
;		;
%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.sub.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)
[... 94 lines not shown ...]	;
ret double %6		ret double %6
}		}

declare <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)		declare <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)

define <4 x float> @test_mul_ss(<4 x float> %a, <4 x float> %b) {		define <4 x float> @test_mul_ss(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: @test_mul_ss(		; CHECK-LABEL: @test_mul_ss(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fmul float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fmul float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP4]]		; CHECK-NEXT: ret <4 x float> [[TMP4]]
;		;
%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)
[... 14 lines not shown ...]

define <4 x float> @test_mul_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {		define <4 x float> @test_mul_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
; CHECK-LABEL: @test_mul_ss_mask(		; CHECK-LABEL: @test_mul_ss_mask(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fmul float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fmul float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i64 0
; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]		; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP8]]		; CHECK-NEXT: ret <4 x float> [[TMP8]]
;		;
%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.mul.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)
[... 94 lines not shown ...]	;
ret double %6		ret double %6
}		}

declare <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)		declare <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float>, <4 x float>, <4 x float>, i8, i32)

define <4 x float> @test_div_ss(<4 x float> %a, <4 x float> %b) {		define <4 x float> @test_div_ss(<4 x float> %a, <4 x float> %b) {
; CHECK-LABEL: @test_div_ss(		; CHECK-LABEL: @test_div_ss(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i32 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fdiv float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fdiv float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0		; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x float> [[A]], float [[TMP3]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP4]]		; CHECK-NEXT: ret <4 x float> [[TMP4]]
;		;
%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %b, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float> %a, <4 x float> %3, <4 x float> undef, i8 -1, i32 4)
[... 14 lines not shown ...]

define <4 x float> @test_div_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {		define <4 x float> @test_div_ss_mask(<4 x float> %a, <4 x float> %b, <4 x float> %c, i8 %mask) {
; CHECK-LABEL: @test_div_ss_mask(		; CHECK-LABEL: @test_div_ss_mask(
; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0		; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x float> [[A:%.*]], i64 0
; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x float> [[B:%.*]], i64 0
; CHECK-NEXT: [[TMP3:%.*]] = fdiv float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = fdiv float [[TMP1]], [[TMP2]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8 [[MASK:%.*]] to <8 x i1>
; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i1> [[TMP4]], i64 0
; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i32 0		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <4 x float> [[C:%.*]], i64 0
; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]		; CHECK-NEXT: [[TMP7:%.*]] = select i1 [[TMP5]], float [[TMP3]], float [[TMP6]]
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0		; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x float> [[A]], float [[TMP7]], i64 0
; CHECK-NEXT: ret <4 x float> [[TMP8]]		; CHECK-NEXT: ret <4 x float> [[TMP8]]
;		;
%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1		%1 = insertelement <4 x float> %c, float 1.000000e+00, i32 1
%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2		%2 = insertelement <4 x float> %1, float 2.000000e+00, i32 2
%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3		%3 = insertelement <4 x float> %2, float 3.000000e+00, i32 3
%4 = tail call <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)		%4 = tail call <4 x float> @llvm.x86.avx512.mask.div.ss.round(<4 x float> %a, <4 x float> %b, <4 x float> %3, i8 %mask, i32 4)
[... 3,057 lines not shown ...]