This is an archive of the discontinued LLVM Phabricator instance.

[AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors.
ClosedPublic

Authored by craig.topper on Aug 21 2017, 5:51 PM.

Details

Reviewers

guyblank
zvi
RKSimon
spatel
delena

Commits

rGd0c62f909fa4: Merging r311572: --------------------------------------------------------------…
rG853a8d9ffcf7: [AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors
rL311593: Merging r311572:
rL311572: [AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors

Summary

There are no 512-bit blend instructions that select on sign bits, so we shouldn't create SHRUNKBLEND nodes for 512-bit vectors.
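
To make the distinction concrete, here is a rough scalar model of the two selection styles. It is not LLVM code, and the function names are hypothetical. The SSE4.1/AVX variable blends read only the sign bit of each condition element, which is the property SHRUNKBLEND relies on, whereas AVX-512 blends on 512-bit vectors take their selector from a mask register with one bit per lane.

```cpp
#include <cstdint>

// Hypothetical model of one 32-bit lane of a sign-bit blend (BLENDV-style),
// using VSELECT operand order: a "true" condition picks lhs.
// Only the top bit of cond is inspected; the other 31 bits are ignored.
constexpr std::uint32_t blendLaneBySignBit(std::uint32_t cond,
                                           std::uint32_t lhs,
                                           std::uint32_t rhs) {
  return (cond & 0x80000000u) != 0 ? lhs : rhs;
}

// Hypothetical model of one lane of a mask-register blend (AVX-512 style).
// The selector is bit `lane` of a 16-bit mask, not part of a vector element.
constexpr std::uint32_t blendLaneByMaskBit(std::uint16_t kmask, unsigned lane,
                                           std::uint32_t lhs,
                                           std::uint32_t rhs) {
  return ((kmask >> lane) & 1u) != 0 ? lhs : rhs;
}
```

Because the first form ignores the low bits, a condition whose low bits have been simplified away (for example 0x80000000 instead of 0xFFFFFFFF) still selects correctly. The second form has no vector sign bit to consult, which is presumably why the patch bails out for 512-bit types.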

On a side note, it looks like there may be a missed opportunity for constant folding TESTM when LHS and RHS are equal.
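
As a hedged illustration of that side note: VPTESTM sets mask bit i when the bitwise AND of lane i of the two sources is non-zero, so with identical operands it reduces to a per-lane "is non-zero" test, and with a constant operand the resulting mask is known at compile time. The scalar model below uses hypothetical names and is not LLVM code.

```cpp
#include <cstdint>

// Hypothetical scalar model of VPTESTMQ on eight 64-bit lanes:
// mask bit i is set when (a[i] & b[i]) != 0.
constexpr std::uint8_t testmQ(const std::uint64_t (&a)[8],
                              const std::uint64_t (&b)[8]) {
  std::uint8_t mask = 0;
  for (unsigned i = 0; i < 8; ++i) {
    if ((a[i] & b[i]) != 0)
      mask = static_cast<std::uint8_t>(mask | (1u << i));
  }
  return mask;
}

// An all-ones vector, like the one vpternlogd $255 materializes in the test.
constexpr std::uint64_t kAllOnes[8] = {~0ull, ~0ull, ~0ull, ~0ull,
                                       ~0ull, ~0ull, ~0ull, ~0ull};

// With identical all-ones operands every lane is non-zero, so the whole
// mask can be computed at compile time.
constexpr std::uint8_t kFoldedMask = testmQ(kAllOnes, kAllOnes);
```

In this model `kFoldedMask` is 0xFF, meaning every lane is selected, so the `vptestmq %zmm0, %zmm0, %k1` in the test output is computing a value that could in principle be folded away.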

This fixes PR34139.

Event Timeline

craig.topper created this revision. Aug 21 2017, 5:51 PM

delena added a subscriber: delena. Aug 22 2017, 11:37 PM

delena added inline comments.

test/CodeGen/X86/pr34139.ll
14	Could you please explain how a <16 x double> value is stored using one ZMM instruction?

zvi added inline comments. Aug 23 2017, 6:40 AM

lib/Target/X86/X86ISelLowering.cpp
30679	Any chance that due to the added bail-out we will be missing out on this combine?

craig.topper added inline comments. Aug 23 2017, 9:11 AM

lib/Target/X86/X86ISelLowering.cpp
30679	This combine runs in the very last DAG combine. The one above runs in an earlier DAG combine, so I don't think there's an issue. If there were, I think the early out on BitWidth==1 above would be much worse.
test/CodeGen/X86/pr34139.ll
14	I think it's because the IR is using a store to undef as its address, so I think we sort of merged the stores. If I put in a real address we get two stores. I'll try to unreduce the test case a little.

Use a less reduced test case so that we still get multiple stores

delena accepted this revision. Aug 23 2017, 9:28 AM

This revision is now accepted and ready to land. Aug 23 2017, 9:28 AM

Closed by commit rL311572: [AVX512] Don't create SHRUNKBLEND SDNodes for 512-bit vectors (authored by ctopper). Aug 23 2017, 9:42 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                  Size
lib/Target/X86/X86ISelLowering.cpp    3 lines
test/CodeGen/X86/pr34139.ll           24 lines

Diff 112385

lib/Target/X86/X86ISelLowering.cpp

[30,623 lines not shown; enclosing block: if (N->getOpcode() == ISD::VSELECT && DCI.isBeforeLegalizeOps() &&]
     if (VT.getVectorElementType() == MVT::i16)
       return SDValue();
     // Dynamic blending was only available from SSE4.1 onward.
     if (VT.is128BitVector() && !Subtarget.hasSSE41())
       return SDValue();
     // Byte blends are only available in AVX2
     if (VT == MVT::v32i8 && !Subtarget.hasAVX2())
       return SDValue();
+    // There are no 512-bit blend instructions that use sign bits.
+    if (VT.is512BitVector())
+      return SDValue();

     assert(BitWidth >= 8 && BitWidth <= 64 && "Invalid mask size");
     APInt DemandedMask(APInt::getSignMask(BitWidth));
     KnownBits Known;
     TargetLowering::TargetLoweringOpt TLO(DAG, !DCI.isBeforeLegalize(),
                                           !DCI.isBeforeLegalizeOps());
     if (TLI.ShrinkDemandedConstant(Cond, DemandedMask, TLO) ||
         TLI.SimplifyDemandedBits(Cond, DemandedMask, Known, TLO)) {
[28 lines not shown; enclosing block: if (TLI.ShrinkDemandedConstant(Cond, DemandedMask, TLO) ||]
       // changed. Change the condition just for N to keep the opportunity to
       // optimize all other users their own way.
       SDValue SB = DAG.getNode(X86ISD::SHRUNKBLEND, DL, VT, TLO.New, LHS, RHS);
       DAG.ReplaceAllUsesOfValueWith(SDValue(N, 0), SB);
       return SDValue();
     }
   }

   // Look for vselects with LHS/RHS being bitcasted from an operation that
   // can be executed on another type. Push the bitcast to the inputs of
   // the operation. This exposes opportunities for using masking instructions.
   if (N->getOpcode() == ISD::VSELECT && DCI.isAfterLegalizeVectorOps() &&
       CondVT.getVectorElementType() == MVT::i1) {
     if (combineBitcastForMaskedOp(LHS, DAG, DCI))
       return SDValue(N, 0);
     if (combineBitcastForMaskedOp(RHS, DAG, DCI))
       return SDValue(N, 0);
[6,120 lines not shown]
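
The early-outs visible in this hunk can be summarised as a standalone predicate. This is a simplified sketch under stated assumptions: the `VecType` and `Features` structs and the function name are hypothetical stand-ins for LLVM's value types and subtarget queries, and the checks in the lines not shown (such as the BitWidth == 1 early-out mentioned in the review) are deliberately left out.

```cpp
// Hypothetical stand-in for a vector value type (not LLVM's EVT/MVT).
struct VecType {
  unsigned ElemBits;  // width of one element in bits
  unsigned NumElems;  // number of elements
  bool IsFloat;       // true for floating-point elements
  constexpr unsigned totalBits() const { return ElemBits * NumElems; }
};

// Hypothetical stand-in for the subtarget feature queries used in the hunk.
struct Features {
  bool HasSSE41;
  bool HasAVX2;
};

// Returns true when none of the early-outs shown in the hunk fires.
constexpr bool maySignBitBlend(VecType VT, Features F) {
  if (!VT.IsFloat && VT.ElemBits == 16)
    return false;  // i16 elements
  if (VT.totalBits() == 128 && !F.HasSSE41)
    return false;  // dynamic blending needs SSE4.1
  if (!VT.IsFloat && VT.ElemBits == 8 && VT.NumElems == 32 && !F.HasAVX2)
    return false;  // v32i8 byte blends need AVX2
  if (VT.totalBits() == 512)
    return false;  // the check added by this patch
  return true;
}
```

For the test case in this revision, <16 x double> is wider than a ZMM register; the generated code uses two ZMM loads and stores, so it is presumably split into two 512-bit halves, each of which now takes the new early-out.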

test/CodeGen/X86/pr34139.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=knl | FileCheck %s

define void @f_f(<16 x double>* %ptr) {
; CHECK-LABEL: f_f:
; CHECK: # BB#0:
; CHECK-NEXT: vpcmpeqd %xmm0, %xmm0, %xmm0
; CHECK-NEXT: vmovdqa %xmm0, (%rax)
; CHECK-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0
; CHECK-NEXT: vmovapd (%rdi), %zmm1
; CHECK-NEXT: vmovapd 64(%rdi), %zmm2
; CHECK-NEXT: vptestmq %zmm0, %zmm0, %k1
; CHECK-NEXT: vmovapd %zmm0, %zmm1 {%k1}
; CHECK-NEXT: vmovapd %zmm0, %zmm2 {%k1}
; CHECK-NEXT: vmovapd %zmm2, 64(%rdi)
; CHECK-NEXT: vmovapd %zmm1, (%rdi)
  store <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8>* undef
  %load_mask8.i.i.i = load <16 x i8>, <16 x i8>* undef
  %v.i.i.i.i = load <16 x double>, <16 x double>* %ptr
  %mask_vec_i1.i.i.i51.i.i = icmp ne <16 x i8> %load_mask8.i.i.i, zeroinitializer
  %v1.i.i.i.i = select <16 x i1> %mask_vec_i1.i.i.i51.i.i, <16 x double> undef, <16 x double> %v.i.i.i.i
  store <16 x double> %v1.i.i.i.i, <16 x double>* %ptr
  unreachable
}