This is an archive of the discontinued LLVM Phabricator instance.

[x86] use demanded bits to simplify masked store codegen
Closed · Public

Authored by spatel on Oct 6 2018, 7:30 AM.

Details

Reviewers

craig.topper
RKSimon
andreadb

Commits

rGf5fac1826a86: [x86] use demanded bits to simplify masked store codegen
rL344048: [x86] use demanded bits to simplify masked store codegen

Summary

As noted in D52747, if we prefer IR to use trunc for bool vectors rather than and+icmp, we can expose codegen shortcomings as seen here with masked store.

We can replace a hard-coded PCMPGT simplification with the more general demanded bits call here to improve things. The AVX1 pattern still isn't handled, so that's another potential dependency for the instcombine patch (although I'm not sure how much masked op usage we prefer with only AVX1).
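To make the "only the MSB of each lane is demanded" point concrete, here is a minimal sketch in plain C++ (not LLVM code). It assumes a toy model of an AVX-style masked store in which a lane is written only when the sign bit of its mask element is set. The names `Vec4`, `maskedStore`, `shiftLeft31`, `arithShiftRight31`, `ashrIsRedundant`, and `checkExample` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model of 4 x i32 lanes. This is an illustration, not LLVM code.
using Vec4 = std::array<uint32_t, 4>;

// Model of an AVX-style masked store: a lane is written only when the most
// significant bit (sign bit) of its mask element is set.
Vec4 maskedStore(Vec4 Mem, const Vec4 &Val, const Vec4 &Mask) {
  for (int I = 0; I < 4; ++I)
    if (Mask[I] & 0x80000000u) // only the sign bit is inspected
      Mem[I] = Val[I];
  return Mem;
}

// A legalized "trunc to i1": move bit 0 into the sign bit (like vpslld $31).
Vec4 shiftLeft31(Vec4 V) {
  for (uint32_t &E : V)
    E <<= 31;
  return V;
}

// Splat the sign bit across the whole lane (like vpsrad $31), written without
// relying on signed right-shift behavior.
Vec4 arithShiftRight31(Vec4 V) {
  for (uint32_t &E : V)
    E = (E & 0x80000000u) ? 0xFFFFFFFFu : 0u;
  return V;
}

// Returns true if dropping the arithmetic shift leaves the stored result
// unchanged for the given mask source.
bool ashrIsRedundant(const Vec4 &Mem, const Vec4 &Val, const Vec4 &MaskSrc) {
  Vec4 Shl = shiftLeft31(MaskSrc);
  return maskedStore(Mem, Val, arithShiftRight31(Shl)) ==
         maskedStore(Mem, Val, Shl);
}

// Small self-check: lanes whose mask source has bit 0 set are stored.
void checkExample() {
  Vec4 Mem = {0u, 0u, 0u, 0u};
  Vec4 Val = {10u, 20u, 30u, 40u};
  Vec4 MaskSrc = {1u, 0u, 3u, 2u};
  assert(ashrIsRedundant(Mem, Val, MaskSrc));
}
```

In this model the arithmetic shift only changes bits that the store never reads, which is the kind of redundancy a demanded-bits query is meant to find.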

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Oct 6 2018, 7:30 AM

Herald added a subscriber: mcrosier. Oct 6 2018, 7:30 AM

spatel added a child revision: D52747: [InstCombine] reverse 'trunc X to <N x i1>' canonicalization. Oct 6 2018, 7:32 AM

RKSimon added inline comments. Oct 6 2018, 7:38 AM

test/CodeGen/X86/masked_memop.ll
1282 ↗	(On Diff #168573)	We're going to have to add SimplifyDemandedBitsForTargetNode to handle this properly - @craig.topper didn't you have a patch that was going to add that at some point?

craig.topper added inline comments. Oct 6 2018, 3:23 PM

lib/Target/X86/X86ISelLowering.cpp
36499 ↗	(On Diff #168573)	Can you drop the 2 spaces at the start of this blank line.
test/CodeGen/X86/masked_memop.ll
1282 ↗	(On Diff #168573)	It was in https://reviews.llvm.org/D38832, but I found a simpler approach for that specific case.
1283 ↗	(On Diff #168573)	We might be able to get this without target support if we stop splitting v4i32<-v4i64 sign extends during DAG combine on AVX1 targets. We already handle the split in LowerSIGN_EXTEND, so we shouldn't need to split in combine. The splitting creates a sequence we can't run SimplifyDemandedBits through because we end up with 2 uses of the v4i32 input.

RKSimon added inline comments. Oct 7 2018, 6:15 AM

test/CodeGen/X86/masked_memop.ll
1283 ↗	(On Diff #168573)	Looking at this now - the SEXT isn't a problem, but ZEXT needs to be handled as well and is proving trickier.

Patch updated:
Fixed whitespace diff.

RKSimon mentioned this in D52970: [X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors. Oct 7 2018, 6:54 AM

RKSimon mentioned this in D52980: [X86][AVX1] Enable *_EXTEND_VECTOR_INREG lowering of 256-bit vectors. Oct 8 2018, 4:12 AM

RKSimon added inline comments. Oct 8 2018, 4:40 AM

lib/Target/X86/X86ISelLowering.cpp
36527 ↗	(On Diff #168588)	No need for the Depth = 0 at the end

Patch updated:
Removed unnecessary depth param (copy-paste remnant).

Also, I confirmed that the combination of this patch + D52980 will remove the extra AVX1 shift from masked_store_bool_mask_demand_trunc_sext().

RKSimon mentioned this in rL343991: [X86][AVX2] Enable ZERO_EXTEND_VECTOR_INREG lowering of 256-bit vectors. Oct 8 2018, 11:42 AM

RKSimon mentioned this in rL344019: [X86][AVX1] Enable *_EXTEND_VECTOR_INREG lowering of 256-bit vectors. Oct 9 2018, 12:45 AM

In D52964#1257755, @spatel wrote:

Also, I confirmed that the combination of this patch + D52980 will remove the extra AVX1 shift from masked_store_bool_mask_demand_trunc_sext().

D52980 has now landed, but I had to reduce it slightly so we need to confirm if the AVX1 shift is still removed.

LGTM - the AVX1 regression needs SIGN_EXTEND_VECTOR_INREG support adding to SimplifyDemandedBits which I intend to do as a follow up patch
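As a rough illustration of why a sign extension can be looked through when only the sign bit is demanded, the sketch below models one 32-bit to 64-bit lane in plain C++. It is not the code from the follow-up patch; `signExtend32To64`, `demandedSourceBits`, and `checkSignBitPreserved` are hypothetical names, and the mapping shown is only the general idea under that assumption.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend one 32-bit lane to 64 bits (one lane of a vpmovsxdq-style op),
// written with explicit bit operations.
constexpr uint64_t signExtend32To64(uint32_t X) {
  return (X & 0x80000000u) ? (0xFFFFFFFF00000000ull | X)
                           : static_cast<uint64_t>(X);
}

// Map the bits demanded from the 64-bit result back onto the 32-bit source.
// The low 32 bits map one-to-one; every high bit is a copy of the source sign
// bit, so demanding any of them demands only that one source bit.
constexpr uint32_t demandedSourceBits(uint64_t DemandedResult) {
  return static_cast<uint32_t>(DemandedResult) |
         ((DemandedResult >> 32) != 0 ? 0x80000000u : 0u);
}

// Demanding only the MSB of the result demands only the MSB of the source.
static_assert(demandedSourceBits(0x8000000000000000ull) == 0x80000000u,
              "sign bit of the result maps to sign bit of the source");

// Runtime check that the result's sign bit always equals the source's.
void checkSignBitPreserved(uint32_t X) {
  assert((signExtend32To64(X) >> 63) == (X >> 31));
}
```

Under this model, a shift that only splats the sign bit before the extension does not affect the one source bit that is demanded, so it can be dropped.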

This revision is now accepted and ready to land. Oct 9 2018, 5:27 AM

RKSimon mentioned this in rL344043: [SelectionDAG] Add SIGN_EXTEND_VECTOR_INREG and CONCAT_VECTORS support to…. Oct 9 2018, 6:15 AM

In D52964#1258721, @RKSimon wrote:

LGTM - the AVX1 regression needs SIGN_EXTEND_VECTOR_INREG support adding to SimplifyDemandedBits which I intend to do as a follow up patch

For reference, that was committed here:
rL344043

...so I'll update the test file and commit soon.

Closed by commit rL344048: [x86] use demanded bits to simplify masked store codegen (authored by spatel). Oct 9 2018, 7:06 AM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in D52747: [InstCombine] reverse 'trunc X to <N x i1>' canonicalization. Oct 9 2018, 7:10 AM

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	33 lines
llvm/trunk/test/CodeGen/X86/masked_memop.ll	6 lines

Diff 168791

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ (36,515 lines not shown) @@ SDValue Extract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT,
                                 MS->getValue(), VecIndex);

   // Store that element at the appropriate offset from the base pointer.
   return DAG.getStore(MS->getChain(), DL, Extract, Addr, MS->getPointerInfo(),
                       Alignment, MS->getMemOperand()->getFlags());
 }

 static SDValue combineMaskedStore(SDNode *N, SelectionDAG &DAG,
+                                  TargetLowering::DAGCombinerInfo &DCI,
                                   const X86Subtarget &Subtarget) {
   MaskedStoreSDNode *Mst = cast<MaskedStoreSDNode>(N);

   if (Mst->isCompressingStore())
     return SDValue();

+  EVT VT = Mst->getValue().getValueType();
   if (!Mst->isTruncatingStore()) {
     if (SDValue ScalarStore = reduceMaskedStoreToScalarStore(Mst, DAG))
       return ScalarStore;

-    // If the mask is checking (0 > X), we're creating a vector with all-zeros
-    // or all-ones elements based on the sign bits of X. AVX1 masked store only
-    // cares about the sign bit of each mask element, so eliminate the compare:
-    // mstore val, ptr, (pcmpgt 0, X) --> mstore val, ptr, X
-    // Note that by waiting to match an x86-specific PCMPGT node, we're
-    // eliminating potentially more complex matching of a setcc node which has
-    // a full range of predicates.
+    // If the mask value has been legalized to a non-boolean vector, try to
+    // simplify ops leading up to it. We only demand the MSB of each lane.
     SDValue Mask = Mst->getMask();
-    if (Mask.getOpcode() == X86ISD::PCMPGT &&
-        ISD::isBuildVectorAllZeros(Mask.getOperand(0).getNode())) {
-      assert(Mask.getValueType() == Mask.getOperand(1).getValueType() &&
-             "Unexpected type for PCMPGT");
-      return DAG.getMaskedStore(
-          Mst->getChain(), SDLoc(N), Mst->getValue(), Mst->getBasePtr(),
-          Mask.getOperand(1), Mst->getMemoryVT(), Mst->getMemOperand());
+    if (Mask.getScalarValueSizeInBits() != 1) {
+      TargetLowering::TargetLoweringOpt TLO(DAG, !DCI.isBeforeLegalize(),
+                                            !DCI.isBeforeLegalizeOps());
+      const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+      APInt DemandedMask(APInt::getSignMask(VT.getScalarSizeInBits()));
+      KnownBits Known;
+      if (TLI.SimplifyDemandedBits(Mask, DemandedMask, Known, TLO)) {
+        DCI.AddToWorklist(Mask.getNode());
+        DCI.CommitTargetLoweringOpt(TLO);
+        return SDValue(N, 0);
+      }
     }

     // TODO: AVX512 targets should also be able to simplify something like the
     // pattern above, but that pattern will be different. It will either need to
     // match setcc more generally or match PCMPGTM later (in tablegen?).

     return SDValue();
   }

   // Resolve truncating stores.
-  EVT VT = Mst->getValue().getValueType();
   unsigned NumElems = VT.getVectorNumElements();
   EVT StVT = Mst->getMemoryVT();
   SDLoc dl(Mst);

   assert(StVT != VT && "Cannot truncate to the same type");
   unsigned FromSz = VT.getScalarSizeInBits();
   unsigned ToSz = StVT.getScalarSizeInBits();

@@ (3,809 lines not shown) @@ SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::SRL: return combineShift(N, DAG, DCI, Subtarget);
   case ISD::AND: return combineAnd(N, DAG, DCI, Subtarget);
   case ISD::OR: return combineOr(N, DAG, DCI, Subtarget);
   case ISD::XOR: return combineXor(N, DAG, DCI, Subtarget);
   case X86ISD::BEXTR: return combineBEXTR(N, DAG, DCI, Subtarget);
   case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);
   case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);
   case ISD::STORE: return combineStore(N, DAG, Subtarget);
-  case ISD::MSTORE: return combineMaskedStore(N, DAG, Subtarget);
+  case ISD::MSTORE: return combineMaskedStore(N, DAG, DCI, Subtarget);
   case ISD::SINT_TO_FP: return combineSIntToFP(N, DAG, Subtarget);
   case ISD::UINT_TO_FP: return combineUIntToFP(N, DAG, Subtarget);
   case ISD::FADD:
   case ISD::FSUB: return combineFaddFsub(N, DAG, Subtarget);
   case ISD::FNEG: return combineFneg(N, DAG, Subtarget);
   case ISD::TRUNCATE: return combineTruncate(N, DAG, Subtarget);
   case X86ISD::ANDNP: return combineAndnp(N, DAG, DCI, Subtarget);
   case X86ISD::FAND: return combineFAnd(N, DAG, Subtarget);
@@ (1,182 lines not shown) @@

llvm/trunk/test/CodeGen/X86/masked_memop.ll

@@ (1,272 lines not shown) @@
 ; SKX-NEXT: vpcmpgtd %xmm2, %xmm1, %k1
 ; SKX-NEXT: vmovups %xmm0, (%rdi) {%k1}
 ; SKX-NEXT: retq
   %bool_mask = icmp slt <4 x i32> %mask, zeroinitializer
   call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %x, <4 x float>* %ptr, i32 1, <4 x i1> %bool_mask)
   ret void
 }

-; TODO: SimplifyDemandedBits should eliminate an ashr here.
+; SimplifyDemandedBits eliminates an ashr here.

 define void @masked_store_bool_mask_demand_trunc_sext(<4 x double> %x, <4 x double>* %p, <4 x i32> %masksrc) {
 ; AVX1-LABEL: masked_store_bool_mask_demand_trunc_sext:
 ; AVX1: ## %bb.0:
 ; AVX1-NEXT: vpslld $31, %xmm1, %xmm1
-; AVX1-NEXT: vpsrad $31, %xmm1, %xmm1
 ; AVX1-NEXT: vpmovsxdq %xmm1, %xmm2
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,0,1]
 ; AVX1-NEXT: vpmovsxdq %xmm1, %xmm1
 ; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
 ; AVX1-NEXT: vmaskmovpd %ymm0, %ymm1, (%rdi)
 ; AVX1-NEXT: vzeroupper
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: masked_store_bool_mask_demand_trunc_sext:
 ; AVX2: ## %bb.0:
 ; AVX2-NEXT: vpslld $31, %xmm1, %xmm1
-; AVX2-NEXT: vpsrad $31, %xmm1, %xmm1
 ; AVX2-NEXT: vpmovsxdq %xmm1, %ymm1
 ; AVX2-NEXT: vmaskmovpd %ymm0, %ymm1, (%rdi)
 ; AVX2-NEXT: vzeroupper
 ; AVX2-NEXT: retq
 ;
 ; AVX512F-LABEL: masked_store_bool_mask_demand_trunc_sext:
 ; AVX512F: ## %bb.0:
 ; AVX512F-NEXT: ## kill: def $ymm0 killed $ymm0 def $zmm0
@@ (25 lines not shown) @@
 ; AVX1-LABEL: widen_masked_store:
 ; AVX1: ## %bb.0:
 ; AVX1-NEXT: vmovd %edx, %xmm1
 ; AVX1-NEXT: vmovd %esi, %xmm2
 ; AVX1-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
 ; AVX1-NEXT: vmovd %ecx, %xmm2
 ; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; AVX1-NEXT: vpslld $31, %xmm1, %xmm1
-; AVX1-NEXT: vpsrad $31, %xmm1, %xmm1
 ; AVX1-NEXT: vmaskmovps %xmm0, %xmm1, (%rdi)
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: widen_masked_store:
 ; AVX2: ## %bb.0:
 ; AVX2-NEXT: vmovd %edx, %xmm1
 ; AVX2-NEXT: vmovd %esi, %xmm2
 ; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
 ; AVX2-NEXT: vmovd %ecx, %xmm2
 ; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm1 = xmm1[0],xmm2[0]
 ; AVX2-NEXT: vpslld $31, %xmm1, %xmm1
-; AVX2-NEXT: vpsrad $31, %xmm1, %xmm1
 ; AVX2-NEXT: vpmaskmovd %xmm0, %xmm1, (%rdi)
 ; AVX2-NEXT: retq
 ;
 ; AVX512F-LABEL: widen_masked_store:
 ; AVX512F: ## %bb.0:
 ; AVX512F-NEXT: ## kill: def $xmm0 killed $xmm0 def $zmm0
 ; AVX512F-NEXT: vpslld $31, %xmm1, %xmm1
 ; AVX512F-NEXT: vptestmd %zmm1, %zmm1, %k1
@@ (77 lines not shown) @@