This is an archive of the discontinued LLVM Phabricator instance.

[x86] convert masked store of one element to scalar store
Closed, Public

Authored by spatel on Feb 2 2016, 3:28 PM.

Details

Reviewers

RKSimon
delena
igorb

Commits

rG264d7e5b685a: [x86] convert masked store of one element to scalar store
rL260145: [x86] convert masked store of one element to scalar store

Summary

Another opportunity to reduce masked stores: in D16691, we decided not to attempt the 'one mask element is set' transform in InstCombine, but I think this should be a win for any AVX machine.
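As a rough illustration of why the transform is sound, the sketch below models a masked store and the proposed scalar replacement on plain arrays and checks that both leave memory in the same state when exactly one mask element is set. It is a standalone model, not LLVM code: `maskedStore`, `scalarStoreOfLane`, and `sameResultForLane` are hypothetical names introduced only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical reference model of a masked vector store: lane I of Val is
// written to Mem[I] only when Mask[I] is true.
template <typename T, std::size_t N>
void maskedStore(const std::array<T, N> &Val, T *Mem,
                 const std::array<bool, N> &Mask) {
  for (std::size_t I = 0; I < N; ++I)
    if (Mask[I])
      Mem[I] = Val[I];
}

// The reduced form: extract one lane and store it at base + lane index.
template <typename T, std::size_t N>
void scalarStoreOfLane(const std::array<T, N> &Val, T *Mem, std::size_t Lane) {
  Mem[Lane] = Val[Lane];
}

// Returns true when both forms leave memory in the same state for a 4-lane
// mask in which only `Lane` is set.
inline bool sameResultForLane(std::size_t Lane) {
  const std::array<int, 4> Val = {10, 20, 30, 40};
  std::array<bool, 4> Mask = {false, false, false, false};
  Mask[Lane] = true;
  std::array<int, 4> MemA = {-1, -2, -3, -4};
  std::array<int, 4> MemB = MemA;
  maskedStore(Val, MemA.data(), Mask);
  scalarStoreOfLane(Val, MemB.data(), Lane);
  return MemA == MemB;
}
```

The model ignores volatility, non-temporal hints, and alignment, all of which the actual patch carries over from the masked store to the scalar store.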

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 46713. Feb 2 2016, 3:28 PM

spatel retitled this revision to [x86] convert masked store of one element to scalar store.

spatel updated this object.

spatel added reviewers: delena, igorb, RKSimon.

spatel added a subscriber: llvm-commits.

Herald added a subscriber: mcrosier. Feb 2 2016, 3:28 PM

Please can you add tests for i64 or f64 vectors as well? Also cases where the element is not from the lower 128-bits of a ymm/zmm.
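To make the request concrete, here is a small sketch of the arithmetic that decides which cases are interesting: for a given element index and element width, it computes the byte offset of the scalar store and the 128-bit lane that has to be extracted first. The `EltLocation` struct and `locateElt` function are hypothetical helpers for illustration and are not part of the patch.

```cpp
#include <cassert>

// Hypothetical helper describing where one vector element lives.
struct EltLocation {
  unsigned ByteOffset; // offset of the element from the vector's base address
  unsigned Lane128;    // index of the 128-bit lane that holds the element
};

// Assumes EltBits is a multiple of 8 and that elements are laid out
// contiguously in little-endian lane order, as on x86.
inline EltLocation locateElt(unsigned EltIndex, unsigned EltBits) {
  EltLocation Loc;
  Loc.ByteOffset = EltIndex * (EltBits / 8);
  Loc.Lane128 = (EltIndex * EltBits) / 128;
  return Loc;
}
```

For example, element 2 of a <4 x i64> sits at byte offset 16 in lane 1, and element 6 of an <8 x double> sits at byte offset 48 in lane 3, so neither can be stored straight from the low xmm register.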

Some minor comments. I wonder if this could be moved to DAGCombiner - possibly with a test for scalar store legality and TLI.isExtractVectorElementCheap()?

lib/Target/X86/X86ISelLowering.cpp
Line 26727 (on Diff #46713): This can probably be brought inside reduceMaskedStoreToScalarStore as a helper predicate.
Line 26730 (on Diff #46713): Don't we have cases where the build vector input operands are implicitly truncated to i1? For instance, vector constant folding is likely to have created legal types that are larger.

In D16828#343021, @RKSimon wrote:

Some minor comments. I wonder if this could be moved to DAGCombiner - possibly with a test for scalar store legality and TLI.isExtractVectorElementCheap()?

I know that on AVX, VMASKMOV does not imply the non-temporal hint. However, what about targets that implement masked stores as non-temporal stores of selected bytes? For those targets, not preserving the non-temporal hint when converting the mask store into a scalar store would affect the hardware write combining logic.
That said, my knowledge is limited to x86, so I cannot say whether such targets exist. What I am trying to say is that we probably have to be careful if we decide to move this transformation into the DAGCombiner, as different targets may expand those nodes in slightly different ways (not sure if this makes sense).
Basically, I recommend leaving it for now as a target-specific combine (just my opinion).

In D16828#343149, @andreadb wrote:

In D16828#343021, @RKSimon wrote:

Some minor comments. I wonder if this could be moved to DAGCombiner - possibly with a test for scalar store legality and TLI.isExtractVectorElementCheap()?

I know that on AVX, VMASKMOV does not imply the non-temporal hint. However, what about targets that implement masked stores as non-temporal stores of selected bytes? For those targets, not preserving the non-temporal hint when converting the mask store into a scalar store would affect the hardware write combining logic.

I have preserved the non-temporal hint of the original masked store in the scalar store with this patch, but...

That said, my knowledge is limited to x86, so I cannot say whether such targets exist. What I am trying to say is that we probably have to be careful if we decide to move this transformation into the DAGCombiner, as different targets may expand those nodes in slightly different ways (not sure if this makes sense).
Basically, I recommend leaving it for now as a target-specific combine (just my opinion).

That was my thinking too. Let's make sure this is sane across x86 first. There's a lot of potential variation in how a masked op is implemented. I'll add a 'TODO' comment to remind us that this could be lifted to DAGCombiner once we have more confidence.

spatel added inline comments. Feb 3 2016, 12:17 PM

lib/Target/X86/X86ISelLowering.cpp
Line 26727 (on Diff #46713): Yes - good idea. My lambda skills are non-existent. Please double-check in the updated patch.
Line 26730 (on Diff #46713): I'm not sure about the interaction between legalization and the IR intrinsic. I was expecting that we always get here with the IR-defined form. How about I add another 'TODO' for now? I think supporting different mask types may require careful logic because ISD::MSTORE doesn't actually define the mask format AFAICT.
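The mask-format question can be illustrated with a small sketch. The TODO in the patch notes that the x86 hardware instructions only require the MSB of each mask element to be set, whereas the patch checks for an all-ones constant. The two predicates below show how those checks diverge once mask elements are wider than i1. The 8-bit element width and the names `isAllOnes` and `isMSBSet` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstdint>

// Strict check, analogous to testing for an all-ones constant: every bit of
// the (assumed 8-bit) mask element must be set.
inline bool isAllOnes(std::uint8_t Elt) { return Elt == 0xFF; }

// Looser check: only the most significant bit of the mask element matters.
inline bool isMSBSet(std::uint8_t Elt) { return (Elt & 0x80) != 0; }
```

A value such as 0x80 passes the MSB check but fails the all-ones check, which is why loosening the i1 restriction would first need an agreed definition of the mask format.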

Patch updated:

Converted getOneTrueElt() into a lambda.
Added TODO comment about hoisting this to DAGCombiner.
Added TODO comment about supporting other mask types (but if we do, what is the format for those masks?)
Added test cases to demonstrate extraction from different types, high elements, and different sizes of vectors.
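For readers without an LLVM checkout, the logic of the getOneTrueElt() lambda can be sketched as standalone C++. This is a simplified model, not the code in the patch: the `MaskElt` enum stands in for the operands of a build vector (constant false, constant true, undef, or a non-constant value), and `getOneTrueEltModel` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one operand of a constant mask build vector.
enum class MaskElt { False, True, Undef, NonConstant };

// Returns the index of the single true element, or -1 if there is no true
// element, more than one true element, or any non-constant operand.
// Undef operands are skipped, mirroring the lambda in the patch.
inline int getOneTrueEltModel(const std::vector<MaskElt> &Mask) {
  int TrueIndex = -1;
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    if (Mask[I] == MaskElt::Undef)
      continue;
    if (Mask[I] == MaskElt::NonConstant)
      return -1;
    if (Mask[I] == MaskElt::True) {
      // A second true element means this is not a one-element store.
      if (TrueIndex >= 0)
        return -1;
      TrueIndex = static_cast<int>(I);
    }
  }
  return TrueIndex;
}
```

An all-false mask also yields -1 here, which matches the patch's expectation that all-zeros and all-ones masks have already been optimized in IR.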

In D16828#343354, @spatel wrote:

Patch updated:

Converted getOneTrueElt() into a lambda.

On second thought, we're going to need that exact function for the sibling patch for masked loads. I suppose it's good form to keep it this way for now, but I think it's going to get pulled back out soon enough.

LGTM - feel free to pull getOneTrueElt back out again before committing if you wish.

This revision is now accepted and ready to land. Feb 8 2016, 9:40 AM

Closed by commit rL260145: [x86] convert masked store of one element to scalar store (authored by spatel). Feb 8 2016, 1:09 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    77 lines
llvm/trunk/test/CodeGen/X86/masked_memop.ll      113 lines

Diff 47243

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[...] static SDValue PerformMLOADCombine(SDNode *N, SelectionDAG &DAG,
   SDValue WideLd = DAG.getMaskedLoad(WideVecVT, dl, Mld->getChain(),
                                      Mld->getBasePtr(), NewMask, WideSrc0,
                                      Mld->getMemoryVT(), Mld->getMemOperand(),
                                      ISD::NON_EXTLOAD);
   SDValue NewVec = DAG.getNode(X86ISD::VSEXT, dl, VT, WideLd);
   return DCI.CombineTo(N, NewVec, WideLd.getValue(1), true);
 }

-/// PerformMSTORECombine - Resolve truncating stores
+/// If exactly one element of the mask is set for a non-truncating masked store,
+/// it is a vector extract and scalar store.
+/// Note: It is expected that the degenerate cases of an all-zeros or all-ones
+/// mask have already been optimized in IR, so we don't bother with those here.
+static SDValue reduceMaskedStoreToScalarStore(MaskedStoreSDNode *MS,
+                                              SelectionDAG &DAG) {
+  // TODO: This is not x86-specific, so it could be lifted to DAGCombiner.
+  // However, some target hooks may need to be added to know when the transform
+  // is profitable. Endianness would also have to be considered.
+
+  // If V is a build vector of boolean constants and exactly one of those
+  // constants is true, return the operand index of that true element.
+  // Otherwise, return -1.
+  auto getOneTrueElt = [](SDValue V) {
+    // This needs to be a build vector of booleans.
+    // TODO: Checking for the i1 type matches the IR definition for the mask,
+    // but the mask check could be loosened to i8 or other types. That might
+    // also require checking more than 'allOnesValue'; eg, the x86 HW
+    // instructions only require that the MSB is set for each mask element.
+    // The ISD::MSTORE comments/definition do not specify how the mask operand
+    // is formatted.
+    auto *BV = dyn_cast<BuildVectorSDNode>(V);
+    if (!BV || BV->getValueType(0).getVectorElementType() != MVT::i1)
+      return -1;
+
+    int TrueIndex = -1;
+    unsigned NumElts = BV->getValueType(0).getVectorNumElements();
+    for (unsigned i = 0; i < NumElts; ++i) {
+      const SDValue &Op = BV->getOperand(i);
+      if (Op.getOpcode() == ISD::UNDEF)
+        continue;
+      auto *ConstNode = dyn_cast<ConstantSDNode>(Op);
+      if (!ConstNode)
+        return -1;
+      if (ConstNode->getAPIntValue().isAllOnesValue()) {
+        // If we already found a one, this is too many.
+        if (TrueIndex >= 0)
+          return -1;
+        TrueIndex = i;
+      }
+    }
+    return TrueIndex;
+  };
+
+  int TrueMaskElt = getOneTrueElt(MS->getMask());
+  if (TrueMaskElt < 0)
+    return SDValue();
+
+  SDLoc DL(MS);
+  EVT VT = MS->getValue().getValueType();
+  EVT EltVT = VT.getVectorElementType();
+
+  // Extract the one scalar element that is actually being stored.
+  SDValue ExtractIndex = DAG.getIntPtrConstant(TrueMaskElt, DL);
+  SDValue Extract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, EltVT,
+                                MS->getValue(), ExtractIndex);
+
+  // Store that element at the appropriate offset from the base pointer.
+  SDValue StoreAddr = MS->getBasePtr();
+  unsigned EltSize = EltVT.getStoreSize();
+  if (TrueMaskElt != 0) {
+    unsigned StoreOffset = TrueMaskElt * EltSize;
+    SDValue StoreOffsetVal = DAG.getIntPtrConstant(StoreOffset, DL);
+    StoreAddr = DAG.getNode(ISD::ADD, DL, StoreAddr.getValueType(), StoreAddr,
+                            StoreOffsetVal);
+  }
+  unsigned Alignment = MinAlign(MS->getAlignment(), EltSize);
+  return DAG.getStore(MS->getChain(), DL, Extract, StoreAddr,
+                      MS->getPointerInfo(), MS->isVolatile(),
+                      MS->isNonTemporal(), Alignment);
+}
+
 static SDValue PerformMSTORECombine(SDNode *N, SelectionDAG &DAG,
                                     const X86Subtarget &Subtarget) {
   MaskedStoreSDNode *Mst = cast<MaskedStoreSDNode>(N);
   if (!Mst->isTruncatingStore())
-    return SDValue();
+    return reduceMaskedStoreToScalarStore(Mst, DAG);

+  // Resolve truncating stores.
   EVT VT = Mst->getValue().getValueType();
   unsigned NumElems = VT.getVectorNumElements();
   EVT StVT = Mst->getMemoryVT();
   SDLoc dl(Mst);

   assert(StVT != VT && "Cannot truncate to the same type");
   unsigned FromSz = VT.getVectorElementType().getSizeInBits();
   unsigned ToSz = StVT.getVectorElementType().getSizeInBits();
[...]

llvm/trunk/test/CodeGen/X86/masked_memop.ll

[...]
 ; SKX-NEXT: kxnorw %k0, %k0, %k1
 ; SKX-NEXT: vmovdqu32 %xmm1, (%rdi) {%k1}
 ; SKX-NEXT: retq
   %mask = icmp eq <4 x i32> %trigger, zeroinitializer
   call void @llvm.masked.store.v4i32(<4 x i32>%val, <4 x i32>* %addr, i32 4, <4 x i1><i1 true, i1 true, i1 true, i1 true>)
   ret void
 }

-define void @test22(<4 x i32> %trigger, <4 x i32>* %addr, <4 x i32> %val) {
-; AVX1-LABEL: test22:
-; AVX1: ## BB#0:
-; AVX1-NEXT: movl $-1, %eax
-; AVX1-NEXT: vmovd %eax, %xmm0
-; AVX1-NEXT: vmaskmovps %xmm1, %xmm0, (%rdi)
-; AVX1-NEXT: retq
-;
-; AVX2-LABEL: test22:
-; AVX2: ## BB#0:
-; AVX2-NEXT: movl $-1, %eax
-; AVX2-NEXT: vmovd %eax, %xmm0
-; AVX2-NEXT: vpmaskmovd %xmm1, %xmm0, (%rdi)
-; AVX2-NEXT: retq
-;
-; AVX512F-LABEL: test22:
-; AVX512F: ## BB#0:
-; AVX512F-NEXT: movl $-1, %eax
-; AVX512F-NEXT: vmovd %eax, %xmm0
-; AVX512F-NEXT: vpmaskmovd %xmm1, %xmm0, (%rdi)
-; AVX512F-NEXT: retq
-;
-; SKX-LABEL: test22:
-; SKX: ## BB#0:
-; SKX-NEXT: movb $1, %al
-; SKX-NEXT: kmovw %eax, %k1
-; SKX-NEXT: vmovdqu32 %xmm1, (%rdi) {%k1}
-; SKX-NEXT: retq
-  %mask = icmp eq <4 x i32> %trigger, zeroinitializer
-  call void @llvm.masked.store.v4i32(<4 x i32>%val, <4 x i32>* %addr, i32 4, <4 x i1><i1 true, i1 false, i1 false, i1 false>)
+; When only one element of the mask is set, reduce to a scalar store.
+
+define void @one_mask_bit_set1(<4 x i32>* %addr, <4 x i32> %val) {
+; AVX-LABEL: one_mask_bit_set1:
+; AVX: ## BB#0:
+; AVX-NEXT: vmovd %xmm0, (%rdi)
+; AVX-NEXT: retq
+;
+; AVX512-LABEL: one_mask_bit_set1:
+; AVX512: ## BB#0:
+; AVX512-NEXT: vmovd %xmm0, (%rdi)
+; AVX512-NEXT: retq
+  call void @llvm.masked.store.v4i32(<4 x i32> %val, <4 x i32>* %addr, i32 4, <4 x i1><i1 true, i1 false, i1 false, i1 false>)
+  ret void
+}
+
+; Choose a different element to show that the correct address offset is produced.
+
+define void @one_mask_bit_set2(<4 x float>* %addr, <4 x float> %val) {
+; AVX-LABEL: one_mask_bit_set2:
+; AVX: ## BB#0:
+; AVX-NEXT: vextractps $2, %xmm0, 8(%rdi)
+; AVX-NEXT: retq
+;
+; AVX512-LABEL: one_mask_bit_set2:
+; AVX512: ## BB#0:
+; AVX512-NEXT: vextractps $2, %xmm0, 8(%rdi)
+; AVX512-NEXT: retq
+  call void @llvm.masked.store.v4f32(<4 x float> %val, <4 x float>* %addr, i32 4, <4 x i1><i1 false, i1 false, i1 true, i1 false>)
+  ret void
+}
+
+; Choose a different scalar type and a high element of a 256-bit vector because AVX doesn't support those evenly.
+
+define void @one_mask_bit_set3(<4 x i64>* %addr, <4 x i64> %val) {
+; AVX-LABEL: one_mask_bit_set3:
+; AVX: ## BB#0:
+; AVX-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT: vmovlps %xmm0, 16(%rdi)
+; AVX-NEXT: vzeroupper
+; AVX-NEXT: retq
+;
+; AVX512-LABEL: one_mask_bit_set3:
+; AVX512: ## BB#0:
+; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX512-NEXT: vmovq %xmm0, 16(%rdi)
+; AVX512-NEXT: retq
+  call void @llvm.masked.store.v4i64(<4 x i64> %val, <4 x i64>* %addr, i32 4, <4 x i1><i1 false, i1 false, i1 true, i1 false>)
+  ret void
+}
+
+; Choose a different scalar type and a high element of a 256-bit vector because AVX doesn't support those evenly.
+
+define void @one_mask_bit_set4(<4 x double>* %addr, <4 x double> %val) {
+; AVX-LABEL: one_mask_bit_set4:
+; AVX: ## BB#0:
+; AVX-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX-NEXT: vmovhpd %xmm0, 24(%rdi)
+; AVX-NEXT: vzeroupper
+; AVX-NEXT: retq
+;
+; AVX512-LABEL: one_mask_bit_set4:
+; AVX512: ## BB#0:
+; AVX512-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX512-NEXT: vmovhpd %xmm0, 24(%rdi)
+; AVX512-NEXT: retq
+  call void @llvm.masked.store.v4f64(<4 x double> %val, <4 x double>* %addr, i32 4, <4 x i1><i1 false, i1 false, i1 false, i1 true>)
+  ret void
+}
+
+; Try a 512-bit vector to make sure AVX doesn't die and AVX512 works as expected.
+
+define void @one_mask_bit_set5(<8 x double>* %addr, <8 x double> %val) {
+; AVX-LABEL: one_mask_bit_set5:
+; AVX: ## BB#0:
+; AVX-NEXT: vextractf128 $1, %ymm1, %xmm0
+; AVX-NEXT: vmovlps %xmm0, 48(%rdi)
+; AVX-NEXT: vzeroupper
+; AVX-NEXT: retq
+;
+; AVX512-LABEL: one_mask_bit_set5:
+; AVX512: ## BB#0:
+; AVX512-NEXT: vextractf32x4 $3, %zmm0, %xmm0
+; AVX512-NEXT: vmovlpd %xmm0, 48(%rdi)
+; AVX512-NEXT: retq
+  call void @llvm.masked.store.v8f64(<8 x double> %val, <8 x double>* %addr, i32 4, <8 x i1><i1 false, i1 false, i1 false, i1 false, i1 false, i1 false, i1 true, i1 false>)
   ret void
 }

 declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)
 declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)
 declare <2 x i32> @llvm.masked.load.v2i32(<2 x i32>*, i32, <2 x i1>, <2 x i32>)
 declare void @llvm.masked.store.v16i32(<16 x i32>, <16 x i32>*, i32, <16 x i1>)
 declare void @llvm.masked.store.v8i32(<8 x i32>, <8 x i32>*, i32, <8 x i1>)
 declare void @llvm.masked.store.v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)
+declare void @llvm.masked.store.v4i64(<4 x i64>, <4 x i64>*, i32, <4 x i1>)
 declare void @llvm.masked.store.v2f32(<2 x float>, <2 x float>*, i32, <2 x i1>)
 declare void @llvm.masked.store.v2i32(<2 x i32>, <2 x i32>*, i32, <2 x i1>)
+declare void @llvm.masked.store.v4f32(<4 x float>, <4 x float>*, i32, <4 x i1>)
 declare void @llvm.masked.store.v16f32(<16 x float>, <16 x float>*, i32, <16 x i1>)
 declare void @llvm.masked.store.v16f32p(<16 x float>, <16 x float>*, i32, <16 x i1>)
 declare <16 x float> @llvm.masked.load.v16f32(<16 x float>*, i32, <16 x i1>, <16 x float>)
 declare <8 x float> @llvm.masked.load.v8f32(<8 x float>*, i32, <8 x i1>, <8 x float>)
 declare <8 x i32> @llvm.masked.load.v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)
 declare <4 x float> @llvm.masked.load.v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)
 declare <2 x float> @llvm.masked.load.v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)
 declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)
 declare <4 x double> @llvm.masked.load.v4f64(<4 x double>*, i32, <4 x i1>, <4 x double>)
 declare <2 x double> @llvm.masked.load.v2f64(<2 x double>*, i32, <2 x i1>, <2 x double>)
 declare void @llvm.masked.store.v8f64(<8 x double>, <8 x double>*, i32, <8 x i1>)
+declare void @llvm.masked.store.v4f64(<4 x double>, <4 x double>*, i32, <4 x i1>)
 declare void @llvm.masked.store.v2f64(<2 x double>, <2 x double>*, i32, <2 x i1>)
 declare void @llvm.masked.store.v2i64(<2 x i64>, <2 x i64>*, i32, <2 x i1>)

 declare <16 x i32> @llvm.masked.load.v16p0i32(<16 x i32>, i32, <16 x i1>, <16 x i32>)

 define <16 x i32> @test23(<16 x i32> %trigger, <16 x i32> %addr) {
 ; AVX1-LABEL: test23:
 ; AVX1: ## BB#0:
[...]