This is an archive of the discontinued LLVM Phabricator instance.

[x86] eliminate unnecessary vector compare for AVX masked store
Closed, Public

Authored by spatel on Sep 4 2017, 10:33 AM.

Details

Summary

As noted in PR11210:
https://bugs.llvm.org/show_bug.cgi?id=11210
...fixing this should allow us to eliminate x86-specific masked store intrinsics in IR.
(Although more testing will be needed to confirm that.)
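For reference, a rough sketch of the two IR forms involved (the exact mangling and operand order are illustrative). The generic masked-store intrinsic that the backend should handle well:

call void @llvm.masked.store.v4f32.p0v4f32(<4 x float> %val, <4 x float>* %p, i32 1, <4 x i1> %mask)

and the x86-specific intrinsic that we would then be able to eliminate:

call void @llvm.x86.avx.maskstore.ps(i8* %p, <4 x i32> %mask, <4 x float> %val)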

I don't know anything about SKX, so I don't know if the existing code is optimal or not. If we want to handle that case too, we could add a check for a 'PCMPGTM' opcode or try to fold this earlier when the node is still a setcc.
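To illustrate the fold itself (register allocation is hypothetical), the current lowering materializes the mask with a compare against zero:

vpxor %xmm3, %xmm3, %xmm3
vpcmpgtd %xmm2, %xmm3, %xmm3
vmaskmovps %xmm0, %xmm3, (%rdi)

Because vmaskmovps only tests the sign bit of each mask element, and (0 > x) is true exactly when the sign bit of x is set, the compare is redundant and the operand can be used as the mask directly:

vmaskmovps %xmm0, %xmm2, (%rdi)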

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Sep 4 2017, 10:33 AM
aymanmus added inline comments. Sep 5 2017, 1:10 AM
lib/Target/X86/X86ISelLowering.cpp
33185 ↗(On Diff #113775)

Is there any canonical form of compare-with-all-zeros that can be guaranteed here? Or should the pattern with (pcmplt X, 0) also be added?

test/CodeGen/X86/masked_memop.ll
1158 ↗(On Diff #113775)

I think the optimal code for SKX is:
vpmovd2m %xmm2, %k1
vmovups %xmm0, (%rdi) {%k1}

RKSimon added inline comments. Sep 5 2017, 4:17 AM
lib/Target/X86/X86ISelLowering.cpp
33185 ↗(On Diff #113775)

Add X86ISD::PCMPGTM support?

spatel added inline comments. Sep 5 2017, 6:14 AM
lib/Target/X86/X86ISelLowering.cpp
33185 ↗(On Diff #113775)

Waiting until this is PCMPGT is a kind of canonicalization (compared to the general setcc node) because SSE/AVX don't have any other compare predicates. Ie, there's no other simple way to encode this; there is no PCMPLT node.
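For context, the check in the masked-store combine in X86ISelLowering.cpp is roughly the following (a sketch, not the exact committed code; Mst is the MaskedStoreSDNode being combined):

SDValue Mask = Mst->getMask();
// SSE/AVX integer compares only come in EQ/GT flavors, so "x < 0" is always
// represented as (X86ISD::PCMPGT all-zeros, x) by this point.
if (Mask.getOpcode() == X86ISD::PCMPGT &&
    ISD::isBuildVectorAllZeros(Mask.getOperand(0).getNode())) {
  // The AVX masked-store instructions only read the sign bit of each mask
  // element, so rebuild the masked store using Mask.getOperand(1) as the
  // mask and drop the compare.
}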

test/CodeGen/X86/masked_memop.ll
1158 ↗(On Diff #113775)

Ok - let me try to shake that out of here. To be clear, we're saying this is the optimal sequence for any CPU with avx512vl/avx512bw. SKX is just an implementation of those ISAs.

aymanmus added inline comments. Sep 5 2017, 7:01 AM
lib/Target/X86/X86ISelLowering.cpp
33185 ↗(On Diff #113775)

100%, my fault.

test/CodeGen/X86/masked_memop.ll
1158 ↗(On Diff #113775)
  • The IACA tool shows the same throughput for both sequences, but the one I suggested has one fewer uop and uses one fewer register.
  • Actually, the required features for vpmovb2m/vpmovw2m are avx512vl+avx512bw, and for vpmovd2m/vpmovq2m they are avx512vl+avx512dq (which SKX also includes).
  • The test's %y operand is not used.
spatel added inline comments. Sep 5 2017, 10:37 AM
test/CodeGen/X86/masked_memop.ll
1158 ↗(On Diff #113775)

I need to confirm what we're saying here. For a 128-bit vector (and similarly for 256-bit), if the machine has avx512 (with all necessary variants), then we would rather see this:

vpmovd2m %xmm2, %k1
vmovups %xmm0, (%rdi) {%k1}

than the single instruction that we would produce for a plain AVX machine:

vmaskmovps %xmm0, %xmm2, (%rdi)

Ie, we want to treat vmaskmovps as legacy cruft and avoid it if we have bitmasks?

aymanmus added inline comments. Sep 6 2017, 12:42 AM
test/CodeGen/X86/masked_memop.ll
1158 ↗(On Diff #113775)

Actually, it seems like both are equivalent on SKX: they show the same throughput and number of uops, use the same ports, and the latency on each port is equal.

Nonetheless, I think we should prefer the vpmovd2m alternative because it provides a full set of instructions for all possible type granularities (byte, word, double-word, and quad-word), while the AVX vmaskmov instructions are only available for 32-bit and 64-bit elements.
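For example (a hypothetical byte-granularity case with illustrative registers), with avx512vl+avx512bw a <16 x i8> masked store can be:

vpmovb2m %xmm2, %k1
vmovdqu8 %xmm0, (%rdi) {%k1}

whereas plain AVX has no byte-element vmaskmov form, so that store would have to be expanded.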

spatel updated this revision to Diff 114028. Sep 6 2017, 10:33 AM

Patch updated:
Given that AVX512 requires different pattern matching and different output, I'm pushing back on trying to include that in this patch. It should be an independent improvement (and I'm not the right person to make that improvement).

I have updated the comments in the code and the test to reflect this. NFC from the previous rev of the patch.

RKSimon accepted this revision. Sep 12 2017, 9:36 AM

LGTM

This revision is now accepted and ready to land. Sep 12 2017, 9:36 AM

I filed a bug to track this case with AVX512:
https://bugs.llvm.org/show_bug.cgi?id=34584

This revision was automatically updated to reflect the committed changes.