This is an archive of the discontinued LLVM Phabricator instance.

[x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449)
ClosedPublic

Authored by spatel on Sep 20 2018, 11:47 AM.


Details

Reviewers

RKSimon
craig.topper
lebedev.ri
andreadb

Commits

rG10c11b867a04: [x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449)
rL343008: [x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449)

Summary

This is the final (I hope!) problem pattern mentioned in PR37749:
https://bugs.llvm.org/show_bug.cgi?id=37749

We are trying to avoid an AVX1 sinkhole that arises because the bitwise logic ops are the only supported 256-bit integer ops. We've already solved the simple logic ops, but 'andn' is an x86 special. I looked at alternative solutions like extending the generic DAG combine or trying to wait until the ANDNP node is created, but those are bigger patches that can over-reach. I.e., splitting to 128-bit does not look like a win in most cases with more than one 256-bit op.

The pattern matching is cluttered with bitcasts because of our i64 element canonicalization. For the affected test, we have this vector-type-legalized sequence:

        t29: v8i32 = concat_vectors t27, t28
      t30: v4i64 = bitcast t29
        t18: v8i32 = BUILD_VECTOR Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>, Constant:i32<-1>
      t31: v4i64 = bitcast t18
    t32: v4i64 = xor t30, t31
      t9: v8i32 = BUILD_VECTOR Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>, Constant:i32<255>
    t34: v4i64 = bitcast t9
  t35: v4i64 = and t32, t34
t36: v8i32 = bitcast t35
      t37: v4i32 = extract_subvector t36, Constant:i64<0>
      t38: v4i32 = extract_subvector t36, Constant:i64<4>
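
As a rough illustration of why the split is legal, the following standalone C++ sketch models the sequence above with plain arrays instead of DAG nodes. The type and function names (`V4`, `V8`, `concat`, `extract`, `andNotWide`, `andNotNarrow`, `splitMatchesWide`) are hypothetical and are not part of LLVM. It only checks that extracting a 128-bit half from the wide and-not gives the same lanes as applying the and-not to that half directly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-ins for the vector types in the DAG dump.
using V4 = std::array<uint32_t, 4>; // models v4i32 (128 bits)
using V8 = std::array<uint32_t, 8>; // models v8i32 (256 bits)

// Models: concat_vectors Lo, Hi
V8 concat(const V4 &Lo, const V4 &Hi) {
  V8 R{};
  for (std::size_t I = 0; I < 4; ++I) {
    R[I] = Lo[I];
    R[I + 4] = Hi[I];
  }
  return R;
}

// Models: extract_subvector V, Idx (Idx is 0 or 4, as in t37/t38)
V4 extract(const V8 &V, std::size_t Idx) {
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = V[Idx + I];
  return R;
}

// Wide form: and (xor X, -1), Mask on all eight lanes.
V8 andNotWide(const V8 &X, const V8 &Mask) {
  V8 R{};
  for (std::size_t I = 0; I < 8; ++I)
    R[I] = ~X[I] & Mask[I];
  return R;
}

// Narrow form: the same operation on a single 128-bit half.
V4 andNotNarrow(const V4 &X, const V4 &Mask) {
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = ~X[I] & Mask[I];
  return R;
}

// True when extracting from the wide result matches operating on the
// original halves directly, so the concat/extract pair is redundant.
bool splitMatchesWide(const V4 &Lo, const V4 &Hi, const V8 &Mask) {
  V8 Wide = andNotWide(concat(Lo, Hi), Mask);
  return extract(Wide, 0) == andNotNarrow(Lo, extract(Mask, 0)) &&
         extract(Wide, 4) == andNotNarrow(Hi, extract(Mask, 4));
}
```

Because the operation is lane-wise, the check holds for any inputs; the profitability question discussed in the review is about instruction count, which this model does not capture.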

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Sep 20 2018, 11:47 AM

Herald added a subscriber: mcrosier. Sep 20 2018, 11:47 AM

RKSimon added a reviewer: andreadb. Sep 21 2018, 3:13 AM

Hi Sanjay,

You should add a test where the mask vector is not a constant.

I verified that on Jaguar, this change improves cases where:

- the mask is a constant
- users access the lo/hi part of the defined YMM.

In one particular case, I saw a quite nice improvement in IPC.

Unfortunately, I also found this regression:

define <8 x i32> @bar(<8 x i32> %A, <8 x i32> %B, <8 x i32> %Mask) {
  %1 = and <8 x i32> %A, %Mask
  %2 = xor <8 x i32> %1, %Mask
  %3 = add <8 x i32> %2, %B
  ret <8 x i32> %3
}

Before this patch (-mcpu=btver2):

vandnps %ymm2, %ymm0, %ymm0
vextractf128    $1, %ymm1, %xmm3
vextractf128    $1, %ymm0, %xmm2
vpaddd  %xmm1, %xmm0, %xmm0
vpaddd  %xmm3, %xmm2, %xmm2
vinsertf128     $1, %xmm2, %ymm0, %ymm0
retq

After your patch:

vxorps  %xmm3, %xmm3, %xmm3
vextractf128    $1, %ymm1, %xmm4
vcmptrueps      %ymm3, %ymm3, %ymm3
vxorps  %ymm3, %ymm0, %ymm0
vandps  %xmm2, %xmm0, %xmm3
vextractf128    $1, %ymm0, %xmm0
vextractf128    $1, %ymm2, %xmm2
vpand   %xmm2, %xmm0, %xmm0
vpaddd  %xmm1, %xmm3, %xmm1
vpaddd  %xmm4, %xmm0, %xmm0
vinsertf128     $1, %xmm0, %ymm1, %ymm0
retq

Could you please have a look at it?

Thanks,
Andrea
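
For readers wondering why the 'and' followed by 'xor' with the same mask in @bar shows up as a single vandnps in the "before" output: the two forms compute the same bits. The sketch below is a standalone C++ check of that identity over all 8-bit values; the function name `xorWithMaskIsAndNot` is made up for illustration and nothing here is LLVM code.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if (A & M) ^ M == ~A & M for every pair of 8-bit values.
// Where a mask bit is 1, the xor flips the corresponding bit of A;
// where it is 0, both sides are 0.
bool xorWithMaskIsAndNot() {
  for (unsigned A = 0; A < 256; ++A) {
    for (unsigned M = 0; M < 256; ++M) {
      uint8_t AndThenXor = static_cast<uint8_t>((A & M) ^ M);
      uint8_t AndNot = static_cast<uint8_t>(~A & M);
      if (AndThenXor != AndNot)
        return false;
    }
  }
  return true;
}
```

The identity is bitwise, so checking 8-bit values is enough to cover wider lanes as well.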

gbedwell added a subscriber: gbedwell. Sep 21 2018, 7:39 AM

In D52318#1241940, @andreadb wrote:

Unfortunately, I also found this regression:

define <8 x i32> @bar(<8 x i32> %A, <8 x i32> %B, <8 x i32> %Mask) {
  %1 = and <8 x i32> %A, %Mask
  %2 = xor <8 x i32> %1, %Mask
  %3 = add <8 x i32> %2, %B
  ret <8 x i32> %3
}

https://gcc.godbolt.org/z/Byx2OR

spatel mentioned this in rL342756: [x86] add (negative) andnp test for D52318; NFC. Sep 21 2018, 11:26 AM

Patch updated:
The previous rev of the patch hinted at a constraint that I assumed, but that wasn't actually checked: we should only do this transform when the input to the 'not' is the result of a vector concatenation. I.e., there must be some leading vector integer op that got split up itself. Without that, we're going to end up with more instructions than we started with.
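
The constraint described above can be sketched on a toy expression graph. This is a minimal standalone model, not LLVM's SelectionDAG: the `Op` enum, the `Node` struct and `shouldSplitAnd` are hypothetical, and the helper names merely mirror the ones used in the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical miniature of a DAG node; this is not LLVM's SDNode.
enum class Op { Concat, Bitcast, Xor, And, AllOnes, Other };

struct Node {
  Op Opcode;
  std::vector<const Node *> Operands;
};

// Skip any chain of bitcasts and return the underlying node.
const Node *peekThroughBitcasts(const Node *N) {
  while (N->Opcode == Op::Bitcast)
    N = N->Operands[0];
  return N;
}

// A 'not' is modeled as xor with an all-ones constant, possibly bitcast.
bool isBitwiseNot(const Node *N) {
  return N->Opcode == Op::Xor &&
         peekThroughBitcasts(N->Operands[1])->Opcode == Op::AllOnes;
}

// True only if the value being inverted comes from a concatenation.
bool isConcatenatedNot(const Node *N) {
  N = peekThroughBitcasts(N);
  if (!isBitwiseNot(N))
    return false;
  return peekThroughBitcasts(N->Operands[0])->Opcode == Op::Concat;
}

// Split the 'and' only when one operand is a 'not' of a concat.
bool shouldSplitAnd(const Node *And) {
  return And->Opcode == Op::And &&
         (isConcatenatedNot(And->Operands[0]) ||
          isConcatenatedNot(And->Operands[1]));
}
```

In this model, an 'and' whose inverted operand is a plain value (as in the regression reported earlier in the thread) is left alone, while one whose inverted operand is a concat of two halves is a candidate for splitting.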

Thanks Sanjay.

LGTM.

This revision is now accepted and ready to land. Sep 25 2018, 10:24 AM

For reference in case this comes up later: now that we're checking for the leading concat op, I think we could loosen the AVX1 constraint (do AVX512 flavors have this problem too?).

I tried to solve the problem generally in IR (if it occurs pre-legalization) with this instcombine patch:
https://reviews.llvm.org/rL342988

Closed by commit rL343008: [x86] avoid 256-bit andnp that requires insert/extract with AVX1 (PR37449) (authored by spatel). Sep 25 2018, 12:11 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                    Size
llvm/trunk/lib/CodeGen/SelectionDAG/SelectionDAG.cpp    2 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp           31 lines
llvm/trunk/test/CodeGen/X86/avx-logic.ll                18 lines

Diff 166971

llvm/trunk/lib/CodeGen/SelectionDAG/SelectionDAG.cpp

@@ SDValue llvm::peekThroughOneUseBitcasts(SDValue V) {
   while (V.getOpcode() == ISD::BITCAST && V.getOperand(0).hasOneUse())
     V = V.getOperand(0);
   return V;
 }
 
 bool llvm::isBitwiseNot(SDValue V) {
   if (V.getOpcode() != ISD::XOR)
     return false;
-  ConstantSDNode *C = isConstOrConstSplat(V.getOperand(1));
+  ConstantSDNode *C = isConstOrConstSplat(peekThroughBitcasts(V.getOperand(1)));
   return C && C->isAllOnesValue();
 }
 
 ConstantSDNode *llvm::isConstOrConstSplat(SDValue N) {
   if (ConstantSDNode *CN = dyn_cast<ConstantSDNode>(N))
     return CN;
 
   if (BuildVectorSDNode *BV = dyn_cast<BuildVectorSDNode>(N)) {
 ...
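
The isBitwiseNot change matters because, in the DAG dump from the summary, the all-ones constant is built as v8i32 and then bitcast to v4i64 before it reaches the xor. A small standalone C++ sketch (hypothetical names `V8I32`, `V4I64`, `bitcastToV4I64`, `isAllOnes`; raw byte copies stand in for the bitcast) shows that the value is still all-ones after the reinterpretation, which is why it should be safe to look through the bitcast when checking for a 'not'.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using V8I32 = std::array<uint32_t, 8>; // models v8i32
using V4I64 = std::array<uint64_t, 4>; // models v4i64

// Models "bitcast v8i32 -> v4i64" by copying the raw 256 bits.
V4I64 bitcastToV4I64(const V8I32 &V) {
  static_assert(sizeof(V4I64) == sizeof(V8I32), "both vectors are 256 bits");
  V4I64 R{};
  std::memcpy(R.data(), V.data(), sizeof(R));
  return R;
}

// True if every 64-bit lane has all bits set.
bool isAllOnes(const V4I64 &V) {
  for (uint64_t Lane : V)
    if (Lane != UINT64_MAX)
      return false;
  return true;
}
```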

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ static SDValue combineInsertSubvector(SDNode *N, SelectionDAG &DAG,
   }
 
   return SDValue();
 }
 
 static SDValue combineExtractSubvector(SDNode *N, SelectionDAG &DAG,
                                        TargetLowering::DAGCombinerInfo &DCI,
                                        const X86Subtarget &Subtarget) {
+  // For AVX1 only, if we are extracting from a 256-bit and+not (which will
+  // eventually get combined/lowered into ANDNP) with a concatenated operand,
+  // split the 'and' into 128-bit ops to avoid the concatenate and extract.
+  // We let generic combining take over from there to simplify the
+  // insert/extract and 'not'.
+  // This pattern emerges during AVX1 legalization. We handle it before lowering
+  // to avoid complications like splitting constant vector loads.
+
+  // Capture the original wide type in the likely case that we need to bitcast
+  // back to this type.
+  EVT VT = N->getValueType(0);
+  EVT WideVecVT = N->getOperand(0).getValueType();
+  SDValue WideVec = peekThroughBitcasts(N->getOperand(0));
+  if (Subtarget.hasAVX() && !Subtarget.hasAVX2() && WideVecVT.isSimple() &&
+      WideVecVT.getSizeInBits() == 256 && WideVec.getOpcode() == ISD::AND) {
+    auto isConcatenatedNot = [] (SDValue V) {
+      V = peekThroughBitcasts(V);
+      if (!isBitwiseNot(V))
+        return false;
+      SDValue NotOp = V->getOperand(0);
+      return peekThroughBitcasts(NotOp).getOpcode() == ISD::CONCAT_VECTORS;
+    };
+    if (isConcatenatedNot(WideVec.getOperand(0)) ||
+        isConcatenatedNot(WideVec.getOperand(1))) {
+      // extract (and v4i64 X, (not (concat Y1, Y2))), n -> andnp v2i64 X(n), Y1
+      SDValue Concat = split256IntArith(WideVec, DAG);
+      return DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(N), VT,
+                         DAG.getBitcast(WideVecVT, Concat), N->getOperand(1));
+    }
+  }
+
   if (DCI.isBeforeLegalizeOps())
     return SDValue();
 
   MVT OpVT = N->getSimpleValueType(0);
   SDValue InVec = N->getOperand(0);
   unsigned IdxVal = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
 
   if (ISD::isBuildVectorAllZeros(InVec.getNode()))
 ...

llvm/trunk/test/CodeGen/X86/avx-logic.ll

 define <8 x i32> @andn_disguised_i8_elts(<8 x i32> %x, <8 x i32> %y, <8 x i32> %z) {
 ; AVX1-LABEL: andn_disguised_i8_elts:
 ; AVX1: # %bb.0:
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm3
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm4
 ; AVX1-NEXT: vpaddd %xmm3, %xmm4, %xmm3
 ; AVX1-NEXT: vpaddd %xmm0, %xmm1, %xmm0
-; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0
-; AVX1-NEXT: vandnps {{.*}}(%rip), %ymm0, %ymm0
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT: vmovdqa {{.*#+}} xmm1 = [1095216660735,1095216660735]
+; AVX1-NEXT: vpandn %xmm1, %xmm0, %xmm0
+; AVX1-NEXT: vpandn %xmm1, %xmm3, %xmm1
 ; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm3
 ; AVX1-NEXT: vpaddd %xmm3, %xmm1, %xmm1
 ; AVX1-NEXT: vpaddd %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; INT256-LABEL: andn_disguised_i8_elts:
 ; INT256: # %bb.0:
 ; INT256-NEXT: vpaddd %ymm0, %ymm1, %ymm0
 ; INT256-NEXT: vpandn {{.*}}(%rip), %ymm0, %ymm0
 ; INT256-NEXT: vpaddd %ymm2, %ymm0, %ymm0
 ; INT256-NEXT: retq
   %add = add <8 x i32> %y, %x
   %neg = and <8 x i32> %add, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
   %and = xor <8 x i32> %neg, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
   %add1 = add <8 x i32> %and, %z
   ret <8 x i32> %add1
 }
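
The constant 1095216660735 in the new vmovdqa CHECK line looks unrelated to the 255 in the IR, but it appears to be the same mask printed as 64-bit elements, presumably a consequence of the i64 element canonicalization mentioned in the summary. The following standalone C++ sketch checks the arithmetic; `packSplat` is a hypothetical helper, not an LLVM function.

```cpp
#include <cassert>
#include <cstdint>

// Packs two equal i32 lanes into one i64 lane, as reinterpreting a splat
// v4i32 as v2i64 would. Both lanes are equal, so byte order does not matter.
constexpr uint64_t packSplat(uint32_t Lane) {
  return (static_cast<uint64_t>(Lane) << 32) | Lane;
}

// 0x000000FF000000FF: two 32-bit lanes of 255 viewed as one 64-bit lane.
static_assert(packSplat(255) == 1095216660735ULL,
              "two i32 lanes of 255 read as one i64 lane");
```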

+; Negative test - if we don't have a leading concat_vectors, the transform won't be profitable.
+
 define <8 x i32> @andn_variable_mask_operand_no_concat(<8 x i32> %x, <8 x i32> %y, <8 x i32> %z) {
 ; AVX1-LABEL: andn_variable_mask_operand_no_concat:
 ; AVX1: # %bb.0:
 ; AVX1-NEXT: vandnps %ymm2, %ymm0, %ymm0
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
 ; AVX1-NEXT: vpaddd %xmm3, %xmm2, %xmm2
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; INT256-LABEL: andn_variable_mask_operand_no_concat:
 ; INT256: # %bb.0:
 ; INT256-NEXT: vpandn %ymm2, %ymm0, %ymm0
 ; INT256-NEXT: vpaddd %ymm1, %ymm0, %ymm0
 ; INT256-NEXT: retq
   %and = and <8 x i32> %x, %z
   %xor = xor <8 x i32> %and, %z ; demanded bits will make this a 'not'
   %add = add <8 x i32> %xor, %y
   ret <8 x i32> %add
 }

+; Negative test - if we don't have a leading concat_vectors, the transform won't be profitable (even if the mask is a constant).
+
 define <8 x i32> @andn_constant_mask_operand_no_concat(<8 x i32> %x, <8 x i32> %y) {
 ; AVX1-LABEL: andn_constant_mask_operand_no_concat:
 ; AVX1: # %bb.0:
 ; AVX1-NEXT: vandnps {{.*}}(%rip), %ymm0, %ymm0
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm2
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm3
 ; AVX1-NEXT: vpaddd %xmm2, %xmm3, %xmm2
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
 ; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; INT256-LABEL: andn_constant_mask_operand_no_concat:
 ; INT256: # %bb.0:
 ; INT256-NEXT: vpandn {{.*}}(%rip), %ymm0, %ymm0
 ; INT256-NEXT: vpaddd %ymm1, %ymm0, %ymm0
 ; INT256-NEXT: retq
   %xor = xor <8 x i32> %x, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
   %and = and <8 x i32> %xor, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
   %r = add <8 x i32> %and, %y
   ret <8 x i32> %r
 }

+; This is a close call, but we split the 'andn' to reduce the insert/extract.
+
 define <8 x i32> @andn_variable_mask_operand_concat(<8 x i32> %x, <8 x i32> %y, <8 x i32> %z, <8 x i32> %w) {
 ; AVX1-LABEL: andn_variable_mask_operand_concat:
 ; AVX1: # %bb.0:
 ; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm4
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm5
 ; AVX1-NEXT: vpaddd %xmm4, %xmm5, %xmm4
 ; AVX1-NEXT: vpaddd %xmm1, %xmm0, %xmm0
-; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm0, %ymm0
-; AVX1-NEXT: vandnps %ymm2, %ymm0, %ymm0
-; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT: vpandn %xmm2, %xmm0, %xmm0
+; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm1
+; AVX1-NEXT: vpandn %xmm1, %xmm4, %xmm1
 ; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm2
 ; AVX1-NEXT: vpaddd %xmm2, %xmm1, %xmm1
 ; AVX1-NEXT: vpaddd %xmm3, %xmm0, %xmm0
 ; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; INT256-LABEL: andn_variable_mask_operand_concat:
 ; INT256: # %bb.0:
 ...