This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] canonicalize shifty abs(): ashr+add+xor --> cmp+neg+sel
ClosedPublic

Authored by spatel on Dec 7 2017, 2:00 PM.

Details

Summary

We want to do this for 2 reasons:

  1. Value tracking does not recognize the ashr variant, so it would fail to match for cases like D39766.
  2. DAGCombiner tries to recognize the ashr variant for scalars, but not vectors. For vectors, we only have:
// Canonicalize integer abs.
// vselect (setg[te] X,  0),  X, -X ->
// vselect (setgt    X, -1),  X, -X ->
// vselect (setl[te] X,  0), -X,  X ->
// Y = sra (X, size(X)-1); xor (add (X, Y), Y)

(the comment isn't accurate - we'll produce an ISD::ABS node if it's legal or custom)

But even for scalars, it doesn't handle commuted variants (see the DAGCombiner code under this comment):

// fold Y = sra (X, size(X)-1); xor (add (X, Y), Y) -> (abs X)

So the optimal codegen should be reachable if you start with a cmp+sel pattern, because we do this:

// Check to see if this is an integer abs.
// select_cc setg[te] X,  0,  X, -X ->
// select_cc setgt    X, -1,  X, -X ->
// select_cc setl[te] X,  0, -X,  X ->
// select_cc setlt    X,  1, -X,  X ->
// Y = sra (X, size(X)-1); xor (add (X, Y), Y)

but it allows other cases to fall through the cracks:

define i32 @abs_shifty(i32 %x) {
  %signbit = ashr i32 %x, 31 
  %add = add i32 %signbit, %x  
  %abs = xor i32 %signbit, %add 
  ret i32 %abs
}

define i32 @abs_cmpsubsel(i32 %x) {
  %cmp = icmp slt i32 %x, zeroinitializer
  %sub = sub i32 zeroinitializer, %x
  %abs = select i1 %cmp, i32 %sub, i32 %x
  ret i32 %abs
}

define <4 x i32> @abs_shifty_vec(<4 x i32> %x) {
  %signbit = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31> 
  %add = add <4 x i32> %signbit, %x  
  %abs = xor <4 x i32> %signbit, %add 
  ret <4 x i32> %abs
}

define <4 x i32> @abs_cmpsubsel_vec(<4 x i32> %x) {
  %cmp = icmp slt <4 x i32> %x, zeroinitializer
  %sub = sub <4 x i32> zeroinitializer, %x
  %abs = select <4 x i1> %cmp, <4 x i32> %sub, <4 x i32> %x
  ret <4 x i32> %abs
}

$ ./llc -o - -mattr=avx abs.ll

_abs_shifty:             
	movl	%edi, %eax
	sarl	$31, %eax
	addl	%eax, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
	retq

_abs_cmpsubsel:   
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq

_abs_shifty_vec:            
	vpsrad	$31, %xmm0, %xmm1
	vpaddd	%xmm0, %xmm1, %xmm0
	vpxor	%xmm0, %xmm1, %xmm0
	retq


_abs_cmpsubsel_vec:   
	vpabsd	%xmm0, %xmm0
	retq
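
With this patch, InstCombine should rewrite the shifty forms above into the cmp+neg+sel form that the backends already recognize. A hand-written sketch of the expected IR for the scalar and vector cases (the exact value names, predicate, and instruction flags chosen by the transform may differ):

define i32 @abs_shifty(i32 %x) {
  %cmp = icmp slt i32 %x, 0
  %neg = sub i32 0, %x
  %abs = select i1 %cmp, i32 %neg, i32 %x
  ret i32 %abs
}

define <4 x i32> @abs_shifty_vec(<4 x i32> %x) {
  %cmp = icmp slt <4 x i32> %x, zeroinitializer
  %neg = sub <4 x i32> zeroinitializer, %x
  %abs = select <4 x i1> %cmp, <4 x i32> %neg, <4 x i32> %x
  ret <4 x i32> %abs
}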

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision. Dec 7 2017, 2:00 PM
hfinkel accepted this revision. Dec 14 2017, 8:58 PM

LGTM (please take a quick look at CodeGen for other non-X86 targets (e.g., PowerPC, AArch64) and make sure that looks reasonable too).

This revision is now accepted and ready to land. Dec 14 2017, 8:58 PM

> LGTM (please take a quick look at CodeGen for other non-X86 targets (e.g., PowerPC, AArch64) and make sure that looks reasonable too).

Sure - re-reading the description, I realize that was not at all clear. Let me try to clarify here and in the commit log (although the code itself is a mess, so it's hard to describe cleanly):

  1. DAGCombiner has a generic transform for all targets to convert the scalar cmp+sel variant of abs into the shift variant. This is the opposite of this IR canonicalization.
  2. DAGCombiner has a generic transform for all targets to convert the vector cmp+sel variant of abs into either an ABS node or the shift variant. This is again the opposite of this IR canonicalization.
  3. DAGCombiner has a generic transform for all targets to convert the exact scalar shift variant produced by #1 into an ISD::ABS node. Note: It would be an efficiency improvement if we had #1 go directly to an ABS node when that's legal/custom.
  4. The pattern matching above is incomplete, so it is possible to escape the intended/optimal codegen in a variety of ways.
    1. For #2, the vector path is missing the case for setlt with a '1' constant.
    2. For #3, we are missing a match for commuted versions of the scalar shift variant.
    3. There is no vector equivalent at all for #3.
  5. Therefore, this IR canonicalization can only help get us to the optimal codegen. The cmp+sel form produced by this patch will be recognized in the DAG and converted to an ABS node when possible, or to the shift sequence when not.
  6. In the following examples with this patch applied, we may get conditional moves rather than the shift sequence produced by the generic DAGCombiner transforms. That is a target-specific decision for any given target, and whether it is optimal for a particular subtarget may be up for debate.
define i32 @abs_shifty(i32 %x) {
  %signbit = ashr i32 %x, 31 
  %add = add i32 %signbit, %x  
  %abs = xor i32 %signbit, %add 
  ret i32 %abs
}

define i32 @abs_cmpsubsel(i32 %x) {
  %cmp = icmp slt i32 %x, zeroinitializer
  %sub = sub i32 zeroinitializer, %x
  %abs = select i1 %cmp, i32 %sub, i32 %x
  ret i32 %abs
}

define <4 x i32> @abs_shifty_vec(<4 x i32> %x) {
  %signbit = ashr <4 x i32> %x, <i32 31, i32 31, i32 31, i32 31> 
  %add = add <4 x i32> %signbit, %x  
  %abs = xor <4 x i32> %signbit, %add 
  ret <4 x i32> %abs
}

define <4 x i32> @abs_cmpsubsel_vec(<4 x i32> %x) {
  %cmp = icmp slt <4 x i32> %x, zeroinitializer
  %sub = sub <4 x i32> zeroinitializer, %x
  %abs = select <4 x i1> %cmp, <4 x i32> %sub, <4 x i32> %x
  ret <4 x i32> %abs
}
> $ ./opt -instcombine shiftyabs.ll -S | ./llc -o - -mtriple=x86_64 -mattr=avx 
> abs_shifty:
> 	movl	%edi, %eax
> 	negl	%eax
> 	cmovll	%edi, %eax
> 	retq
> 
> abs_cmpsubsel:
> 	movl	%edi, %eax
> 	negl	%eax
> 	cmovll	%edi, %eax
> 	retq
> 
> abs_shifty_vec:
> 	vpabsd	%xmm0, %xmm0
> 	retq
> 
> abs_cmpsubsel_vec:
> 	vpabsd	%xmm0, %xmm0
> 	retq
> 
> $ ./opt -instcombine shiftyabs.ll -S | ./llc -o - -mtriple=aarch64
> abs_shifty:
> 	cmp	w0, #0                  // =0
> 	cneg	w0, w0, mi
> 	ret
> 
> abs_cmpsubsel: 
> 	cmp	w0, #0                  // =0
> 	cneg	w0, w0, mi
> 	ret
>                                        
> abs_shifty_vec: 
> 	abs	v0.4s, v0.4s
> 	ret
> 
> abs_cmpsubsel_vec: 
> 	abs	v0.4s, v0.4s
> 	ret
> 
> $ ./opt -instcombine shiftyabs.ll -S | ./llc -o - -mtriple=powerpc64le 
> abs_shifty:  
> 	srawi 4, 3, 31
> 	add 3, 3, 4
> 	xor 3, 3, 4
> 	blr
> 
> abs_cmpsubsel:
> 	srawi 4, 3, 31
> 	add 3, 3, 4
> 	xor 3, 3, 4
> 	blr
> 
> abs_shifty_vec:   
> 	vspltisw 3, -16
> 	vspltisw 4, 15
> 	vsubuwm 3, 4, 3
> 	vsraw 3, 2, 3
> 	vadduwm 2, 2, 3
> 	xxlxor 34, 34, 35
> 	blr
> 
> abs_cmpsubsel_vec: 
> 	vspltisw 3, -16
> 	vspltisw 4, 15
> 	vsubuwm 3, 4, 3
> 	vsraw 3, 2, 3
> 	vadduwm 2, 2, 3
> 	xxlxor 34, 34, 35
> 	blr

>   1. DAGCombiner has a generic transform for all targets to convert the scalar cmp+sel variant of abs into the shift variant. This is the opposite of this IR canonicalization.
>   2. DAGCombiner has a generic transform for all targets to convert the vector cmp+sel variant of abs into either an ABS node or the shift variant. This is again the opposite of this IR canonicalization.
>   3. DAGCombiner has a generic transform for all targets to convert the exact scalar shift variant produced by #1 into an ISD::ABS node. Note: It would be an efficiency improvement if we had #1 go directly to an ABS node when that's legal/custom.
>   4. The pattern matching above is incomplete, so it is possible to escape the intended/optimal codegen in a variety of ways.
>     1. For #2, the vector path is missing the case for setlt with a '1' constant.
>     2. For #3, we are missing a match for commuted versions of the scalar shift variant.
>     3. There is no vector equivalent at all for #3.

Re-reading the code, I see I got that wrong. The transform from shift to ABS does work for vectors (it uses isConstOrConstSplat() for the shift amount). So both scalars and vectors have the same pattern-matching hole - they won't convert to ABS for commuted variants of that shift pattern.
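
For illustration, the add and xor operands in the shift pattern can appear in either order, while the DAG matching only recognizes a fixed arrangement. A hand-written commuted arrangement of the abs_shifty example might look like this (hypothetical example, not taken from this review):

define i32 @abs_shifty_commuted(i32 %x) {
  %signbit = ashr i32 %x, 31
  %add = add i32 %x, %signbit    ; same value as 'add %signbit, %x' above, operands commuted
  %abs = xor i32 %add, %signbit  ; same value as 'xor %signbit, %add' above, operands commuted
  ret i32 %abs
}

Matching every such ordering in the DAG would require extra patterns, which is part of why canonicalizing to the single cmp+neg+sel form in IR helps.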

This revision was automatically updated to reflect the committed changes.