This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Strength reduce cmp (xor (a, -1), xor(b, -1)) => cmp (b, a)
Abandoned (Public)

Authored by dmgreen on Feb 6 2018, 1:35 AM.

Details

Summary

This is (hopefully) the last part of PR35875. The code sequence we currently
generate is slightly worse than the original form, due to higher register pressure
causing more spills on thumb1 cores with few registers.

The sequence:

    %an = xor i8 %a, -1
    %bn = xor i8 %b, -1
    %cmp15 = icmp ult i8 %an, %bn
    %cond = select i1 %cmp15, i8 %an, i8 %bn

is strength-reduced to:

    %an = xor i8 %a, -1
    %bn = xor i8 %b, -1
    %cmp15 = icmp ult i8 %b, %a
    %cond = select i1 %cmp15, i8 %an, i8 %bn

I originally tried to do this during ISel optimisation, but by that point constant
hoisting has transformed the -1s into hoisted constants and pulled them out into
higher blocks, so the matcher would need to look through truncate/zext/register
chains into different basic blocks. Instead it is done here in CodeGenPrepare,
late enough in the pipeline so as not to break the representation of min/max.
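
For illustration, here is a minimal sketch of the shape of the transform, written
against LLVM's IR pattern matchers (an assumed reconstruction, not the patch
itself). The xors stay in place for the select's users; only the compare operands
are swapped back to the original values:

    // Sketch only: swap icmp (xor a, -1), (xor b, -1) to icmp b, a.
    // Inverting both sides reverses the (un)signed order, so the
    // predicate stays the same with the operands exchanged.
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/PatternMatch.h"
    using namespace llvm;
    using namespace PatternMatch;

    static bool swapNotNotCompare(ICmpInst *Cmp) {
      Value *A, *B;
      if (!match(Cmp->getOperand(0), m_Not(m_Value(A))) ||
          !match(Cmp->getOperand(1), m_Not(m_Value(B))))
        return false;
      // The nots remain for the select users; only the compare changes.
      Cmp->setOperand(0, B);
      Cmp->setOperand(1, A);
      return true;
    }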

Diff Detail

Event Timeline

dmgreen created this revision. Feb 6 2018, 1:35 AM

I'm skeptical about trying to solve a register pressure / allocation problem this early in the pipeline (although I know it's easier to match in IR).

If we do justify this somehow, then it must be limited by TTI hooks. Breaking min/max patterns for targets that support those ops would cause regressions. For example, if we fix the matchers to see vector 'not' ops on AArch64, we'll go from:

        mvn     v0.16b, v0.16b
        mvn     v2.16b, v2.16b
        mvn     v1.16b, v1.16b
        umin    v3.4s, v0.4s, v2.4s
        umin    v3.4s, v3.4s, v1.4s
        sub     v0.4s, v0.4s, v3.4s
        sub     v1.4s, v1.4s, v3.4s
        sub     v2.4s, v2.4s, v3.4s
        bl      vuse4

To:

        mvn     v4.16b, v0.16b
        mvn     v5.16b, v2.16b
        cmhi    v0.4s, v0.4s, v2.4s
        mvn     v1.16b, v1.16b
        bsl     v0.16b, v4.16b, v5.16b
        umin    v3.4s, v0.4s, v1.4s
        sub     v0.4s, v4.4s, v3.4s
        sub     v1.4s, v1.4s, v3.4s
        sub     v2.4s, v5.4s, v3.4s
        bl      vuse4

We'd get a similar regression on x86. That failure would show up in test/CodeGen/AArch64/minmax-of-minmax.ll, but the pattern matching in this patch is artificially limited to scalars, which is why we don't see it currently.
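
To make the "limited by TTI hooks" point concrete, here is a rough sketch of the
sort of guard that could gate the swap. The legality query is a real
TargetLowering API, but keying the decision on UMIN/SMIN legality is an
illustrative assumption, not something this patch implements:

    // Sketch only: skip the swap when the target has native integer
    // min/max for this type (e.g. AArch64 umin/smin on vectors), so the
    // backend can still match the min/max idiom.
    #include "llvm/CodeGen/ISDOpcodes.h"
    #include "llvm/CodeGen/TargetLowering.h"
    using namespace llvm;

    static bool shouldSwapNotNotCompare(const TargetLowering &TLI, EVT VT) {
      return !TLI.isOperationLegalOrCustom(ISD::UMIN, VT) &&
             !TLI.isOperationLegalOrCustom(ISD::SMIN, VT);
    }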

Thanks for taking a look. I've been looking into this more today, trying to find a sensible target hook to put this behind and/or perhaps make it more specific. It's a great shame I can't just do this in ISel (right?), after selection has happened.

I agree it's not the best as it is, and there might be something else going on here: it's not specifically register-pressure related, but it causes extra spills in the thumb1 case as a knock-on effect. The swap seems to be better for thumb2/aarch64 codegen too (where there are more registers), at least when put in a loop. Perhaps because it can reason that it does not need to perform the uxtb's / AND 0xff's? In other cases it can make things worse, even on targets without an integer min/max instruction, like arm/aarch64. I know we have some issues with uxtbs that are unnecessary but difficult to prove removable. Some of the guys here have been looking into that lately.

Note that the changes to make the i8 instcombines legal got us ~12%. This gets us an extra 25% on top! At least on these targets, where it's hit us the hardest, and just from a one-line change, switching the operands.

spatel added a comment. Feb 6 2018, 9:52 AM

> Thanks for taking a look. I've been looking into this more today, trying to find a sensible target hook to put this behind and/or perhaps make it more specific. It's a great shame I can't just do this in ISel (right?), after selection has happened.

Can you file (or maybe it's already filed?) a bug report that shows the output for the ARM/Thumb target where things are falling apart? I can see it being harder to match in the DAG, but it's not clear to me why we can't do it. Another possibility is trying to switch the operands around in MachineCombiner if we can show some kind of win via MachineTraceMetrics.
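
For comparison, a hypothetical DAG-combine version of the same swap (a sketch,
not proposed code). As the summary notes, by the time this could run, constant
hoisting has often replaced the -1 with a hoisted constant in another block, so
isBitwiseNot would fail to match:

    // Sketch only: fold setcc (xor a, -1), (xor b, -1), pred into
    // setcc b, a, pred. isBitwiseNot and getSetCC are real SelectionDAG
    // APIs; the surrounding combine plumbing is omitted.
    #include "llvm/CodeGen/SelectionDAG.h"
    using namespace llvm;

    static SDValue combineSetCCOfNots(SDNode *N, SelectionDAG &DAG) {
      SDValue LHS = N->getOperand(0);
      SDValue RHS = N->getOperand(1);
      ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();
      if (isBitwiseNot(LHS) && isBitwiseNot(RHS))
        return DAG.getSetCC(SDLoc(N), N->getValueType(0),
                            RHS.getOperand(0), LHS.getOperand(0), CC);
      return SDValue();
    }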

> higher register pressure

You're essentially transforming SELECT_CC(ult, %an, %bn, %an, %bn) to SELECT_CC(ult, %b, %a, %an, %bn); that doesn't help register pressure at all, at least not on its own.

From your testcase, I guess the problem has something to do with legalization? We could increase register pressure if we zero-extend the operands to the compare, then don't reuse the zero-extended compare operands to produce the result. That seems like a problem we could solve more effectively some other way, though.

> We could increase register pressure if we zero-extend the operands to the compare, then don't reuse the zero-extended compare operands to produce the result.

Yeah, bang on. Looks like both of the cmp operands are uxtb'd, whereas the select operands are not.

Back to the drawing board with this one I guess.

dmgreen abandoned this revision. Feb 19 2018, 8:35 AM