This is an archive of the discontinued LLVM Phabricator instance.

Differential D144451

[X86] Optimize umax(X,1)
Closed · Public

Authored by kazu on Feb 21 2023, 12:16 AM.

Details

Reviewers

RKSimon
spatel
goldstein.w.n

Commits

rGa21a7ddf5ad1: [X86] Optimize umax(X,1) (NFC)

Summary

Without this patch:

%cond = call i32 @llvm.umax.i32(i32 %X, i32 1)

is compiled as:

83 ff 02                   cmp    $0x2,%edi
b8 01 00 00 00             mov    $0x1,%eax
0f 43 c7                   cmovae %edi,%eax

With this patch, the compiler generates:

89 f8                      mov    %edi,%eax
83 ff 01                   cmp    $0x1,%edi
83 d0 00                   adc    $0x0,%eax

saving 3 bytes. We should be able to save 5 bytes in larger functions
where the mov is unnecessary.

This patch converts the specific cmov pattern to cmp $1 followed by
adc $0.
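
The arithmetic behind the rewrite can be checked in isolation. The following is a minimal C++ model, not code from the patch; `umax1_reference` and `umax1_cmp_adc` are hypothetical helper names. It assumes that `cmp $1, x` sets the carry flag exactly when x is below 1 as an unsigned value (that is, when x is 0), and that `adc $0` then adds that carry to x.

```cpp
#include <algorithm>
#include <cstdint>

// Reference semantics: what @llvm.umax.i32(x, 1) computes.
constexpr uint32_t umax1_reference(uint32_t x) {
  return std::max<uint32_t>(x, 1u);
}

// Model of `cmp $1, x` followed by `adc $0, x` (hypothetical helper).
constexpr uint32_t umax1_cmp_adc(uint32_t x) {
  const uint32_t carry = (x < 1u) ? 1u : 0u; // CF after cmp $1, x
  return x + 0u + carry;                     // adc $0, x
}

// Spot checks: only x == 0 is changed, and it becomes 1.
static_assert(umax1_cmp_adc(0u) == umax1_reference(0u), "x == 0");
static_assert(umax1_cmp_adc(1u) == umax1_reference(1u), "x == 1");
static_assert(umax1_cmp_adc(2u) == umax1_reference(2u), "x == 2");
static_assert(umax1_cmp_adc(UINT32_MAX) == umax1_reference(UINT32_MAX),
              "x == UINT32_MAX");
```

The carry is set only for x == 0, so the addition cannot overflow.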

This patch partially fixes:

https://github.com/llvm/llvm-project/issues/60374

The LLVM IR optimizer is yet to canonicalize max expressions to
actual @llvm.umax.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kazu created this revision. Feb 21 2023, 12:16 AM

Herald added a project: Restricted Project. Feb 21 2023, 12:16 AM

Herald added subscribers: pengfei, hiraditya.

kazu requested review of this revision. Feb 21 2023, 12:16 AM

Herald added a project: Restricted Project. Feb 21 2023, 12:16 AM

Herald added a subscriber: llvm-commits.

kazu added reviewers: RKSimon, spatel, goldstein.w.n. Feb 21 2023, 12:17 AM

Herald added a subscriber: StephenFan. Feb 21 2023, 12:17 AM

Harbormaster completed remote builds in B214936: Diff 499053. Feb 21 2023, 2:14 AM

kazu retitled this revision from [X86] Optimize umax(X,1) (NFC) to [X86] Optimize umax(X,1). Feb 21 2023, 7:13 AM

Slightly unrelated, is it possible to get the codegen to be:

	xorl	%eax, %eax
	cmp	$0x1, %edi
	adc	%edi, %eax

If so that would be preferable (except in the register-pressure corner case, because eax and edi now have a slight live-range overlap).
On targets like ICX it saves a full instruction because there is no move elimination.
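
To compare the two candidate sequences side by side, here is a small C++ sketch that models only the carry flag. The `Flags` struct and the `cmp`, `adc`, `seq_mov_cmp_adc` and `seq_xor_cmp_adc` helpers are hypothetical names for illustration, and the model deliberately ignores every flag other than CF as well as the flags that `adc` and `xor` themselves write.

```cpp
#include <cstdint>

// Only the carry flag matters for this comparison.
struct Flags {
  bool cf = false;
};

// cmp imm, reg: evaluates reg - imm and sets CF on unsigned borrow.
inline void cmp(uint32_t imm, uint32_t reg, Flags &f) { f.cf = reg < imm; }

// adc src, dst: dst = dst + src + CF (flag outputs of adc are not modeled).
inline void adc(uint32_t src, uint32_t &dst, const Flags &f) {
  dst = dst + src + (f.cf ? 1u : 0u);
}

// mov %edi, %eax ; cmp $1, %edi ; adc $0, %eax
inline uint32_t seq_mov_cmp_adc(uint32_t edi) {
  Flags f;
  uint32_t eax = edi;
  cmp(1u, edi, f);
  adc(0u, eax, f);
  return eax;
}

// xor %eax, %eax ; cmp $1, %edi ; adc %edi, %eax
inline uint32_t seq_xor_cmp_adc(uint32_t edi) {
  Flags f;
  uint32_t eax = 0u;
  cmp(1u, edi, f);
  adc(edi, eax, f);
  return eax;
}
```

Both functions return the same value for every input; the difference discussed in this thread is about encoding size, move elimination and live ranges, which this model does not capture.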

This doesn't handle i8 and i16 the same way because truncates get in the way, right? Would it be better to match this as "umax(x, 1)" (before it becomes x86-specific instructions)? Or could we peek through a truncate and still generate the adc for i8/i16?

In D144451#4142230, @spatel wrote:

This doesn't handle i8 and i16 the same way because truncates get in the way, right? Would it be better to match this as "umax(x, 1)" (before it becomes x86-specific instructions)? Or could we peek through a truncate and still generate the adc for i8/i16?

I think there is potentially a bit of danger of transforming umin as other patterns seem more likely to look for umin than the alternative (select or x86adc).

goldstein.w.n added inline comments. Feb 21 2023, 12:26 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
47277	Can it also match `X86ISD::CMP`?
47286	Can `ADC` valuetype be different than comparison valuetype?

In D144451#4142412, @goldstein.w.n wrote:

In D144451#4142230, @spatel wrote:

This doesn't handle i8 and i16 the same way because truncates get in the way, right? Would it be better to match this as "umax(x, 1)" (before it becomes x86-specific instructions)? Or could we peek through a truncate and still generate the adc for i8/i16?

I think there is potentially a bit of danger of transforming umin as other patterns seem more likely to look for umin than the alternative (select or x86adc).

Looking a little more, think you're right, best to do this in LowerSELECT before the X86ISD::CMOV logic.

Do we have i8/i16 test coverage anywhere?

Incidentally, I'm guessing the vector equivalent code will be pretty poor as well when a umax op isn't available (v2i64 pre-AVX512 etc).

In D144451#4142962, @RKSimon wrote:

Incidentally, I'm guessing the vector equivalent code will be pretty poor as well when a umax op isn't available (v2i64 pre-AVX512 etc).

Yes, I posted an example of that in https://github.com/llvm/llvm-project/issues/60374 :
https://alive2.llvm.org/ce/z/didmjC

If there's no legal/custom umax, we probably want to convert to setcc + add. Once that's done, we can take another step to canonicalize the IR icmp+zext+add sequence to umax.
There's a sibling pattern with umin:
https://alive2.llvm.org/ce/z/cXL_d9
...and related signed patterns too.

In D144451#4142080, @goldstein.w.n wrote:
Slightly unrelated, is it possible to get the codegen to be:
	xorl	%eax, %eax
	cmp	$0x1, %edi
	adc	%edi, %eax
?

If so that would be preferable (except in the register-pressure corner case, because eax and edi now have a slight live-range overlap).
On targets like ICX it saves a full instruction because there is no move elimination.

I've never thought of that. Very interesting. The comparison of the two approaches is such a close call. I have a slight preference for the mov-cmp-adc route. If we assume that users typically do not care about the value of x past umax(x,1), which I haven't surveyed in real-world applications, we shouldn't have to make a copy of the source operand with a mov, especially in large enough functions where the calling conventions (edi->eax) aren't much of a constraint. Friendliness under high register pressure along with the 6-byte encoding (without a mov) is a plus.

In D144451#4142960, @RKSimon wrote:

Do we have i8/i16 test coverage anywhere?

I just checked in 86bd9c984154d625392bfab05541bbe9ee18b6ab. This patch being reviewed here does not handle i8/i16 yet.

In D144451#4153259, @kazu wrote:

In D144451#4142960, @RKSimon wrote:

Do we have i8/i16 test coverage anywhere?

I just checked in 86bd9c984154d625392bfab05541bbe9ee18b6ab. This patch being reviewed here does not handle i8/i16 yet.

I think it might make sense to move this to LowerSELECT so it can handle i8/i16. It also makes more sense to lower this directly to adc/sbb rather than through cmovcc.

Add support for i8 and i16 by seeing through ISD::TRUNCATE.

Please take a look.

I settled on the idea of seeing through ISD::TRUNCATE. I tried sending ISD::UMAX of MVT::i8 through MVT::i64 to LowerMINMAX with setOperationAction, but I ended up seeing various differences in the generated code even without my actual change for the optimization.
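
Seeing through the truncate means the compare can run at the narrow width while the add runs at the wider one, as in the `cmpb $1, %al` / `adcl $0, %eax` pair in the updated umax.ll checks. The C++ sketch below is an illustrative model of that pairing, not code from the patch; `umax1_i8_via_i32` is a hypothetical name, and the upper 24 bits of the register are assumed to hold arbitrary values.

```cpp
#include <cstdint>

// Model of: cmpb $1, %al ; adcl $0, %eax ; result read back from %al.
// `eax` carries the i8 value in its low byte; the upper bits are arbitrary.
inline uint8_t umax1_i8_via_i32(uint32_t eax) {
  const uint8_t al = static_cast<uint8_t>(eax); // the truncated (i8) view
  const uint32_t carry = (al < 1u) ? 1u : 0u;   // CF from the 8-bit compare
  eax = eax + 0u + carry;                       // 32-bit add-with-carry
  return static_cast<uint8_t>(eax);             // only %al is live afterwards
}
```

The carry is set only when the low byte is 0, so adding it turns that byte into 1 without carrying into the upper bits, and the truncated result equals umax of the i8 value and 1 regardless of what the upper bits held.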

Harbormaster completed remote builds in B217351: Diff 502353. Mar 4 2023, 12:01 AM

spatel mentioned this in D145299: [InstCombine] Generate better code for std::bit_ceil. Mar 5 2023, 5:06 AM

RKSimon added inline comments. Mar 5 2023, 5:47 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
47439	Cond.hasOneUse() will only check the sub flags output has one use - what about the sub result? Cond->hasOneUse()?
47443	(style) auto *Sub1C =

Use auto and Cond->hasOneUse().

Please take a look. Thanks!

Harbormaster completed remote builds in B217453: Diff 502472. Mar 5 2023, 2:45 PM

LGTM cheers

This revision is now accepted and ready to land. Mar 6 2023, 3:06 AM

spatel accepted this revision. Mar 6 2023, 5:00 AM

Closed by commit rGa21a7ddf5ad1: [X86] Optimize umax(X,1) (NFC) (authored by kazu). Mar 6 2023, 10:19 AM

This revision was automatically updated to reflect the committed changes.

kazu added a commit: rGa21a7ddf5ad1: [X86] Optimize umax(X,1) (NFC).

Revision Contents

Path                                        Size
llvm/lib/Target/X86/X86ISelLowering.cpp     26 lines
llvm/test/CodeGen/X86/umax.ll               53 lines

Diff 502698

llvm/lib/Target/X86/X86ISelLowering.cpp

@@ 47,268 lines not shown @@ static SDValue combineSetCCEFLAGS(SDValue EFLAGS, X86::CondCode &CC,
   if (SDValue R = checkBoolTestSetCCCombine(EFLAGS, CC))
     return R;

   if (SDValue R = combinePTESTCC(EFLAGS, CC, DAG, Subtarget))
     return R;

   if (SDValue R = combineSetCCMOVMSK(EFLAGS, CC, DAG, Subtarget))
     return R;

[Inline comment, goldstein.w.n: Can it also match `X86ISD::CMP`?]
   return combineSetCCAtomicArith(EFLAGS, CC, DAG, Subtarget);
 }

 /// Optimize X86ISD::CMOV [LHS, RHS, CONDCODE (e.g. X86::COND_NE), CONDVAL]
 static SDValue combineCMov(SDNode *N, SelectionDAG &DAG,
                            TargetLowering::DAGCombinerInfo &DCI,
                            const X86Subtarget &Subtarget) {
   SDLoc DL(N);

[Inline comment, goldstein.w.n: Can `ADC` valuetype be different than comparison valuetype?]
   SDValue FalseOp = N->getOperand(0);
   SDValue TrueOp = N->getOperand(1);
   X86::CondCode CC = (X86::CondCode)N->getConstantOperandVal(2);
   SDValue Cond = N->getOperand(3);

   // cmov X, X, ?, ? --> X
   if (TrueOp == FalseOp)
     return TrueOp;
@@ 128 lines not shown @@ if ((Cond.getOpcode() == X86ISD::CMP || Cond.getOpcode() == X86ISD::SUB) &&
           CmpAgainst == dyn_cast<ConstantSDNode>(TrueOp)) {
         SDValue Ops[] = {FalseOp, Cond.getOperand(0),
                          DAG.getTargetConstant(CC, DL, MVT::i8), Cond};
         return DAG.getNode(X86ISD::CMOV, DL, N->getValueType(0), Ops);
       }
     }
   }

+  // Transform:
+  //
+  // (cmov 1 T (uge T 2))
+  //
+  // to:
+  //
+  // (adc T 0 (sub T 1))
+  if (CC == X86::COND_AE && isOneConstant(FalseOp) &&
+      Cond.getOpcode() == X86ISD::SUB && Cond->hasOneUse()) {
[Inline comment, RKSimon: Cond.hasOneUse() will only check the sub flags output has one use - what about the sub result? Cond->hasOneUse()?]
+    SDValue Cond0 = Cond.getOperand(0);
+    if (Cond0.getOpcode() == ISD::TRUNCATE)
+      Cond0 = Cond0.getOperand(0);
+    auto *Sub1C = dyn_cast<ConstantSDNode>(Cond.getOperand(1));
[Inline comment, RKSimon: (style) auto *Sub1C =]
+    if (Cond0 == TrueOp && Sub1C && Sub1C->getZExtValue() == 2) {
+      EVT CondVT = Cond->getValueType(0);
+      EVT OuterVT = N->getValueType(0);
+      // Subtract 1 and generate a carry.
+      SDValue NewSub =
+          DAG.getNode(X86ISD::SUB, DL, Cond->getVTList(), Cond.getOperand(0),
+                      DAG.getConstant(1, DL, CondVT));
+      SDValue EFLAGS(NewSub.getNode(), 1);
+      return DAG.getNode(X86ISD::ADC, DL, DAG.getVTList(OuterVT, MVT::i32),
+                         TrueOp, DAG.getConstant(0, DL, OuterVT), EFLAGS);
+    }
+  }

   // Fold and/or of setcc's to double CMOV:
   // (CMOV F, T, ((cc1 | cc2) != 0)) -> (CMOV (CMOV F, T, cc1), T, cc2)
   // (CMOV F, T, ((cc1 & cc2) != 0)) -> (CMOV (CMOV T, F, !cc1), F, !cc2)
   //
   // This combine lets us generate:
   // cmovcc1 (jcc1 if we don't have CMOV)
   // cmovcc2 (same)
   // instead of:
@@ 11,172 lines not shown @@

llvm/test/CodeGen/X86/umax.ll

@@ 38 lines not shown @@
 ; X86-NEXT: retl
   %r = call i8 @llvm.umax.i8(i8 %a, i8 %b)
   ret i8 %r
 }

 define i8 @test_i8_1(i8 %a) nounwind {
 ; X64-LABEL: test_i8_1:
 ; X64: # %bb.0:
-; X64-NEXT: cmpb $2, %dil
-; X64-NEXT: movl $1, %eax
-; X64-NEXT: cmovael %edi, %eax
+; X64-NEXT: movl %edi, %eax
+; X64-NEXT: cmpb $1, %al
+; X64-NEXT: adcl $0, %eax
 ; X64-NEXT: # kill: def $al killed $al killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-LABEL: test_i8_1:
 ; X86: # %bb.0:
-; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: cmpb $2, %cl
-; X86-NEXT: movl $1, %eax
-; X86-NEXT: cmovael %ecx, %eax
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: cmpb $1, %al
+; X86-NEXT: adcl $0, %eax
 ; X86-NEXT: # kill: def $al killed $al killed $eax
 ; X86-NEXT: retl
   %r = call i8 @llvm.umax.i8(i8 %a, i8 1)
   ret i8 %r
 }

 define i16 @test_i16(i16 %a, i16 %b) nounwind {
 ; X64-LABEL: test_i16:
@@ 14 lines not shown @@
 ; X86-NEXT: retl
   %r = call i16 @llvm.umax.i16(i16 %a, i16 %b)
   ret i16 %r
 }

 define i16 @test_i16_1(i16 %a) nounwind {
 ; X64-LABEL: test_i16_1:
 ; X64: # %bb.0:
-; X64-NEXT: cmpw $2, %di
-; X64-NEXT: movl $1, %eax
-; X64-NEXT: cmovael %edi, %eax
+; X64-NEXT: movl %edi, %eax
+; X64-NEXT: cmpw $1, %ax
+; X64-NEXT: adcl $0, %eax
 ; X64-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-LABEL: test_i16_1:
 ; X86: # %bb.0:
-; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: cmpw $2, %cx
-; X86-NEXT: movl $1, %eax
-; X86-NEXT: cmovael %ecx, %eax
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: cmpw $1, %ax
+; X86-NEXT: adcl $0, %eax
 ; X86-NEXT: # kill: def $ax killed $ax killed $eax
 ; X86-NEXT: retl
   %r = call i16 @llvm.umax.i16(i16 %a, i16 1)
   ret i16 %r
 }

 define i24 @test_i24(i24 %a, i24 %b) nounwind {
 ; X64-LABEL: test_i24:
@@ 35 lines not shown @@
 ; X86-NEXT: retl
   %r = call i32 @llvm.umax.i32(i32 %a, i32 %b)
   ret i32 %r
 }

 define i32 @test_i32_1(i32 %a) nounwind {
 ; X64-LABEL: test_i32_1:
 ; X64: # %bb.0:
-; X64-NEXT: cmpl $2, %edi
-; X64-NEXT: movl $1, %eax
-; X64-NEXT: cmovael %edi, %eax
+; X64-NEXT: movl %edi, %eax
+; X64-NEXT: cmpl $1, %edi
+; X64-NEXT: adcl $0, %eax
 ; X64-NEXT: retq
 ;
 ; X86-LABEL: test_i32_1:
 ; X86: # %bb.0:
-; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-NEXT: cmpl $2, %ecx
-; X86-NEXT: movl $1, %eax
-; X86-NEXT: cmovael %ecx, %eax
+; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: cmpl $1, %eax
+; X86-NEXT: adcl $0, %eax
 ; X86-NEXT: retl
   %r = call i32 @llvm.umax.i32(i32 %a, i32 1)
   ret i32 %r
 }

 define i64 @test_i64(i64 %a, i64 %b) nounwind {
 ; X64-LABEL: test_i64:
 ; X64: # %bb.0:
@@ 22 lines not shown @@
 ; X86-NEXT: retl
   %r = call i64 @llvm.umax.i64(i64 %a, i64 %b)
   ret i64 %r
 }

 define i64 @test_i64_1(i64 %a) nounwind {
 ; X64-LABEL: test_i64_1:
 ; X64: # %bb.0:
-; X64-NEXT: cmpq $2, %rdi
-; X64-NEXT: movl $1, %eax
-; X64-NEXT: cmovaeq %rdi, %rax
+; X64-NEXT: movq %rdi, %rax
+; X64-NEXT: cmpq $1, %rdi
+; X64-NEXT: adcq $0, %rax
 ; X64-NEXT: retq
 ;
 ; X86-LABEL: test_i64_1:
 ; X86: # %bb.0:
 ; X86-NEXT: pushl %esi
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %edx
-; X86-NEXT: cmpl $2, %ecx
-; X86-NEXT: movl $1, %eax
-; X86-NEXT: movl $1, %esi
-; X86-NEXT: cmovael %ecx, %esi
+; X86-NEXT: cmpl $1, %ecx
+; X86-NEXT: movl %ecx, %esi
+; X86-NEXT: adcl $0, %esi
 ; X86-NEXT: testl %edx, %edx
+; X86-NEXT: movl $1, %eax
 ; X86-NEXT: cmovnel %ecx, %eax
 ; X86-NEXT: cmovel %esi, %eax
 ; X86-NEXT: popl %esi
 ; X86-NEXT: retl
   %r = call i64 @llvm.umax.i64(i64 %a, i64 1)
   ret i64 %r
 }

@@ 649 lines not shown @@