This is an archive of the discontinued LLVM Phabricator instance.


Differential D106684

[x86] improve CMOV codegen by pushing add into operands, part 2
Closed, Public

Authored by spatel on Jul 23 2021, 9:59 AM.

Details

Reviewers

craig.topper
lebedev.ri
pengfei
RKSimon

Commits

rG1ce05ad619a5: [x86] improve CMOV codegen by pushing add into operands, part 2

Summary

This is a minimum extension of D106607 to allow folding for 2 non-zero constants.

In the reduced test examples, we save 1 instruction by rolling the constants into LEA/ADD. In the motivating test from the bullet benchmark, we absorb both of the constant moves into add ops via LEA magic, so we reduce by 2 instructions.
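As a rough illustration of the fold described in the summary, the sketch below models it in plain C++ rather than on SelectionDAG nodes. The function names are hypothetical and unsigned arithmetic stands in for wrapping `add`; it only shows that adding before or after the select gives the same value.

```cpp
#include <cassert>
#include <cstdint>

// Original shape: select one of two constants, then add the offset.
// On x86 both constants must be materialized in registers for the CMOV.
constexpr uint64_t addAfterSelect(uint64_t offset, bool b, uint64_t c1,
                                  uint64_t c2) {
  uint64_t s = b ? c1 : c2;
  return offset + s;
}

// Transformed shape: fold each constant into its own add (an LEA/ADD with an
// immediate on x86), then select between the two sums.
constexpr uint64_t addBeforeSelect(uint64_t offset, bool b, uint64_t c1,
                                   uint64_t c2) {
  uint64_t t = offset + c1;
  uint64_t f = offset + c2;
  return b ? t : f;
}

// Hypothetical check helper: both shapes agree for the 20/43 test constants.
inline void checkPushAddIntoSelect() {
  assert(addAfterSelect(100, true, 20, 43) == addBeforeSelect(100, true, 20, 43));
  assert(addAfterSelect(100, false, 20, 43) == addBeforeSelect(100, false, 20, 43));
}
```

The second shape is what lets the backend drop the two constant moves, since each constant becomes an immediate operand of its add.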

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Jul 23 2021, 9:59 AM

Herald added subscribers: hiraditya, mcrosier. Jul 23 2021, 9:59 AM

spatel requested review of this revision. Jul 23 2021, 9:59 AM

Herald added a project: Restricted Project. Jul 23 2021, 9:59 AM

Harbormaster completed remote builds in B115892: Diff 361263. Jul 23 2021, 10:00 AM

Hmm, could you please add tests with 64-bit add immediate?

llvm/test/CodeGen/X86/add-cmov.ll
213–214	The other benefit is that the `add` can execute without waiting for the `cmov`. https://godbolt.org/z/zz5EKhrKj

Patch updated:

Added tests for various larger-than-i32-immediate constants.
Limited the fold if there's a large constant and the other constant is non-zero. It's possible that we can loosen that check, but I noticed that immediate 0x80000000 gets folded into `subq $-2147483648, %rcx` with an extra `movq reg, reg`, so that could increase the total number of instructions.

Harbormaster completed remote builds in B115967: Diff 361369. Jul 23 2021, 4:00 PM

In D106684#2900784, @lebedev.ri wrote:

Hmm, could you please add tests with 64-bit add immediate?

Good catch - this transform is based on being able to use immediate operands to save instructions. If the constants are too big, then we should not do it.
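The size limit can be sketched as below. This is a standalone approximation, not the patch itself: `fitsInSignedImm32` and `constantsAllowFold` are hypothetical names, and plain `int64_t` stands in for the `APInt` that the patch queries with `isSignedIntN(32)`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for APInt::isSignedIntN(32): true when the value can
// be encoded as a sign-extended 32-bit immediate.
constexpr bool fitsInSignedImm32(int64_t v) {
  return v >= std::numeric_limits<int32_t>::min() &&
         v <= std::numeric_limits<int32_t>::max();
}

// Models the condition in the updated patch: a zero operand always qualifies
// (its add folds away); otherwise both constants must fit as immediates.
constexpr bool constantsAllowFold(int64_t c0, int64_t c1) {
  return c0 == 0 || c1 == 0 ||
         (fitsInSignedImm32(c0) && fitsInSignedImm32(c1));
}

// Hypothetical check helper using constants from the add-cmov.ll tests.
inline void checkImmediateLimit() {
  assert(constantsAllowFold(2147483647, 2));   // select_max32_2_i64 folds
  assert(!constantsAllowFold(42, 2147483648)); // select_42_min32_i64 does not
}
```

Under this model 0x7FFFFFFF is accepted and 0x80000000 is rejected, which matches the test output in the diff, where `select_max32_2_i64` changes and `select_42_min32_i64` does not.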

LG unless there are other comments.
Thanks.

llvm/test/CodeGen/X86/add-cmov.ll
158–159	Would transforming this into
  %tval = add i64 %offset, 42
  %fval = add i64 %tval, 2147483606
  %r = select i1 %b, i64 %tval, i64 %fval
be a win? https://godbolt.org/z/rYjzePnr4
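The arithmetic behind this suggestion can be checked with a small standalone sketch. The function names are hypothetical; the point is that 42 + 2147483606 equals 2147483648, so the large constant is reached through two adds whose immediates each fit in a signed 32-bit field.

```cpp
#include <cassert>
#include <cstdint>

// Shape in the test: select between 42 and 0x80000000, then add the offset.
constexpr uint64_t selectThenAdd(uint64_t offset, bool b) {
  return offset + (b ? 42ull : 2147483648ull);
}

// Suggested shape: chain the adds so that neither immediate exceeds INT32_MAX.
constexpr uint64_t chainedAdds(uint64_t offset, bool b) {
  uint64_t tval = offset + 42;
  uint64_t fval = tval + 2147483606ull; // 42 + 2147483606 == 2147483648
  return b ? tval : fval;
}

static_assert(42ull + 2147483606ull == 2147483648ull,
              "the two small steps sum to the large constant");
static_assert(2147483606ull <= static_cast<uint64_t>(INT32_MAX),
              "the second step still fits a signed 32-bit immediate");

// Hypothetical check helper: both shapes agree for either select result.
inline void checkChainedAdds() {
  assert(selectThenAdd(7, true) == chainedAdds(7, true));
  assert(selectThenAdd(7, false) == chainedAdds(7, false));
}
```

The cost of this form is that the false arm depends on the true arm's add, so whether it wins in practice would need measuring.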

This revision is now accepted and ready to land. Jul 24 2021, 12:36 AM

LGTM - cheers

RKSimon added inline comments. Jul 24 2021, 4:17 AM

llvm/test/CodeGen/X86/add-cmov.ll
251	I think this could be: https://llvm.godbolt.org/z/WMaPvfKKh
  leaq (%rdx,%rdx,4), %rax
  shlq $4, %rax
  leaq 6(%rax), %rcx
  testb $1, %dil
  cmovneq %rax, %rcx
  leaq 60(%rsi,%rcx), %rax

spatel marked an inline comment as done. Jul 25 2021, 6:38 AM

spatel added inline comments.

llvm/test/CodeGen/X86/add-cmov.ll
158–159	Yes, it seems likely to be a win if we can combine the adds sequentially like that. We should check how often we see large constants like this before making the transform more complicated though. I don't have stats on it, but seems unlikely? This is the example that I was referring to when I updated the patch - we change the add to sub somewhere in X86ISelDAGToDAG, causing a seemingly extra reg copy later on:
  movq %rdi, %rax
  movq %rdi, %rcx
  subq $-2147483648, %rcx ## imm = 0x80000000
  addq $42, %rax
  testb $1, %sil
  cmoveq %rcx, %rax
251	We would need to factor out the common ops in the expressions and move it after the select...not sure what layer to try that yet. I think that's the same as what you showed in https://llvm.org/PR51069, I'll put this into the bug report, so we have another example to look at.
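To make the "factor out the common ops" idea concrete, here is a standalone sketch of the address arithmetic in the `bullet` test. The scale of 80 and the offsets 60 and 66 are read off the CHECK lines (`leaq (%rdx,%rdx,4)` followed by `shlq $4`); the function names are hypothetical and addresses are modeled as plain integers.

```cpp
#include <cassert>
#include <cstdint>

// Shape after this patch: both full addresses are computed, then one is
// selected (two LEAs feeding a CMOV).
constexpr uint64_t bothArmsThenSelect(uint64_t ptr, uint64_t idx, bool b) {
  uint64_t scaled = idx * 80;
  uint64_t t = ptr + scaled + 60;
  uint64_t f = ptr + scaled + 66;
  return b ? t : f;
}

// Suggested shape: select only the part that differs (0 vs. 6), then apply
// the common base and offset once after the select.
constexpr uint64_t selectThenCommonAdd(uint64_t ptr, uint64_t idx, bool b) {
  uint64_t scaled = idx * 80;
  uint64_t sel = b ? scaled : scaled + 6;
  return ptr + sel + 60;
}

// Hypothetical check helper: both shapes yield the same address.
inline void checkFactoredSelect() {
  assert(bothArmsThenSelect(0x1000, 3, true) == selectThenCommonAdd(0x1000, 3, true));
  assert(bothArmsThenSelect(0x1000, 3, false) == selectThenCommonAdd(0x1000, 3, false));
}
```

This only demonstrates that the two forms are equivalent; as the reply notes, where in the pipeline such factoring should happen is still open.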

This revision was landed with ongoing or failed builds. Jul 25 2021, 7:08 AM

Closed by commit rG1ce05ad619a5: [x86] improve CMOV codegen by pushing add into operands, part 2 (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

spatel added a commit: rG1ce05ad619a5: [x86] improve CMOV codegen by pushing add into operands, part 2.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	16 lines
llvm/test/CodeGen/X86/add-cmov.ll	37 lines

Diff 361498

llvm/lib/Target/X86/X86ISelLowering.cpp

(49,867 lines not shown)
   return SplitOpsAndApply(DAG, Subtarget, DL, VT, { In0, In1 },
                           PMADDBuilder);
 }

 /// CMOV of constants requires materializing constant operands in registers.
 /// Try to fold those constants into an 'add' instruction to reduce instruction
 /// count. We do this with CMOV rather the generic 'select' because there are
 /// earlier folds that may be used to turn select-of-constants into logic hacks.
 static SDValue pushAddIntoCmovOfConsts(SDNode *N, SelectionDAG &DAG) {
-  // This checks for a zero operand because add-of-0 gets simplified away.
-  // TODO: Allow generating an extra add?
+  // If an operand is zero, add-of-0 gets simplified away, so that's clearly
+  // better because we eliminate 1-2 instructions. This transform is still
+  // an improvement without zero operands because we trade 2 move constants and
+  // 1 add for 2 adds (LEA) as long as the constants can be represented as
+  // immediate asm operands (fit in 32-bits).
   auto isSuitableCmov = [](SDValue V) {
     if (V.getOpcode() != X86ISD::CMOV || !V.hasOneUse())
       return false;
-    return isa<ConstantSDNode>(V.getOperand(0)) &&
-           isa<ConstantSDNode>(V.getOperand(1)) &&
-           (isNullConstant(V.getOperand(0)) || isNullConstant(V.getOperand(1)));
+    if (!isa<ConstantSDNode>(V.getOperand(0)) ||
+        !isa<ConstantSDNode>(V.getOperand(1)))
+      return false;
+    return isNullConstant(V.getOperand(0)) || isNullConstant(V.getOperand(1)) ||
+           (V.getConstantOperandAPInt(0).isSignedIntN(32) &&
+            V.getConstantOperandAPInt(1).isSignedIntN(32));
   };

   // Match an appropriate CMOV as the first operand of the add.
   SDValue Cmov = N->getOperand(0);
   SDValue OtherOp = N->getOperand(1);
   if (!isSuitableCmov(Cmov))
     std::swap(Cmov, OtherOp);
   if (!isSuitableCmov(Cmov))
(2,640 lines not shown)

llvm/test/CodeGen/X86/add-cmov.ll

(82 lines not shown)
 ; CHECK-NEXT: retq
   %b = icmp sgt i64 %x, 41
   %s = select i1 %b, i32 0, i32 43
   store i32 %s, i32* %p
   %r = add i32 %offset, %s
   ret i32 %r
 }

+; Special-case LEA hacks are done before we try to push the add into a CMOV.
+
 define i32 @select_40_43_i32(i32 %offset, i64 %x) {
 ; CHECK-LABEL: select_40_43_i32:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: # kill: def $edi killed $edi def $rdi
 ; CHECK-NEXT: xorl %eax, %eax
 ; CHECK-NEXT: cmpq $42, %rsi
 ; CHECK-NEXT: setl %al
 ; CHECK-NEXT: leal (%rax,%rax,2), %eax
(29 lines not shown)
 ; CHECK-NEXT: retq
   %s = select i1 %b, i32 1, i32 0
   %r = add i32 %offset, %s
   ret i32 %r
 }

 define i64 @select_max32_2_i64(i64 %offset, i64 %x) {
 ; CHECK-LABEL: select_max32_2_i64:
 ; CHECK: # %bb.0:
+; CHECK-NEXT: leaq 2(%rdi), %rax
+; CHECK-NEXT: addq $2147483647, %rdi # imm = 0x7FFFFFFF
 ; CHECK-NEXT: cmpq $41, %rsi
-; CHECK-NEXT: movl $2147483647, %ecx # imm = 0x7FFFFFFF
-; CHECK-NEXT: movl $2, %eax
-; CHECK-NEXT: cmovneq %rcx, %rax
-; CHECK-NEXT: addq %rdi, %rax
+; CHECK-NEXT: cmovneq %rdi, %rax
 ; CHECK-NEXT: retq
   %b = icmp ne i64 %x, 41
   %s = select i1 %b, i64 2147483647, i64 2
   %r = add i64 %offset, %s
   ret i64 %r
 }

 define i64 @select_42_min32_i64(i64 %offset, i1 %b) {
 ; CHECK-LABEL: select_42_min32_i64:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: testb $1, %sil
 ; CHECK-NEXT: movl $42, %ecx
 ; CHECK-NEXT: movl $2147483648, %eax # imm = 0x80000000
 ; CHECK-NEXT: cmovneq %rcx, %rax
 ; CHECK-NEXT: addq %rdi, %rax
 ; CHECK-NEXT: retq
   %s = select i1 %b, i64 42, i64 2147483648
   %r = add i64 %offset, %s
   ret i64 %r
 }

 define i64 @select_big_42_i64(i64 %offset, i64 %x) {
 ; CHECK-LABEL: select_big_42_i64:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: cmpq $41, %rsi
 ; CHECK-NEXT: movl $2147483649, %ecx # imm = 0x80000001
(35 lines not shown)
 ; CHECK-NEXT: retq
   %s = select i1 %b, i64 2147483649, i64 42000000000
   %r = add i64 %s, %offset
   ret i64 %r
 }

 define i32 @select_20_43_i32(i32 %offset, i64 %x) {
 ; CHECK-LABEL: select_20_43_i32:
 ; CHECK: # %bb.0:
+; CHECK-NEXT: # kill: def $edi killed $edi def $rdi
+; CHECK-NEXT: leal 43(%rdi), %eax
+; CHECK-NEXT: addl $20, %edi
 ; CHECK-NEXT: cmpq $42, %rsi
-; CHECK-NEXT: movl $20, %ecx
-; CHECK-NEXT: movl $43, %eax
-; CHECK-NEXT: cmovgel %ecx, %eax
-; CHECK-NEXT: addl %edi, %eax
+; CHECK-NEXT: cmovgel %edi, %eax
 ; CHECK-NEXT: retq
   %b = icmp sgt i64 %x, 41
   %s = select i1 %b, i32 20, i32 43
   %r = add i32 %offset, %s
   ret i32 %r
 }

 define i16 @select_n2_17_i16(i16 %offset, i1 %b) {
 ; CHECK-LABEL: select_n2_17_i16:
 ; CHECK: # %bb.0:
+; CHECK-NEXT: # kill: def $edi killed $edi def $rdi
+; CHECK-NEXT: leal 17(%rdi), %eax
+; CHECK-NEXT: addl $65534, %edi # imm = 0xFFFE
 ; CHECK-NEXT: testb $1, %sil
-; CHECK-NEXT: movl $65534, %ecx # imm = 0xFFFE
-; CHECK-NEXT: movl $17, %eax
-; CHECK-NEXT: cmovnel %ecx, %eax
-; CHECK-NEXT: addl %edi, %eax
+; CHECK-NEXT: cmovnel %edi, %eax
 ; CHECK-NEXT: # kill: def $ax killed $ax killed $eax
 ; CHECK-NEXT: retq
   %s = select i1 %b, i16 -2, i16 17
   %r = add i16 %offset, %s
   ret i16 %r
 }

 %class.btAxis = type { %struct.btBroadphaseProxy.base, [3 x i16], [3 x i16], %struct.btBroadphaseProxy* }
 %struct.btBroadphaseProxy.base = type <{ i8, i16, i16, [4 x i8], i8, i32, [4 x float], [4 x float] }>
 %struct.btBroadphaseProxy = type <{ i8, i16, i16, [4 x i8], i8, i32, [4 x float], [4 x float], [4 x i8] }>

 define i16* @bullet(i1 %b, %class.btAxis* readnone %ptr, i64 %idx) {
 ; CHECK-LABEL: bullet:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: leaq (%rdx,%rdx,4), %rcx
-; CHECK-NEXT: shlq $4, %rcx
-; CHECK-NEXT: addq %rsi, %rcx
+; CHECK-NEXT: leaq (%rdx,%rdx,4), %rax
+; CHECK-NEXT: shlq $4, %rax
+; CHECK-NEXT: leaq 60(%rsi,%rax), %rcx
+; CHECK-NEXT: leaq 66(%rsi,%rax), %rax
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: movl $60, %edx
-; CHECK-NEXT: movl $66, %eax
-; CHECK-NEXT: cmovneq %rdx, %rax
-; CHECK-NEXT: addq %rcx, %rax
+; CHECK-NEXT: cmovneq %rcx, %rax
 ; CHECK-NEXT: retq
   %gep2 = getelementptr inbounds %class.btAxis, %class.btAxis* %ptr, i64 %idx, i32 2, i64 0
   %gep1 = getelementptr inbounds %class.btAxis, %class.btAxis* %ptr, i64 %idx, i32 1, i64 0
   %sel = select i1 %b, i16* %gep1, i16* %gep2
   ret i16* %sel
 }
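
A note on the `select_n2_17_i16` test: the i16 constant -2 shows up as `addl $65534, %edi`, a 32-bit add whose low 16 bits are the only part that is returned. The standalone sketch below (hypothetical function names, not part of the patch) checks that adding 0xFFFE in a wider register and truncating matches a 16-bit add of -2.

```cpp
#include <cassert>
#include <cstdint>

// 16-bit view: add -2 with wraparound modulo 2^16.
constexpr uint16_t addNeg2Narrow(uint16_t offset) {
  return static_cast<uint16_t>(offset - 2u);
}

// 32-bit view, as in the generated code: add 65534 (0xFFFE) in a wider
// register, then keep only the low 16 bits (%ax).
constexpr uint16_t addViaWideRegister(uint16_t offset) {
  uint32_t wide = static_cast<uint32_t>(offset) + 65534u;
  return static_cast<uint16_t>(wide);
}

// Hypothetical check helper covering the wraparound edge cases.
inline void checkNarrowAdd() {
  assert(addNeg2Narrow(0) == addViaWideRegister(0));
  assert(addNeg2Narrow(1) == addViaWideRegister(1));
  assert(addNeg2Narrow(65535) == addViaWideRegister(65535));
}
```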