This is an archive of the discontinued LLVM Phabricator instance.

[x86] improve CMOV codegen by pushing add into operands, part 3
ClosedPublic

Authored by spatel on Jul 27 2021, 2:30 PM.

Details

Reviewers

craig.topper
pengfei
lebedev.ri
RKSimon

Commits

rG4c41caa28710: [x86] improve CMOV codegen by pushing add into operands, part 3

Summary

In this episode, we are trying to avoid an x86 micro-arch quirk where complex (3 operand) LEA potentially costs significantly more than simple LEA. So we simultaneously push and pull the math around the CMOV to balance the operations.

I looked at the debug spew during instruction selection and decided against trying a later DAGToDAG transform -- it seems very difficult to match if the trailing memops are already selected and managing the creation of extra instructions at that level is always tricky.
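The rebalancing described in the summary relies on addition being associative and commutative, so moving one addend inside the select and leaving the other outside does not change the final address. A minimal sketch in plain C++ follows, with unsigned integers standing in for DAG values. The function names `addOfCmov`, `rebalanced`, and `checkEquivalence` are hypothetical and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Original shape: add (cmov C1, C2), (add X, Y)
std::uint64_t addOfCmov(bool Cond, std::uint64_t C1, std::uint64_t C2,
                        std::uint64_t X, std::uint64_t Y) {
  std::uint64_t Cmov = Cond ? C1 : C2;
  return Cmov + (X + Y);
}

// Rebalanced shape: add (cmov (add X, C1), (add X, C2)), Y
std::uint64_t rebalanced(bool Cond, std::uint64_t C1, std::uint64_t C2,
                         std::uint64_t X, std::uint64_t Y) {
  std::uint64_t Cmov = Cond ? (X + C1) : (X + C2);
  return Cmov + Y;
}

// Checks that both shapes agree for a few sample values, including
// values that wrap around in 64-bit unsigned arithmetic.
void checkEquivalence() {
  const std::uint64_t Samples[] = {0, 1, 60, 66, 0x1000, UINT64_MAX};
  const bool Conds[] = {false, true};
  for (std::uint64_t X : Samples)
    for (std::uint64_t Y : Samples)
      for (bool Cond : Conds)
        assert(addOfCmov(Cond, 60, 66, X, Y) == rebalanced(Cond, 60, 66, X, Y));
}
```

With the constants from the tests (60 and 66), both shapes produce the same value for either condition. Only the second shape leaves a single trailing add of Y, which is the piece that can fold into the addressing mode of the memory operation.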

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Jul 27 2021, 2:30 PM

Herald added subscribers: hiraditya, mcrosier. Jul 27 2021, 2:30 PM

spatel requested review of this revision. Jul 27 2021, 2:30 PM

Herald added a project: Restricted Project. Jul 27 2021, 2:30 PM

I think this makes sense.

llvm/test/CodeGen/X86/add-cmov.ll
277	This does seem like an improvement https://godbolt.org/z/obf4sGcKW

Harbormaster completed remote builds in B116545: Diff 362176. Jul 27 2021, 4:53 PM

pengfei added inline comments. Jul 27 2021, 6:22 PM

llvm/test/CodeGen/X86/add-cmov.ll
284–285	lea won't affect eflags, so it may still be better than add in some complex scenarios I guess?

spatel added inline comments. Jul 27 2021, 6:37 PM

llvm/test/CodeGen/X86/add-cmov.ll
284–285	That's true in general, but I don't think it can be a factor in this transform because we know that the cmov is a user of eflags, so it is already being set by some other op.

RKSimon added inline comments. Jul 27 2021, 11:48 PM

llvm/test/CodeGen/X86/add-cmov.ll
284–285	We already have passes that convert add -> lea to break interfering eflags dependencies - I think this is fine.

RKSimon added inline comments. Jul 28 2021, 3:52 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
49975	We usually canonicalize add constants to RHS - does this actually cause problems?

spatel added inline comments. Jul 28 2021, 4:25 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
49975	I haven't been able to come up with a test to show this pattern, but I was worried that we could get in here in some intermediate non-canonical state. If that happens, we'd likely hit an infinite loop with the previous transform, so I figured it was better to play it safe and check for constant.

LGTM - cheers

This revision is now accepted and ready to land. Jul 28 2021, 5:14 AM

Closed by commit rG4c41caa28710: [x86] improve CMOV codegen by pushing add into operands, part 3 (authored by spatel). Jul 28 2021, 6:11 AM

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG4c41caa28710: [x86] improve CMOV codegen by pushing add into operands, part 3.

Revision Contents

Path                                       Size
llvm/lib/Target/X86/X86ISelLowering.cpp    25 lines
llvm/test/CodeGen/X86/add-cmov.ll          40 lines

Diff 362358

llvm/lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ static SDValue pushAddIntoCmovOfConsts(SDNode *N, SelectionDAG &DAG) {
   // Match an appropriate CMOV as the first operand of the add.
   SDValue Cmov = N->getOperand(0);
   SDValue OtherOp = N->getOperand(1);
   if (!isSuitableCmov(Cmov))
     std::swap(Cmov, OtherOp);
   if (!isSuitableCmov(Cmov))
     return SDValue();

-  // add (cmov C1, C2), OtherOp --> cmov (add OtherOp, C1), (add OtherOp, C2)
   EVT VT = N->getValueType(0);
   SDLoc DL(N);
   SDValue FalseOp = Cmov.getOperand(0);
   SDValue TrueOp = Cmov.getOperand(1);

+  // We will push the add through the select, but we can potentially do better
+  // if we know there is another add in the sequence and this is pointer math.
+  // In that case, we can absorb an add into the trailing memory op and avoid
+  // a 3-operand LEA which is likely slower than a 2-operand LEA.
+  // TODO: If target has "slow3OpsLEA", do this even without the trailing memop?
+  if (OtherOp.getOpcode() == ISD::ADD && OtherOp.hasOneUse() &&
+      !isa<ConstantSDNode>(OtherOp.getOperand(0)) &&
+      all_of(N->uses(), [&](SDNode *Use) {
+        auto *MemNode = dyn_cast<MemSDNode>(Use);
+        return MemNode && MemNode->getBasePtr().getNode() == N;
+      })) {
+    // add (cmov C1, C2), add (X, Y) --> add (cmov (add X, C1), (add X, C2)), Y
+    // TODO: We are arbitrarily choosing op0 as the 1st piece of the sum, but
+    //       it is possible that choosing op1 might be better.
+    SDValue X = OtherOp.getOperand(0), Y = OtherOp.getOperand(1);
+    FalseOp = DAG.getNode(ISD::ADD, DL, VT, X, FalseOp);
+    TrueOp = DAG.getNode(ISD::ADD, DL, VT, X, TrueOp);
+    Cmov = DAG.getNode(X86ISD::CMOV, DL, VT, FalseOp, TrueOp,
+                       Cmov.getOperand(2), Cmov.getOperand(3));
+    return DAG.getNode(ISD::ADD, DL, VT, Cmov, Y);
+  }
+
+  // add (cmov C1, C2), OtherOp --> cmov (add OtherOp, C1), (add OtherOp, C2)
   FalseOp = DAG.getNode(ISD::ADD, DL, VT, OtherOp, FalseOp);
   TrueOp = DAG.getNode(ISD::ADD, DL, VT, OtherOp, TrueOp);
   return DAG.getNode(X86ISD::CMOV, DL, VT, FalseOp, TrueOp, Cmov.getOperand(2),
                      Cmov.getOperand(3));
 }

 static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
                           TargetLowering::DAGCombinerInfo &DCI,
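
The new guard in pushAddIntoCmovOfConsts only fires when every user of the add is a memory operation that addresses through it, which is what lets the leftover add fold into the addressing mode. That check can be sketched outside LLVM as follows. `ToyNode` and `allUsersAddressThrough` are hypothetical stand-ins, not SelectionDAG APIs, and the sketch uses `std::all_of` in place of LLVM's range-based `all_of`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for SDNode/MemSDNode; not LLVM's real classes.
struct ToyNode {
  bool IsMemOp;           // true for a load/store-like user
  const ToyNode *BasePtr; // address operand; only meaningful when IsMemOp
};

// Returns true only if every user is a memory op whose base pointer is N.
// As with any all-of check, an empty user list yields true.
bool allUsersAddressThrough(const ToyNode &N,
                            const std::vector<const ToyNode *> &Users) {
  return std::all_of(Users.begin(), Users.end(), [&](const ToyNode *Use) {
    return Use->IsMemOp && Use->BasePtr == &N;
  });
}
```

A user that stores the add's result as data, or any non-memory user such as a return, would need the full sum in a register, so the sketch rejects those cases just as the guard falls through to the plain push-into-cmov path.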

llvm/test/CodeGen/X86/add-cmov.ll

@@ ... @@ ; CHECK-NEXT: retq
   %idx40 = mul i64 %idx, 40
   %gep2 = getelementptr inbounds i16, i16* %ptr, i64 33
   %gep1 = getelementptr inbounds i16, i16* %ptr, i64 30
   %sel = select i1 %b, i16* %gep1, i16* %gep2
   %gep3 = getelementptr inbounds i16, i16* %sel, i64 %idx40
   ret i16* %gep3
 }

 define void @bullet_load_store(i32 %x, i64 %y, %class.btAxis* %p) {
 ; CHECK-LABEL: bullet_load_store:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: leaq (%rsi,%rsi,4), %rax
 ; CHECK-NEXT: shlq $4, %rax
+; CHECK-NEXT: leaq 66(%rdx), %rcx
+; CHECK-NEXT: addq $60, %rdx
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: leaq 60(%rdx,%rax), %rcx
-; CHECK-NEXT: leaq 66(%rdx,%rax), %rax
-; CHECK-NEXT: cmoveq %rcx, %rax
-; CHECK-NEXT: decw (%rax)
+; CHECK-NEXT: cmovneq %rcx, %rdx
+; CHECK-NEXT: decw (%rdx,%rax)
 ; CHECK-NEXT: retq
   %and = and i32 %x, 1
   %b = icmp eq i32 %and, 0
   %gep2 = getelementptr inbounds %class.btAxis, %class.btAxis* %p, i64 %y, i32 2, i64 0
   %gep1 = getelementptr inbounds %class.btAxis, %class.btAxis* %p, i64 %y, i32 1, i64 0
   %sel = select i1 %b, i16* %gep1, i16* %gep2
   %ld = load i16, i16* %sel, align 4
   %dec = add i16 %ld, -1
   store i16 %dec, i16* %sel, align 4
   ret void
 }

 define void @complex_lea_alt1(i1 %b, i16* readnone %ptr, i64 %idx) {
 ; CHECK-LABEL: complex_lea_alt1:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: leaq 60(%rdx,%rsi), %rax
-; CHECK-NEXT: leaq 66(%rdx,%rsi), %rcx
+; CHECK-NEXT: leaq 60(%rdx), %rax
+; CHECK-NEXT: addq $66, %rdx
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: cmovneq %rax, %rcx
-; CHECK-NEXT: decw (%rcx)
+; CHECK-NEXT: cmovneq %rax, %rdx
+; CHECK-NEXT: decw (%rdx,%rsi)
 ; CHECK-NEXT: retq
   %i = ptrtoint i16* %ptr to i64
   %sum = add i64 %idx, %i
   %base = inttoptr i64 %sum to i16*
   %gep2 = getelementptr inbounds i16, i16* %base, i64 33
   %gep1 = getelementptr inbounds i16, i16* %base, i64 30
   %sel = select i1 %b, i16* %gep1, i16* %gep2
   %ld = load i16, i16* %sel, align 4
   %dec = add i16 %ld, -1
   store i16 %dec, i16* %sel, align 4
   ret void
 }

 define void @complex_lea_alt2(i1 %b, i16* readnone %ptr, i64 %idx) {
 ; CHECK-LABEL: complex_lea_alt2:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: leaq 60(%rsi,%rdx), %rax
-; CHECK-NEXT: leaq 66(%rsi,%rdx), %rcx
+; CHECK-NEXT: leaq 60(%rsi), %rax
+; CHECK-NEXT: addq $66, %rsi
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: cmovneq %rax, %rcx
-; CHECK-NEXT: decw (%rcx)
+; CHECK-NEXT: cmovneq %rax, %rsi
+; CHECK-NEXT: decw (%rsi,%rdx)
 ; CHECK-NEXT: retq
   %i = ptrtoint i16* %ptr to i64
   %sum = add i64 %i, %idx
   %base = inttoptr i64 %sum to i16*
   %gep2 = getelementptr inbounds i16, i16* %base, i64 33
   %gep1 = getelementptr inbounds i16, i16* %base, i64 30
   %sel = select i1 %b, i16* %gep1, i16* %gep2
   %ld = load i16, i16* %sel, align 4
@@ ... @@ ; CHECK-NEXT: retq
   %dec = add i16 %ld, -1
   store i16 %dec, i16* %sel, align 4
   ret void
 }

 define void @complex_lea_alt7(i1 %b, i16* readnone %ptr, i64 %idx) {
 ; CHECK-LABEL: complex_lea_alt7:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: leaq 60(%rdx,%rsi), %rax
-; CHECK-NEXT: leaq 66(%rdx,%rsi), %rcx
+; CHECK-NEXT: leaq 60(%rdx), %rax
+; CHECK-NEXT: addq $66, %rdx
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: cmovneq %rax, %rcx
-; CHECK-NEXT: decw (%rcx)
+; CHECK-NEXT: cmovneq %rax, %rdx
+; CHECK-NEXT: decw (%rdx,%rsi)
 ; CHECK-NEXT: retq
   %i = ptrtoint i16* %ptr to i64
   %o = add i64 %idx, %i
   %o66 = add i64 %o, 66
   %o60 = add i64 %o, 60
   %p66 = inttoptr i64 %o66 to i16*
   %p60 = inttoptr i64 %o60 to i16*
   %sel = select i1 %b, i16* %p60, i16* %p66
   %ld = load i16, i16* %sel, align 4
   %dec = add i16 %ld, -1
   store i16 %dec, i16* %sel, align 4
   ret void
 }

 define void @complex_lea_alt8(i1 %b, i16* readnone %ptr, i64 %idx) {
 ; CHECK-LABEL: complex_lea_alt8:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: leaq 60(%rsi,%rdx), %rax
-; CHECK-NEXT: leaq 66(%rsi,%rdx), %rcx
+; CHECK-NEXT: leaq 60(%rsi), %rax
+; CHECK-NEXT: addq $66, %rsi
 ; CHECK-NEXT: testb $1, %dil
-; CHECK-NEXT: cmovneq %rax, %rcx
-; CHECK-NEXT: decw (%rcx)
+; CHECK-NEXT: cmovneq %rax, %rsi
+; CHECK-NEXT: decw (%rsi,%rdx)
 ; CHECK-NEXT: retq
   %i = ptrtoint i16* %ptr to i64
   %o = add i64 %i, %idx
   %o66 = add i64 %o, 66
   %o60 = add i64 %o, 60
   %p66 = inttoptr i64 %o66 to i16*
   %p60 = inttoptr i64 %o60 to i16*
   %sel = select i1 %b, i16* %p60, i16* %p66
   %ld = load i16, i16* %sel, align 4
   %dec = add i16 %ld, -1
   store i16 %dec, i16* %sel, align 4
   ret void
 }