Details

Reviewers

RKSimon
spatel
craig.topper

Commits

rGa3d5f1cf5d88: [x86] Fix infinite loop inside DAG combiner with lzcnt feature.

Summary

The issue affects targets supporting fast-lzcnt such as btver2.
This removes extraneous zext/trunc node insertions to fix the infinite loop.
This fixes Issue https://github.com/llvm/llvm-project/issues/54694

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pgousseau created this revision. Apr 1 2022, 5:57 AM

Herald added a project: Restricted Project. Apr 1 2022, 5:57 AM

Herald added subscribers: StephenFan, pengfei, hiraditya.

pgousseau requested review of this revision. Apr 1 2022, 5:57 AM

Herald added a subscriber: llvm-commits.

Added a regression test.

Do you know which commit introduced the problem, or which transforms conflict?
This looks like a pretty heavy hammer.

Harbormaster completed remote builds in B157399: Diff 419736. Apr 1 2022, 6:49 AM

In D122900#3422266, @lebedev.ri wrote:

Do you know which commit introduced the problem, or which transforms conflict?
This looks like a pretty heavy hammer.

Thank you Roman for reviewing!
The regression appeared after https://github.com/llvm/llvm-project/commit/8a156d1c2795189389fadbf33702384f522f2506 (thank you @wristow for finding it)
As @RKSimon found out, the IR does not get generated anymore from the original C++ test case but the issue is still present.
I am hopeful we can reintroduce the optimization (at a later codegen stage maybe?) but I would like to do some performance analysis first to see if it is still a win.

In D122900#3422344, @pgousseau wrote:

Thank you Roman for reviewing!
The regression appeared after https://github.com/llvm/llvm-project/commit/8a156d1c2795189389fadbf33702384f522f2506 (thank you @wristow for finding it)
As @RKSimon found out, the IR does not get generated anymore from the original C++ test case but the issue is still present.
I am hopeful we can reintroduce the optimization (at a later codegen stage maybe?) but I would like to do some performance analysis first to see if it is still a win.

That change might have exposed the bug from source/IR, but the backend loop has existed since at least the 10.0 release based on a godbolt check:
https://godbolt.org/z/Tfnaevboz

In D122900#3422365, @spatel wrote:

That change might have exposed the bug from source/IR, but the backend loop has existed since at least the 10.0 release based on a godbolt check:
https://godbolt.org/z/Tfnaevboz

Then I think just fixing it may be better than removing some chunks of code.

In D122900#3422413, @lebedev.ri wrote:

Then I think just fixing it may be better than removing some chunks of code.

Yes, I agree fixing would be best. Unfortunately, I can't see an obvious way to fix it inside the DAG combiner, nor can I tell whether one of the transformations involved is doing something wrong.
My understanding of the issue is that some optimizations disagree on which transformation is best, causing a ping-pong effect. I thought we could eventually fix it by moving the optimization to the instruction selection stage. Any other ideas?

@RKSimon suggested the transformation might be caused by unneeded trunc/zext operations.
They cause the DAG combiner to needlessly sink them under the OR operations, so I have removed them where possible, and this fixes the infinite loop.
The generated instructions end up being reordered in some cases, but I think the result is equivalent.
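For readers unfamiliar with the transform under discussion, the following is a minimal standalone sketch of the arithmetic identity it relies on, not the LLVM implementation. The helper names (`lzcnt32`, `isZeroViaLzcnt`, `anyZeroViaLzcnt`) are hypothetical, and `lzcnt32` is a portable stand-in for the x86 LZCNT instruction, assumed to return 32 for a zero input.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable stand-in for x86 LZCNT on 32-bit values.
// Counts leading zero bits and returns 32 when x == 0.
inline unsigned lzcnt32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// seteq(cmp x, 0) expressed as srl(ctlz x, log2(32)):
// only a count of exactly 32 has bit 5 set, so the shift yields 1 iff x == 0.
inline unsigned isZeroViaLzcnt(std::uint32_t x) { return lzcnt32(x) >> 5; }

// zext(or(seteq(x, 0), seteq(y, 0))) expressed as srl(or(ctlz x, ctlz y), 5):
// OR-ing the counts keeps bit 5 set if either count is 32.
inline unsigned anyZeroViaLzcnt(std::uint32_t x, std::uint32_t y) {
  return (lzcnt32(x) | lzcnt32(y)) >> 5;
}
```

Because every count is at most 32, bit 5 of the OR-ed value is set exactly when at least one input is zero, which is why a single shift can follow the OR. For example, `assert(anyZeroViaLzcnt(0, 5) == 1)` and `assert(anyZeroViaLzcnt(7, 5) == 0)` both hold.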

Remove unused parameter.

Harbormaster completed remote builds in B157781: Diff 420247. Apr 4 2022, 12:27 PM

RKSimon added inline comments. Apr 5 2022, 12:08 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
47701–47702: Is Ret ever null at this point?

Remove unneeded Ret value check.

Fixed! thanks

Harbormaster completed remote builds in B157947: Diff 420464. Apr 5 2022, 6:28 AM

LGTM - thanks for dealing with this

This revision is now accepted and ready to land. Apr 5 2022, 7:24 AM

This revision was landed with ongoing or failed builds. Apr 5 2022, 9:32 AM

Closed by commit rGa3d5f1cf5d88: [x86] Fix infinite loop inside DAG combiner with lzcnt feature. (authored by pgousseau).

This revision was automatically updated to reflect the committed changes.

pgousseau added a commit: rGa3d5f1cf5d88: [x86] Fix infinite loop inside DAG combiner with lzcnt feature.

Diff 420551

llvm/lib/Target/X86/X86ISelLowering.cpp

[47,600 unchanged lines hidden]
 }

 // Helper function for combineOrCmpEqZeroToCtlzSrl
 // Transforms:
 // seteq(cmp x, 0)
 // into:
 // srl(ctlz x), log2(bitsize(x))
 // Input pattern is checked by caller.
-static SDValue lowerX86CmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy,
-                                          SelectionDAG &DAG) {
+static SDValue lowerX86CmpEqZeroToCtlzSrl(SDValue Op, SelectionDAG &DAG) {
   SDValue Cmp = Op.getOperand(1);
   EVT VT = Cmp.getOperand(0).getValueType();
   unsigned Log2b = Log2_32(VT.getSizeInBits());
   SDLoc dl(Op);
   SDValue Clz = DAG.getNode(ISD::CTLZ, dl, VT, Cmp->getOperand(0));
   // The result of the shift is true or false, and on X86, the 32-bit
   // encoding of shr and lzcnt is more desirable.
   SDValue Trunc = DAG.getZExtOrTrunc(Clz, dl, MVT::i32);
   SDValue Scc = DAG.getNode(ISD::SRL, dl, MVT::i32, Trunc,
                             DAG.getConstant(Log2b, dl, MVT::i8));
-  return DAG.getZExtOrTrunc(Scc, dl, ExtTy);
+  return Scc;
 }

 // Try to transform:
 // zext(or(setcc(eq, (cmp x, 0)), setcc(eq, (cmp y, 0))))
 // into:
 // srl(or(ctlz(x), ctlz(y)), log2(bitsize(x))
 // Will also attempt to match more generic cases, eg:
 // zext(or(or(setcc(eq, cmp 0), setcc(eq, cmp 0)), setcc(eq, cmp 0)))
[43 unchanged lines hidden]
   if (!(isSetCCCandidate(LHS) && isSetCCCandidate(RHS)) ||
       !isORCandidate(SDValue(OR, 0)))
     return SDValue();

   // We have a or(setcc(eq, cmp 0), setcc(eq, cmp 0)) pattern, try to lower it
   // to
   // or(srl(ctlz),srl(ctlz)).
   // The dag combiner can then fold it into:
   // srl(or(ctlz, ctlz)).
-  EVT VT = OR->getValueType(0);
-  SDValue NewLHS = lowerX86CmpEqZeroToCtlzSrl(LHS, VT, DAG);
+  SDValue NewLHS = lowerX86CmpEqZeroToCtlzSrl(LHS, DAG);
   SDValue Ret, NewRHS;
-  if (NewLHS && (NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG)))
-    Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, NewLHS, NewRHS);
+  if (NewLHS && (NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, DAG)))
+    Ret = DAG.getNode(ISD::OR, SDLoc(OR), MVT::i32, NewLHS, NewRHS);

   if (!Ret)
     return SDValue();

   // Try to lower nodes matching the or(or, setcc(eq, cmp 0)) pattern.
   while (ORNodes.size() > 0) {
     OR = ORNodes.pop_back_val();
     LHS = OR->getOperand(0);
     RHS = OR->getOperand(1);
     // Swap rhs with lhs to match or(setcc(eq, cmp, 0), or).
     if (RHS->getOpcode() == ISD::OR)
       std::swap(LHS, RHS);
-    NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG);
+    NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, DAG);
     if (!NewRHS)
       return SDValue();
-    Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, Ret, NewRHS);
+    Ret = DAG.getNode(ISD::OR, SDLoc(OR), MVT::i32, Ret, NewRHS);
   }

-  if (Ret)
-    Ret = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), N->getValueType(0), Ret);
-
-  return Ret;
+  return DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), N->getValueType(0), Ret);
 }

 static SDValue foldMaskedMergeImpl(SDValue And0_L, SDValue And0_R,
                                    SDValue And1_L, SDValue And1_R, SDLoc DL,
                                    SelectionDAG &DAG) {
   if (!isBitwiseNot(And0_L, true) || !And0_L->hasOneUse())
     return SDValue();
   SDValue NotOp = And0_L->getOperand(0);
[8,061 unchanged lines hidden]

llvm/test/CodeGen/X86/lzcnt-zext-cmp.ll

[148 unchanged lines hidden]
 entry:
   %lor.ext = zext i1 %0 to i32
   ret i32 %lor.ext
 }

 ; Test three 32-bit inputs, output is 32-bit.
 define i32 @test_zext_cmp6(i32 %a, i32 %b, i32 %c) {
 ; FASTLZCNT-LABEL: test_zext_cmp6:
 ; FASTLZCNT: # %bb.0: # %entry
-; FASTLZCNT-NEXT: lzcntl %edi, %eax
-; FASTLZCNT-NEXT: lzcntl %esi, %ecx
-; FASTLZCNT-NEXT: orl %eax, %ecx
+; FASTLZCNT-NEXT: lzcntl %edi, %ecx
 ; FASTLZCNT-NEXT: lzcntl %edx, %eax
+; FASTLZCNT-NEXT: lzcntl %esi, %esi
 ; FASTLZCNT-NEXT: orl %ecx, %eax
+; FASTLZCNT-NEXT: orl %esi, %eax
 ; FASTLZCNT-NEXT: shrl $5, %eax
 ; FASTLZCNT-NEXT: retq
 ;
 ; NOFASTLZCNT-LABEL: test_zext_cmp6:
 ; NOFASTLZCNT: # %bb.0: # %entry
 ; NOFASTLZCNT-NEXT: testl %edi, %edi
 ; NOFASTLZCNT-NEXT: sete %al
 ; NOFASTLZCNT-NEXT: testl %esi, %esi
[14 unchanged lines hidden]
 entry:
   ret i32 %lor.ext
 }

 ; Test three 32-bit inputs, output is 32-bit, but compared to test_zext_cmp6 test,
 ; %.cmp2 inputs' order is inverted.
 define i32 @test_zext_cmp7(i32 %a, i32 %b, i32 %c) {
 ; FASTLZCNT-LABEL: test_zext_cmp7:
 ; FASTLZCNT: # %bb.0: # %entry
-; FASTLZCNT-NEXT: lzcntl %edi, %eax
-; FASTLZCNT-NEXT: lzcntl %esi, %ecx
-; FASTLZCNT-NEXT: orl %eax, %ecx
+; FASTLZCNT-NEXT: lzcntl %edi, %ecx
 ; FASTLZCNT-NEXT: lzcntl %edx, %eax
+; FASTLZCNT-NEXT: lzcntl %esi, %esi
 ; FASTLZCNT-NEXT: orl %ecx, %eax
+; FASTLZCNT-NEXT: orl %esi, %eax
 ; FASTLZCNT-NEXT: shrl $5, %eax
 ; FASTLZCNT-NEXT: retq
 ;
 ; NOFASTLZCNT-LABEL: test_zext_cmp7:
 ; NOFASTLZCNT: # %bb.0: # %entry
 ; NOFASTLZCNT-NEXT: testl %edi, %edi
 ; NOFASTLZCNT-NEXT: sete %al
 ; NOFASTLZCNT-NEXT: testl %esi, %esi
[125 unchanged lines hidden]
 ; ALL-NEXT: retq
 entry:
   %cmp = fcmp fast oeq double %a, 0.000000e+00
   %cmp1 = fcmp fast oeq double %b, 0.000000e+00
   %0 = or i1 %cmp, %cmp1
   %conv = zext i1 %0 to i32
   ret i32 %conv
 }

+; PR54694 Fix an infinite loop in DAG combiner.
+define i32 @test_zext_cmp12(i32 %0, i32 %1) {
+; FASTLZCNT-LABEL: test_zext_cmp12:
+; FASTLZCNT: # %bb.0:
+; FASTLZCNT-NEXT: andl $131072, %edi # imm = 0x20000
+; FASTLZCNT-NEXT: andl $131072, %esi # imm = 0x20000
+; FASTLZCNT-NEXT: lzcntl %edi, %eax
+; FASTLZCNT-NEXT: lzcntl %esi, %ecx
+; FASTLZCNT-NEXT: orl %eax, %ecx
+; FASTLZCNT-NEXT: movl $2, %eax
+; FASTLZCNT-NEXT: shrl $5, %ecx
+; FASTLZCNT-NEXT: subl %ecx, %eax
+; FASTLZCNT-NEXT: retq
+;
+; NOFASTLZCNT-LABEL: test_zext_cmp12:
+; NOFASTLZCNT: # %bb.0:
+; NOFASTLZCNT-NEXT: testl $131072, %edi # imm = 0x20000
+; NOFASTLZCNT-NEXT: sete %al
+; NOFASTLZCNT-NEXT: testl $131072, %esi # imm = 0x20000
+; NOFASTLZCNT-NEXT: sete %cl
+; NOFASTLZCNT-NEXT: orb %al, %cl
+; NOFASTLZCNT-NEXT: movl $2, %eax
+; NOFASTLZCNT-NEXT: movzbl %cl, %ecx
+; NOFASTLZCNT-NEXT: subl %ecx, %eax
+; NOFASTLZCNT-NEXT: retq
+  %3 = and i32 %0, 131072
+  %4 = icmp eq i32 %3, 0
+  %5 = and i32 %1, 131072
+  %6 = icmp eq i32 %5, 0
+  %7 = select i1 %4, i1 true, i1 %6
+  %8 = select i1 %7, i32 1, i32 2
+  ret i32 %8
+}
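As a rough cross-check of the new regression test, the sketch below models in plain C++ what the IR of `test_zext_cmp12` computes and what the expected FASTLZCNT instruction sequence computes. This is an illustration under assumptions, not part of the patch: the function names are hypothetical, and `lzcnt32` is a portable stand-in for `lzcntl` that is assumed to return 32 for a zero input.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable stand-in for lzcntl; returns 32 when x == 0.
inline unsigned lzcnt32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// Mirrors the IR: select(or(icmp eq (a & 0x20000), 0,
//                           icmp eq (b & 0x20000), 0), 1, 2).
inline std::uint32_t testZextCmp12Reference(std::uint32_t a, std::uint32_t b) {
  bool anyClear = (a & 0x20000u) == 0 || (b & 0x20000u) == 0;
  return anyClear ? 1u : 2u;
}

// Mirrors the FASTLZCNT check lines: andl, andl, lzcntl, lzcntl, orl,
// movl $2, shrl $5, subl.
inline std::uint32_t testZextCmp12Lzcnt(std::uint32_t a, std::uint32_t b) {
  std::uint32_t eax = lzcnt32(a & 0x20000u);
  std::uint32_t ecx = lzcnt32(b & 0x20000u);
  ecx |= eax;      // bit 5 is set iff either masked value is zero
  ecx >>= 5;       // 1 if either masked value is zero, else 0
  return 2u - ecx; // 1 when any masked value is zero, 2 otherwise
}
```

With bit 17 set in both inputs, each count is 14 and the result is 2; with the bit clear in either input, one count is 32 and the result is 1, which matches the reference function.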

This is an archive of the discontinued LLVM Phabricator instance.

[x86] Fix infinite loop inside DAG combiner lzcnt's optimization.
ClosedPublic