
Details

Reviewers

pengfei
RKSimon

Commits

rGcf8fadcf9b93: Match (xor TSize - 1, ctlz) to `bsr` instead of `lzcnt` + `xor`

Summary

Previously, codegen was de-optimized when -march supported lzcnt: it emitted
`lzcnt` + `xor` where `bsr` alone suffices, and there is no reason to add the extra instruction.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

goldstein.w.n created this revision. Jan 11 2023, 12:47 AM

Herald added a project: Restricted Project. Jan 11 2023, 12:47 AM

Herald added subscribers: pengfei, hiraditya.

goldstein.w.n requested review of this revision. Jan 11 2023, 12:47 AM

Herald added a subscriber: llvm-commits.

goldstein.w.n added reviewers: pengfei, RKSimon. Jan 11 2023, 12:54 AM

craig.topper added a subscriber: craig.topper. Jan 11 2023, 12:57 AM

craig.topper added inline comments.

llvm/lib/Target/X86/X86InstrCompiler.td
2254 ↗	(On Diff #488100)	This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value. I think this is only valid for the zero_undef nodes.

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multi-uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info, it's improved on Zen4.
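
The zero-input concern from the inline comment can be checked with a small model. This is only a sketch: `ctlz`, `bsrModel`, and `checkXorCtlzAgainstBsr16` are hypothetical helper names, not LLVM APIs, and the model represents the undefined `bsr` result simply by leaving the output untouched.

```cpp
#include <cassert>
#include <cstdint>

// Fully defined count-leading-zeros over the low `width` bits:
// returns `width` when x == 0, like ISD::CTLZ.
inline unsigned ctlz(std::uint64_t x, unsigned width) {
  unsigned n = 0;
  for (unsigned bit = width; bit-- > 0;) {
    if ((x >> bit) & 1u)
      break;
    ++n;
  }
  return n;
}

// Model of `bsr`: writes the index of the highest set bit and returns true.
// For x == 0 the result is not defined, so the model returns false and
// leaves `index` untouched.
inline bool bsrModel(std::uint64_t x, unsigned &index) {
  if (x == 0)
    return false;
  index = 0;
  while (x >>= 1)
    ++index;
  return true;
}

// Asserts that (15 ^ ctlz16(x)) equals the bsr index for every non-zero
// 16-bit value. Zero is deliberately excluded.
inline void checkXorCtlzAgainstBsr16() {
  for (std::uint32_t x = 1; x <= 0xFFFFu; ++x) {
    unsigned index = 0;
    bool defined = bsrModel(x, index);
    assert(defined && (15u ^ ctlz(x, 16)) == index);
    (void)defined;
  }
}
```

For a zero input the xor form yields `15 ^ 16 == 31`, while the `bsr` model has no defined result, which is why the fold is only sound when a zero input is already undefined.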

In D141464#4042695, @craig.topper wrote:

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multi-uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info, it's improved on Zen4.

It would be fair to not do this on zen1/2/3. We could also make it last-use only (where dst==src is almost guaranteed) if the dependency
on dst is a major concern, although for many Intel x86 implementations lzcnt also has a false dependency on dst, so it's mostly equivalent.

goldstein.w.n added inline comments. Jan 11 2023, 1:53 AM

llvm/lib/Target/X86/X86InstrCompiler.td

2254 ↗

(On Diff #488100)

This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value.

I think this is only valid for the zero_undef nodes.

I think you're right although call i32 @llvm.ctlz.i32(i32, i1 true) doesn't seem to match ctlz_zero_undef.
Also taking a look at lzcnt def:

  def LZCNT16rr : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src),
                    "lzcnt{w}\t{$src, $dst|$dst, $src}",
                    [(set GR16:$dst, (ctlz GR16:$src)), (implicit EFLAGS)]>,
                    XS, OpSize16, Sched<[WriteLZCNT]>;
  def LZCNT16rm : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src),
                    "lzcnt{w}\t{$src, $dst|$dst, $src}",
                    [(set GR16:$dst, (ctlz (loadi16 addr:$src))),
                     (implicit EFLAGS)]>, XS, OpSize16, Sched<[WriteLZCNTLd]>;

  def LZCNT32rr : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src),
                    "lzcnt{l}\t{$src, $dst|$dst, $src}",
                    [(set GR32:$dst, (ctlz GR32:$src)), (implicit EFLAGS)]>,
                    XS, OpSize32, Sched<[WriteLZCNT]>;
...

Which is just set as ctlz so I thought it was just a very confusing overloaded wording in the TD. Do you know what I'm missing?

Harbormaster completed remote builds in B207006: Diff 488100. Jan 11 2023, 2:11 AM

We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt

In D141464#4042773, @goldstein.w.n wrote:

In D141464#4042695, @craig.topper wrote:

The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.

bsr is also a multi-uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info, it's improved on Zen4.

It would be fair to not do this on zen1/2/3. We could also make it last-use only (where dst==src is almost guaranteed) if the dependency
on dst is a major concern, although for many Intel x86 implementations lzcnt also has a false dependency on dst, so it's mostly equivalent.

The false dependency was fixed on Skylake, which was released in 2015. Is it safe to assume the majority of users won't be using a CPU older than that?

Only do for ctlz_zero_undef. Add more tests

goldstein.w.n added a parent revision: D141549: [X86] Add additional tests for ctlz{_zero_undef} to test folding with xor; NFC. Jan 11 2023, 3:17 PM

Harbormaster completed remote builds in B207225: Diff 488397. Jan 11 2023, 3:17 PM

In D141464#4042876, @RKSimon wrote:

We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt

Done in V2 (I think, used HasFastLZCNT).

llvm/lib/Target/X86/X86ISelLowering.cpp
51899	Use curly braces here for consistency with the other blocks.
51923	This needs to be

  SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
  Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);

We have X86ISD::BSR plumbed to also return flags, so it has a second i32 output for the flags.

craig.topper added inline comments. Jan 11 2023, 4:04 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
51912	You can use C->getZExtValue() since we know the VT is 64-bits or less.

goldstein.w.n marked 3 inline comments as done. Jan 11 2023, 9:50 PM

Fix some nits

Harbormaster completed remote builds in B207275: Diff 488472. Jan 11 2023, 11:22 PM

pengfei added inline comments. Jan 12 2023, 12:39 AM

llvm/test/CodeGen/X86/clz.ll
971–976	The FIXME is solved :) But it may be better to leave a comment that the `FASTLZCNT` behavior is intended.

goldstein.w.n marked an inline comment as done. Jan 12 2023, 9:33 AM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
51923	> This needs to be
>
>   SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
>   Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);
>
> We have X86ISD::BSR plumbed to also return flags, so it has a second i32 output for the flags.

Propagate test changes. Update the FIXME comment with a comment explaining fast-lzcnt.

Harbormaster completed remote builds in B207435: Diff 488697. Jan 12 2023, 11:31 AM

Rebase

Harbormaster completed remote builds in B208158: Diff 489702. Jan 16 2023, 11:36 PM

Just a couple of minor comments

llvm/lib/Target/X86/X86ISelLowering.cpp
52076	(style) assert message
52108	if (!C)
52126	(style) remove braces
llvm/test/CodeGen/X86/clz.ll
971–976	What target can manage 1c latency bsr?

Fix some nits

llvm/lib/Target/X86/X86ISelLowering.cpp
52108	> if (!C)
Are the two equivalent? I always thought `!C` is a zero/non-zero check, but `nullptr != 0` is technically possible.
llvm/test/CodeGen/X86/clz.ll
971–976	> What target can manage 1c latency bsr?
Zen4 and I think a few other AMD ones (although more have 1c lzcnt and expensive bsr).
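
On the `if (!C)` question: in C++ the two spellings are equivalent, because `!C` first converts the pointer to `bool`, and that conversion maps the null pointer value to `false` regardless of how the platform represents it. A minimal sketch, where `FakeConstantNode` and `asConstant` are hypothetical stand-ins for `ConstantSDNode` and `dyn_cast<ConstantSDNode>` rather than the LLVM types:

```cpp
#include <cassert>

// Hypothetical stand-in for ConstantSDNode; not the LLVM class.
struct FakeConstantNode {
  unsigned long long value;
};

// Hypothetical stand-in for dyn_cast<ConstantSDNode>(): yields a null
// pointer when the operand is not a constant.
inline const FakeConstantNode *asConstant(const FakeConstantNode *node,
                                          bool isConstant) {
  return isConstant ? node : nullptr;
}

// Both spellings test for the null pointer value. `!c` converts the pointer
// to bool first, and that conversion maps a null pointer to false whatever
// bit pattern the platform uses to represent it.
inline bool bailsOutWithNot(const FakeConstantNode *c) { return !c; }
inline bool bailsOutWithCompare(const FakeConstantNode *c) {
  return c == nullptr;
}

// Asserts that the two checks agree for a given pointer.
inline void checkEquivalence(const FakeConstantNode *c) {
  assert(bailsOutWithNot(c) == bailsOutWithCompare(c));
}
```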

Harbormaster completed remote builds in B208537: Diff 490212. Jan 18 2023, 10:52 AM

ping.

ping2.

Matt added a subscriber: Matt. Feb 1 2023, 8:23 PM

ping3.

LGTM - cheers

This revision is now accepted and ready to land. Feb 6 2023, 8:28 AM

This revision was landed with ongoing or failed builds. Feb 6 2023, 12:16 PM

Closed by commit rGcf8fadcf9b93: Match (xor TSize - 1, ctlz) to `bsr` instead of `lzcnt` + `xor` (authored by goldstein.w.n).

This revision was automatically updated to reflect the committed changes.

goldstein.w.n added a commit: rGcf8fadcf9b93: Match (xor TSize - 1, ctlz) to `bsr` instead of `lzcnt` + `xor`.

Diff 495240

llvm/lib/Target/X86/X86ISelLowering.cpp


@@ 51,890 lines not shown @@ for (unsigned i = 0; i != NumElems; ++i) {
     if (ZExtIn != N00In || SExtIn != N01In ||
         ZExtIn != N10In || SExtIn != N11In)
       return SDValue();
   }

   auto PMADDBuilder = [](SelectionDAG &DAG, const SDLoc &DL,
                          ArrayRef<SDValue> Ops) {
     // Shrink by adding truncate nodes and let DAGCombine fold with the
     // sources.
     EVT InVT = Ops[0].getValueType();
     assert(InVT.getScalarType() == MVT::i8 &&
            "Unexpected scalar element type");
     assert(InVT == Ops[1].getValueType() && "Operands' types mismatch");
     EVT ResVT = EVT::getVectorVT(*DAG.getContext(), MVT::i16,
                                  InVT.getVectorNumElements() / 2);
     return DAG.getNode(X86ISD::VPMADDUBSW, DL, ResVT, Ops[0], Ops[1]);
   };
   return SplitOpsAndApply(DAG, Subtarget, DL, VT, { ZExtIn, SExtIn },
                           PMADDBuilder);
 }

 static SDValue combineTruncate(SDNode *N, SelectionDAG &DAG,
                                const X86Subtarget &Subtarget) {
   EVT VT = N->getValueType(0);
   SDValue Src = N->getOperand(0);
   SDLoc DL(N);

   // Attempt to pre-truncate inputs to arithmetic ops instead.
   if (SDValue V = combineTruncatedArithmetic(N, DAG, Subtarget, DL))
     return V;

   // Try to detect AVG pattern first.
   if (SDValue Avg = detectAVGPattern(Src, VT, DAG, Subtarget, DL))
     return Avg;

   // Try to detect PMADD
   if (SDValue PMAdd = detectPMADDUBSW(Src, VT, DAG, Subtarget, DL))
     return PMAdd;

   // Try to combine truncation with signed/unsigned saturation.
   if (SDValue Val = combineTruncateWithSat(Src, VT, DL, DAG, Subtarget))
@@ 136 lines not shown @@ if (NegMul) {
     case X86ISD::FMSUB: Opcode = X86ISD::FNMSUB; break;
     case X86ISD::STRICT_FMSUB: Opcode = X86ISD::STRICT_FNMSUB; break;
     case X86ISD::FMSUB_RND: Opcode = X86ISD::FNMSUB_RND; break;
     case X86ISD::FNMADD: Opcode = ISD::FMA; break;
     case X86ISD::STRICT_FNMADD: Opcode = ISD::STRICT_FMA; break;
     case X86ISD::FNMADD_RND: Opcode = X86ISD::FMADD_RND; break;
     case X86ISD::FNMSUB: Opcode = X86ISD::FMSUB; break;
     case X86ISD::STRICT_FNMSUB: Opcode = X86ISD::STRICT_FMSUB; break;
     case X86ISD::FNMSUB_RND: Opcode = X86ISD::FMSUB_RND; break;
     }
   }

   if (NegAcc) {
     switch (Opcode) {
     default: llvm_unreachable("Unexpected opcode");
     case ISD::FMA: Opcode = X86ISD::FMSUB; break;
     case ISD::STRICT_FMA: Opcode = X86ISD::STRICT_FMSUB; break;
@@ 15 lines not shown @@ static unsigned negateFMAOpcode(unsigned Opcode, bool NegMul, bool NegAcc,
   }

   if (NegRes) {
     switch (Opcode) {
     // For accuracy reason, we never combine fneg and fma under strict FP.
     default: llvm_unreachable("Unexpected opcode");
     case ISD::FMA: Opcode = X86ISD::FNMSUB; break;
     case X86ISD::FMADD_RND: Opcode = X86ISD::FNMSUB_RND; break;
     case X86ISD::FMSUB: Opcode = X86ISD::FNMADD; break;
     case X86ISD::FMSUB_RND: Opcode = X86ISD::FNMADD_RND; break;
     case X86ISD::FNMADD: Opcode = X86ISD::FMSUB; break;
     case X86ISD::FNMADD_RND: Opcode = X86ISD::FMSUB_RND; break;
     case X86ISD::FNMSUB: Opcode = ISD::FMA; break;
     case X86ISD::FNMSUB_RND: Opcode = X86ISD::FMADD_RND; break;
     }
   }

   return Opcode;
 }

 /// Do target-specific dag combines on floating point negations.
 static SDValue combineFneg(SDNode *N, SelectionDAG &DAG,
                            TargetLowering::DAGCombinerInfo &DCI,
                            const X86Subtarget &Subtarget) {
   EVT OrigVT = N->getValueType(0);
   SDValue Arg = isFNEG(DAG, N);
   if (!Arg)
     return SDValue();

   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
   EVT VT = Arg.getValueType();
   EVT SVT = VT.getScalarType();
   SDLoc DL(N);

   // Let legalize expand this if it isn't a legal type yet.
@@ 125 lines not shown @@ if (!isOneConstant(N->getOperand(1)) || LHS->getOpcode() != X86ISD::SETCC)
     return SDValue();

   X86::CondCode NewCC = X86::GetOppositeBranchCondition(
       X86::CondCode(LHS->getConstantOperandVal(0)));
   SDLoc DL(N);
   return getSETCC(NewCC, LHS->getOperand(1), DL, DAG);
 }

+static SDValue combineXorSubCTLZ(SDNode *N, SelectionDAG &DAG,
+                                 const X86Subtarget &Subtarget) {
+  assert((N->getOpcode() == ISD::XOR || N->getOpcode() == ISD::SUB) &&
+         "Invalid opcode for combing with CTLZ");
+  if (Subtarget.hasFastLZCNT())
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+  if (VT != MVT::i8 && VT != MVT::i16 && VT != MVT::i32 &&
+      (VT != MVT::i64 || !Subtarget.is64Bit()))
+    return SDValue();
+
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+
+  if (N0.getOpcode() != ISD::CTLZ_ZERO_UNDEF &&
+      N1.getOpcode() != ISD::CTLZ_ZERO_UNDEF)
+    return SDValue();
+
+  SDValue OpCTLZ;
+  SDValue OpSizeTM1;
+
+  if (N1.getOpcode() == ISD::CTLZ_ZERO_UNDEF) {
+    OpCTLZ = N1;
+    OpSizeTM1 = N0;
+  } else if (N->getOpcode() == ISD::SUB) {
+    return SDValue();
+  } else {
+    OpCTLZ = N0;
+    OpSizeTM1 = N1;
+  }
+
+  if (!OpCTLZ.hasOneUse())
+    return SDValue();
+  auto *C = dyn_cast<ConstantSDNode>(OpSizeTM1);
+  if (!C)
+    return SDValue();
+
+  if (C->getZExtValue() != uint64_t(OpCTLZ.getValueSizeInBits() - 1))
+    return SDValue();
+  SDLoc DL(N);
+  EVT OpVT = VT;
+  SDValue Op = OpCTLZ.getOperand(0);
+  if (VT == MVT::i8) {
+    // Zero extend to i32 since there is not an i8 bsr.
+    OpVT = MVT::i32;
+    Op = DAG.getNode(ISD::ZERO_EXTEND, DL, OpVT, Op);
+  }
+
+  SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
+  Op = DAG.getNode(X86ISD::BSR, DL, VTs, Op);
+  if (VT == MVT::i8)
+    Op = DAG.getNode(ISD::TRUNCATE, DL, MVT::i8, Op);
+
+  return Op;
+}
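
The arithmetic behind `combineXorSubCTLZ` can be sketched outside the DAG. The helpers below are hypothetical models, not LLVM code. They illustrate, for an 8-bit value, why both `7 ^ ctlz(x)` and `7 - ctlz(x)` can be replaced by zero-extend, `bsr`, truncate (the count lies in [0, 7], so subtracting from the all-ones constant 7 never borrows), and why `ctlz(x) - 7` is not the same thing.

```cpp
#include <cassert>
#include <cstdint>

// Count leading zeros within the low `width` bits. Callers must pass a
// non-zero x, mirroring the precondition of ISD::CTLZ_ZERO_UNDEF.
inline unsigned ctlzNonZero(std::uint32_t x, unsigned width) {
  assert(x != 0 && "zero input is undefined for this model");
  unsigned n = 0;
  for (unsigned bit = width; bit-- > 0 && !((x >> bit) & 1u);)
    ++n;
  return n;
}

// Model of 32-bit `bsr` for a non-zero source: index of the highest set bit.
inline unsigned bsr32(std::uint32_t x) {
  assert(x != 0 && "bsr result is undefined for zero");
  unsigned index = 0;
  while (x >>= 1)
    ++index;
  return index;
}

// The two source patterns the combine accepts, for an i8 value.
inline std::uint8_t xorForm(std::uint8_t x) {
  return static_cast<std::uint8_t>(7u ^ ctlzNonZero(x, 8));
}
inline std::uint8_t subForm(std::uint8_t x) {
  return static_cast<std::uint8_t>(7u - ctlzNonZero(x, 8));
}

// The operand order the combine rejects for SUB: ctlz(x) - 7.
inline std::uint8_t reversedSubForm(std::uint8_t x) {
  return static_cast<std::uint8_t>(ctlzNonZero(x, 8) - 7u);
}

// Replacement sequence for i8: zero-extend to 32 bits, bsr, truncate.
inline std::uint8_t foldedForm(std::uint8_t x) {
  return static_cast<std::uint8_t>(bsr32(static_cast<std::uint32_t>(x)));
}
```

Zero-extending before `bsr` does not change the index of the highest set bit, which is why the i8 case can reuse the 32-bit instruction.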

 static SDValue combineXor(SDNode *N, SelectionDAG &DAG,
                           TargetLowering::DAGCombinerInfo &DCI,
                           const X86Subtarget &Subtarget) {
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   EVT VT = N->getValueType(0);

   // If this is SSE1 only convert to FXOR to avoid scalarization.
@@ 11 lines not shown @@ if (SDValue R = combineBitOpWithMOVMSK(N, DAG))
     return R;

   if (SDValue R = combineBitOpWithShift(N, DAG))
     return R;

   if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, DCI, Subtarget))
     return FPLogic;

+  if (SDValue R = combineXorSubCTLZ(N, DAG, Subtarget))
+    return R;
+
   if (DCI.isBeforeLegalizeOps())
     return SDValue();

   if (SDValue SetCC = foldXor1SetCC(N, DAG))
     return SetCC;

   if (SDValue RV = foldXorTruncShiftIntoCmp(N, DAG))
     return RV;
@@ 2,869 lines not shown @@ if (Op1.getOpcode() == X86ISD::SBB && Op1->hasOneUse() &&
       !(X86::isZeroNode(Op0) && X86::isZeroNode(Op1.getOperand(1)))) {
     assert(!Op1->hasAnyUseOfValue(1) && "Overflow bit in use");
     SDValue ADC = DAG.getNode(X86ISD::ADC, SDLoc(Op1), Op1->getVTList(), Op0,
                               Op1.getOperand(1), Op1.getOperand(2));
     return DAG.getNode(ISD::SUB, SDLoc(N), Op0.getValueType(), ADC.getValue(0),
                        Op1.getOperand(0));
   }

+  if (SDValue V = combineXorSubCTLZ(N, DAG, Subtarget))
+    return V;
+
   return combineAddOrSubToADCOrSBB(N, DAG);
 }

 static SDValue combineVectorCompare(SDNode *N, SelectionDAG &DAG,
                                     const X86Subtarget &Subtarget) {
   MVT VT = N->getSimpleValueType(0);
   SDLoc DL(N);

@@ 2,773 lines not shown @@

llvm/test/CodeGen/X86/clz.ll

@@ 962 lines not shown @@
 ; X86-FASTLZCNT-NEXT: retl
   %or = or i32 %n, 1
   %tmp1 = call i32 @llvm.ctlz.i32(i32 %or, i1 false)
   ret i32 %tmp1
 }

 ; Don't generate any xors when a 'ctlz' intrinsic is actually used to compute
 ; the most significant bit, which is what 'bsr' does natively.
-; FIXME: We should probably select BSR instead of LZCNT in these circumstances.
+; NOTE: We intentionally don't select `bsr` when `fast-lzcnt` is
+; available. This is 1) because `bsr` has some drawbacks including a
+; dependency on dst, 2) very poor performance on some of the
+; `fast-lzcnt` processors, and 3) `lzcnt` runs at ALU latency/throughput
+; so `lzcnt` + `xor` has better throughput than even the 1-uop
+; (1c latency, 1c throughput) `bsr`.
 define i32 @ctlz_bsr(i32 %n) {
 ; X86-LABEL: ctlz_bsr:
 ; X86: # %bb.0:
 ; X86-NEXT: bsrl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: ctlz_bsr:
 ; X64: # %bb.0:
 ; X64-NEXT: bsrl %edi, %eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: ctlz_bsr:
 ; X86-CLZ: # %bb.0:
-; X86-CLZ-NEXT: lzcntl {{[0-9]+}}(%esp), %eax
-; X86-CLZ-NEXT: xorl $31, %eax
+; X86-CLZ-NEXT: bsrl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: ctlz_bsr:
 ; X64-CLZ: # %bb.0:
-; X64-CLZ-NEXT: lzcntl %edi, %eax
-; X64-CLZ-NEXT: xorl $31, %eax
+; X64-CLZ-NEXT: bsrl %edi, %eax
 ; X64-CLZ-NEXT: retq
 ;
 ; X64-FASTLZCNT-LABEL: ctlz_bsr:
 ; X64-FASTLZCNT: # %bb.0:
 ; X64-FASTLZCNT-NEXT: lzcntl %edi, %eax
 ; X64-FASTLZCNT-NEXT: xorl $31, %eax
 ; X64-FASTLZCNT-NEXT: retq
 ;
@@ 435 lines not shown @@
 ; X64: # %bb.0:
 ; X64-NEXT: bsrl %edi, %eax
 ; X64-NEXT: movsbl (%rsi,%rax), %eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: PR47603_zext:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X86-CLZ-NEXT: lzcntl {{[0-9]+}}(%esp), %ecx
-; X86-CLZ-NEXT: xorl $31, %ecx
+; X86-CLZ-NEXT: bsrl {{[0-9]+}}(%esp), %ecx
 ; X86-CLZ-NEXT: movsbl (%eax,%ecx), %eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: PR47603_zext:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: lzcntl %edi, %eax
 ; X64-CLZ-NEXT: xorq $31, %rax
 ; X64-CLZ-NEXT: movsbl (%rsi,%rax), %eax
@@ 107 lines not shown @@
 ; X64-NEXT: movzbl %dil, %eax
 ; X64-NEXT: bsrl %eax, %eax
 ; X64-NEXT: # kill: def $al killed $al killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: ctlz_xor7_i8_true:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: movzbl {{[0-9]+}}(%esp), %eax
-; X86-CLZ-NEXT: lzcntl %eax, %eax
-; X86-CLZ-NEXT: addl $-24, %eax
-; X86-CLZ-NEXT: xorb $7, %al
+; X86-CLZ-NEXT: bsrl %eax, %eax
 ; X86-CLZ-NEXT: # kill: def $al killed $al killed $eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: ctlz_xor7_i8_true:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: movzbl %dil, %eax
-; X64-CLZ-NEXT: lzcntl %eax, %eax
-; X64-CLZ-NEXT: addl $-24, %eax
-; X64-CLZ-NEXT: xorb $7, %al
+; X64-CLZ-NEXT: bsrl %eax, %eax
 ; X64-CLZ-NEXT: # kill: def $al killed $al killed $eax
 ; X64-CLZ-NEXT: retq
 ;
 ; X64-FASTLZCNT-LABEL: ctlz_xor7_i8_true:
 ; X64-FASTLZCNT: # %bb.0:
 ; X64-FASTLZCNT-NEXT: movzbl %dil, %eax
 ; X64-FASTLZCNT-NEXT: lzcntl %eax, %eax
 ; X64-FASTLZCNT-NEXT: addl $-24, %eax
@@ 98 lines not shown @@
 ;
 ; X64-LABEL: ctlz_xor15_i16_true:
 ; X64: # %bb.0:
 ; X64-NEXT: bsrw %di, %ax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: ctlz_xor15_i16_true:
 ; X86-CLZ: # %bb.0:
-; X86-CLZ-NEXT: lzcntw {{[0-9]+}}(%esp), %ax
-; X86-CLZ-NEXT: xorl $15, %eax
-; X86-CLZ-NEXT: # kill: def $ax killed $ax killed $eax
+; X86-CLZ-NEXT: bsrw {{[0-9]+}}(%esp), %ax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: ctlz_xor15_i16_true:
 ; X64-CLZ: # %bb.0:
-; X64-CLZ-NEXT: lzcntw %di, %ax
-; X64-CLZ-NEXT: xorl $15, %eax
-; X64-CLZ-NEXT: # kill: def $ax killed $ax killed $eax
+; X64-CLZ-NEXT: bsrw %di, %ax
 ; X64-CLZ-NEXT: retq
 ;
 ; X64-FASTLZCNT-LABEL: ctlz_xor15_i16_true:
 ; X64-FASTLZCNT: # %bb.0:
 ; X64-FASTLZCNT-NEXT: lzcntw %di, %ax
 ; X64-FASTLZCNT-NEXT: xorl $15, %eax
 ; X64-FASTLZCNT-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-FASTLZCNT-NEXT: retq
@@ 118 lines not shown @@
 ; X86-CLZ-NEXT: lzcntl %eax, %eax
 ; X86-CLZ-NEXT: .LBB31_3:
 ; X86-CLZ-NEXT: xorl $63, %eax
 ; X86-CLZ-NEXT: xorl %edx, %edx
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: ctlz_xor63_i64_true:
 ; X64-CLZ: # %bb.0:
-; X64-CLZ-NEXT: lzcntq %rdi, %rax
-; X64-CLZ-NEXT: xorq $63, %rax
+; X64-CLZ-NEXT: bsrq %rdi, %rax
 ; X64-CLZ-NEXT: retq
 ;
 ; X64-FASTLZCNT-LABEL: ctlz_xor63_i64_true:
 ; X64-FASTLZCNT: # %bb.0:
 ; X64-FASTLZCNT-NEXT: lzcntq %rdi, %rax
 ; X64-FASTLZCNT-NEXT: xorq $63, %rax
 ; X64-FASTLZCNT-NEXT: retq
 ;
@@ 19 lines not shown @@

This is an archive of the discontinued LLVM Phabricator instance.

[X86]: Match (xor TSize - 1, ctlz) to `bsr` instead of `lzcnt` + `xor`
ClosedPublic
