This is an archive of the discontinued LLVM Phabricator instance.

Differential D22038

[X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture.
Abandoned, Public

Authored by pgousseau on Jul 6 2016, 3:08 AM.

Details

Reviewers

spatel
qcolombet
RKSimon
andreadb

Summary

Hi All,

I would like to propose a change that turns zext+seteq+cmp into shr+lzcnt.
This optimisation is beneficial only on the Jaguar architecture, where lzcnt has good reciprocal throughput.
Other architectures, such as Intel's Haswell/Broadwell or AMD's Bulldozer/Piledriver, do not benefit from it.
For this reason the change also adds a "HasFastLZCNT" feature, which is enabled for Jaguar.
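
As an illustration of why the rewrite is valid (a minimal sketch, not part of the patch): LZCNT returns the operand width when its input is zero and a strictly smaller count otherwise, so shifting the count right by log2(width) leaves exactly zext(x == 0). The helper names `lzcnt` and `isZeroViaLzcnt` are hypothetical, and the loop is a portable software model of the instruction rather than the instruction itself.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Software model of LZCNT for an unsigned type T (hypothetical helper).
// Counts leading zero bits and returns the full bit width when x == 0,
// which is the property the transform relies on.
template <typename T>
unsigned lzcnt(T x) {
  const unsigned width = std::numeric_limits<T>::digits;
  unsigned n = 0;
  while (n < width && ((x >> (width - 1 - n)) & 1) == 0)
    ++n;
  return n;
}

// zext(x == 0) rewritten as lzcnt(x) >> log2(width).
// The shift is 4 for 16-bit, 5 for 32-bit and 6 for 64-bit operands.
template <typename T>
unsigned isZeroViaLzcnt(T x) {
  const unsigned width = std::numeric_limits<T>::digits;
  unsigned shift = 0;
  while ((1u << shift) < width)
    ++shift;
  assert((1u << shift) == width && "width must be a power of two");
  return lzcnt(x) >> shift;
}
```

For example, `isZeroViaLzcnt<std::uint32_t>(0)` evaluates to 1 and `isZeroViaLzcnt<std::uint32_t>(7)` evaluates to 0.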

Diff Detail

Event Timeline

pgousseau updated this revision to Diff 62837. Jul 6 2016, 3:08 AM

pgousseau retitled this revision to [X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture.

pgousseau updated this object.

pgousseau added reviewers: qcolombet, RKSimon, spatel, andreadb.

pgousseau added a subscriber: llvm-commits.

gbedwell added a subscriber: gbedwell. Jul 6 2016, 5:36 AM

This seems to be limited to (a==0) and ((a== 0) || (b== 0)) patterns - is that the best way to do this? Can this be easily compounded to support different numbers of tests?

lib/Target/X86/X86Subtarget.h, line 137: This isn't a processor/cpuid feature, please move this further down to be closer to the other fast/slow characteristic features.
test/CodeGen/X86/lzcnt-zext-cmp.ll, line 2: Please regenerate with utils/update_llc_test_checks.py

Following Simon's comments:

Moved the new flag declaration next to the other declarations with a similar meaning.
Regenerated the test's assertions with update_llc_test_checks.py.

In D22038#475155, @RKSimon wrote:

This seems to be limited to (a==0) and ((a== 0) || (b== 0)) patterns - is that the best way to do this? Can this be easily compounded to support different numbers of tests?

Yes, I was hoping to handle the generic case initially. I had a version implemented in X86ISelLowering that handled the case (ak == 0 || ... || an == 0), but it caused some optimisation opportunities in h264's hot functions to be missed. I am not quite sure why this happened; it seems that optimisations occurring before instruction selection undo the lzcnt + shr optimisation.
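
The generic case mentioned above can be modelled in scalar form. The sketch below is only an assumption about what such a fold computes, not the X86ISelLowering version described in the reply: every count is at most 32, so bit 5 of the OR of all counts is set exactly when at least one operand is zero. The names `lzcnt32` and `anyZeroViaLzcnt` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Software model of 32-bit LZCNT (hypothetical helper); returns 32 for x == 0.
inline unsigned lzcnt32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// (a1 == 0 || ... || an == 0) computed as one shift over the OR of all counts.
inline unsigned anyZeroViaLzcnt(std::initializer_list<std::uint32_t> values) {
  unsigned acc = 0;
  for (std::uint32_t v : values)
    acc |= lzcnt32(v);
  // Each count is at most 32, so the OR always fits in 6 bits.
  assert(acc < 64);
  return acc >> 5;
}
```

With two operands this is the same lzcnt, lzcnt, or, shr sequence that the patch's OR patterns emit; each extra operand costs one more lzcnt and one more or.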

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

rob.lougher added a subscriber: rob.lougher. Jul 6 2016, 10:54 AM

In D22038#475372, @spatel wrote:

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

Regarding the size, I have noticed that openssl does get smaller but I would need to do more testing to ensure this is always the case.

In D22038#476520, @pgousseau wrote:

In D22038#475372, @spatel wrote:

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

You should keep the HasFastLZCNT attribute that you already have proposed, and the hook will trigger off of that. We don't want to base transforms on CPU models; that doesn't evolve well.
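
To make the contrast concrete, here is a toy model of the two approaches. Everything in it is hypothetical: `ToySubtarget`, `isCTLZFast` and `isCTLZFastByName` are stand-ins written for illustration and are not LLVM APIs. Only the feature strings "lzcnt" and "fastlzcnt" come from the patch.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a subtarget: a CPU name plus its feature strings.
struct ToySubtarget {
  std::string CPU;
  std::set<std::string> Features;

  bool hasLZCNT() const { return Features.count("lzcnt") != 0; }
  bool hasFastLZCNT() const { return Features.count("fastlzcnt") != 0; }
};

// Hook keyed off the tuning feature: any CPU that lists "fastlzcnt" opts in.
inline bool isCTLZFast(const ToySubtarget &ST) {
  return ST.hasLZCNT() && ST.hasFastLZCNT();
}

// The CPU-name variant from the discussion, shown for contrast only.
// Every new CPU with a fast LZCNT would need another comparison here.
inline bool isCTLZFastByName(const ToySubtarget &ST) {
  return ST.hasLZCNT() && ST.CPU == "btver2";
}
```

In this model a later CPU that lists both features is picked up by `isCTLZFast` without any change to the hook, whereas `isCTLZFastByName` keeps returning false until someone edits it.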

In D22038#476724, @spatel wrote:

In D22038#476520, @pgousseau wrote:

In D22038#475372, @spatel wrote:

Did you look at doing this in DAGCombine with a TLI hook? PPC already has this or something very close in PPCTargetLowering::LowerSETCC(), so it seems like the code could simply be lifted up from there?

Would x86 targets other than btver2 want to do this transform when optimizing for size?

I did not know there was a similar change in PPCTargetLowering, thanks for pointing it out, I will investigate this.
For the TLI hook I could add a method 'bool isCTLZFast()', which on X86, I suppose would be something like:
bool isCTLZFast() { return (hasLZCNT() && getCPU() == "btver2");}
It looks a bit less maintainable than the Target feature way though, what do you think?

You should keep the HasFastLZCNT attribute that you already have proposed, and the hook will trigger off of that. We don't want to base transforms on CPU models; that doesn't evolve well.

Makes sense yes thanks. This should be done in 3 patches I think.
The first patch will be NFC, adding the "isCTLZFast" TLI hook, enabling it for PPC and moving the PPC specific code to DAGCombiner.
The second patch will enable "isCTLZFast" for X86.
The third patch will add the transformation for the OR case.
I think this is what you and Simon are suggesting?
Will experiment a bit with it and abandon this review once the first patch is ready.

In D22038#477768, @pgousseau wrote:

Makes sense yes thanks. This should be done in 3 patches I think.
The first patch will be NFC, adding the "isCTLZFast" TLI hook, enabling it for PPC and moving the PPC specific code to DAGCombiner.
The second patch will enable "isCTLZFast" for X86.
The third patch will add the transformation for the OR case.
I think this is what you and Simon are suggesting?

That sounds like a good plan to me.

pgousseau mentioned this in D23445: [x86] Refactor a PowerPC specific ctlz/srl transformation (NFC). Aug 12 2016, 5:24 AM

pgousseau mentioned this in D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures. Aug 12 2016, 5:27 AM

pgousseau mentioned this in rL278799: [x86] Refactor a PowerPC specific ctlz/srl transformation (NFC). Aug 16 2016, 7:01 AM

Abandon this now that D23446 is committed?

In D22038#571614, @RKSimon wrote:

Abandon this now that D23446 is committed?

Ah yes! Thanks for reminding me.

Revision Contents

Path                                    Size
lib/Target/X86/X86.td                   4 lines
lib/Target/X86/X86InstrInfo.td          1 line
lib/Target/X86/X86InstrShiftRotate.td   80 lines
lib/Target/X86/X86Subtarget.h           4 lines
lib/Target/X86/X86Subtarget.cpp         1 line
test/CodeGen/X86/lzcnt-zext-cmp.ll      171 lines

Diff 62882

lib/Target/X86/X86.td

@@ ... @@ def FeatureRDRAND : SubtargetFeature<"rdrnd", "HasRDRAND", "true",
     "Support RDRAND instruction">;
 def FeatureF16C : SubtargetFeature<"f16c", "HasF16C", "true",
     "Support 16-bit floating point conversion instructions",
     [FeatureAVX]>;
 def FeatureFSGSBase : SubtargetFeature<"fsgsbase", "HasFSGSBase", "true",
     "Support FS/GS Base instructions">;
 def FeatureLZCNT : SubtargetFeature<"lzcnt", "HasLZCNT", "true",
     "Support LZCNT instruction">;
+// On some architectures, such as AMD's Jaguar, LZCNT is fast.
+def FeatureFastLZCNT : SubtargetFeature<"fastlzcnt", "HasFastLZCNT", "true",
+    "LZCNT instructions are fast">;
 def FeatureBMI : SubtargetFeature<"bmi", "HasBMI", "true",
     "Support BMI instructions">;
 def FeatureBMI2 : SubtargetFeature<"bmi2", "HasBMI2", "true",
     "Support BMI2 instructions">;
 def FeatureRTM : SubtargetFeature<"rtm", "HasRTM", "true",
     "Support RTM instructions">;
 def FeatureHLE : SubtargetFeature<"hle", "HasHLE", "true",
     "Support HLE">;
@@ ... @@ def : ProcessorModel<"btver2", BtVer2Model, [
   FeatureCMPXCHG16B,
   FeaturePRFCHW,
   FeatureAES,
   FeaturePCLMUL,
   FeatureBMI,
   FeatureF16C,
   FeatureMOVBE,
   FeatureLZCNT,
+  FeatureFastLZCNT,
   FeaturePOPCNT,
   FeatureXSAVE,
   FeatureXSAVEOPT,
   FeatureSlowSHLD,
   FeatureLAHFSAHF,
   FeatureFastPartialYMMWrite
 ]>;

lib/Target/X86/X86InstrInfo.td

@@ ... @@
 def HasFMA4 : Predicate<"Subtarget->hasFMA4()">;
 def HasXOP : Predicate<"Subtarget->hasXOP()">;
 def HasTBM : Predicate<"Subtarget->hasTBM()">;
 def HasMOVBE : Predicate<"Subtarget->hasMOVBE()">;
 def HasRDRAND : Predicate<"Subtarget->hasRDRAND()">;
 def HasF16C : Predicate<"Subtarget->hasF16C()">;
 def HasFSGSBase : Predicate<"Subtarget->hasFSGSBase()">;
 def HasLZCNT : Predicate<"Subtarget->hasLZCNT()">;
+def HasFastLZCNT : Predicate<"Subtarget->hasFastLZCNT()">;
 def HasBMI : Predicate<"Subtarget->hasBMI()">;
 def HasBMI2 : Predicate<"Subtarget->hasBMI2()">;
 def HasVBMI : Predicate<"Subtarget->hasVBMI()">,
     AssemblerPredicate<"FeatureVBMI", "AVX-512 VBMI ISA">;
 def HasIFMA : Predicate<"Subtarget->hasIFMA()">,
     AssemblerPredicate<"FeatureIFMA", "AVX-512 IFMA ISA">;
 def HasRTM : Predicate<"Subtarget->hasRTM()">;
 def HasHLE : Predicate<"Subtarget->hasHLE()">;

lib/Target/X86/X86InstrShiftRotate.td

@@ ... @@ let Predicates = [HasBMI2] in {
   // over
   //
   // movb $imm %al
   // shlx %al, (%ecx), %esi
   //
   // As SARXrr/SHRXrr/SHLXrr is favored on variable shift, the peephole
   // optimization will fold them into SARXrm/SHRXrm/SHLXrm if possible.
 }
+
+let Predicates = [HasLZCNT, HasFastLZCNT] in {
+
+  // Transform comparisons with 0, followed by a zero extend,
+  // into lzcnt + shift:
+  // Eg:
+  //
+  // testl %edi, %edi
+  // sete %al
+  // movzbl %al, %eax
+  //
+  // into
+  //
+  // lzcntl %edi, %eax
+  // shrl $5, %eax
+  //
+
+  // Shift by 4 for 16-bits flavor.
+  def : Pat<(zext (X86setcc X86_COND_E, (X86cmp GR16:$src, (i16 0)))),
+            (SHR16ri (LZCNT16rr GR16:$src), (i8 4))>;
+
+  // Shift by 5 for 32-bits flavor.
+  def : Pat<(zext (X86setcc X86_COND_E, (X86cmp GR32:$src, (i32 0)))),
+            (SHR32ri (LZCNT32rr GR32:$src), (i8 5))>;
+
+  // Shift by 6 for 64-bits flavor.
+  def : Pat<(zext (X86setcc X86_COND_E, (X86cmp GR64:$src, (i64 0)))),
+            (SHR64ri (LZCNT64rr GR64:$src), (i8 6))>;
+
+  // Input is 64-bit, result is 32-bit.
+  def : Pat<(i32 (zext (X86setcc X86_COND_E, (X86cmp GR64:$src, (i64 0))))),
+            (EXTRACT_SUBREG
+              (SHR64ri (LZCNT64rr GR64:$src), (i8 6)),
+              sub_32bit)>;
+
+  // Input is 32-bit, result is 64-bit.
+  def : Pat<(i64 (zext (X86setcc X86_COND_E, (X86cmp GR32:$src, (i32 0))))),
+            (SUBREG_TO_REG
+              (i64 0),
+              (SHR32ri (LZCNT32rr GR32:$src), (i8 5)),
+              sub_32bit)>;
+
+  // Transform 2 OR'ed comparisons with 0, followed by a zero extend,
+  // into lzcnt + shift.
+  //
+  // Eg:
+  //
+  // testl %edi, %edi
+  // sete %al
+  // testl %esi, %esi
+  // sete %cl
+  // orb %al, %cl
+  // movzbl %cl, %eax
+  //
+  // into
+  //
+  // lzcntl %edi, %ecx
+  // lzcntl %esi, %eax
+  // orl %ecx, %eax
+  // shrl $5, %eax
+  //
+
+  def : Pat<(zext (or (X86setcc X86_COND_E, (X86cmp GR16:$src1, (i16 0))),
+                      (X86setcc X86_COND_E, (X86cmp GR16:$src2, (i16 0))))),
+            (SHR16ri (OR16rr (LZCNT16rr GR16:$src1),
+                             (LZCNT16rr GR16:$src2)),
+                     (i8 4))>;
+
+  def : Pat<(zext (or (X86setcc X86_COND_E, (X86cmp GR32:$src1, (i32 0))),
+                      (X86setcc X86_COND_E, (X86cmp GR32:$src2, (i32 0))))),
+            (SHR32ri (OR32rr (LZCNT32rr GR32:$src1),
+                             (LZCNT32rr GR32:$src2)),
+                     (i8 5))>;
+
+  def : Pat<(zext (or (X86setcc X86_COND_E, (X86cmp GR64:$src1, (i64 0))),
+                      (X86setcc X86_COND_E, (X86cmp GR64:$src2, (i64 0))))),
+            (SHR64ri (OR64rr (LZCNT64rr GR64:$src1),
+                             (LZCNT64rr GR64:$src2)),
+                     (i8 6))>;
+}

lib/Target/X86/X86Subtarget.h

@@ ... @@ protected:
   /// Processor has FS/GS base insturctions.
   bool HasFSGSBase;

   /// Processor has LZCNT instruction.
   bool HasLZCNT;

   /// Processor has BMI1 instructions.
   bool HasBMI;

   /// Processor has BMI2 instructions.
   bool HasBMI2;

   /// Processor has VBMI instructions.
   bool HasVBMI;

   /// Processor has Integer Fused Multiply Add
   bool HasIFMA;
@@ ... @@ protected:
   /// True if 8-bit divisions are significantly faster than
   /// 32-bit divisions and should be used when possible.
   bool HasSlowDivide32;

   /// True if 16-bit divides are significantly faster than
   /// 64-bit divisions and should be used when possible.
   bool HasSlowDivide64;

+  /// True if LZCNT instruction is fast.
+  bool HasFastLZCNT;
+
   /// True if the short functions should be padded to prevent
   /// a stall when returning too early.
   bool PadShortFunctions;

   /// True if the Calls with memory reference should be converted
   /// to a register-based indirect call.
   bool CallRegIndirect;

@@ ... @@ public:
   bool hasAnyFMA() const { return hasFMA() || hasFMA4() || hasAVX512(); }
   bool hasXOP() const { return HasXOP; }
   bool hasTBM() const { return HasTBM; }
   bool hasMOVBE() const { return HasMOVBE; }
   bool hasRDRAND() const { return HasRDRAND; }
   bool hasF16C() const { return HasF16C; }
   bool hasFSGSBase() const { return HasFSGSBase; }
   bool hasLZCNT() const { return HasLZCNT; }
+  bool hasFastLZCNT() const { return HasFastLZCNT; }
   bool hasBMI() const { return HasBMI; }
   bool hasBMI2() const { return HasBMI2; }
   bool hasVBMI() const { return HasVBMI; }
   bool hasIFMA() const { return HasIFMA; }
   bool hasRTM() const { return HasRTM; }
   bool hasHLE() const { return HasHLE; }
   bool hasADX() const { return HasADX; }
   bool hasSHA() const { return HasSHA; }

lib/Target/X86/X86Subtarget.cpp

@@ ... @@ void X86Subtarget::initializeEnvironment() {
   HasFMA4 = false;
   HasXOP = false;
   HasTBM = false;
   HasMOVBE = false;
   HasRDRAND = false;
   HasF16C = false;
   HasFSGSBase = false;
   HasLZCNT = false;
+  HasFastLZCNT = false;
   HasBMI = false;
   HasBMI2 = false;
   HasVBMI = false;
   HasIFMA = false;
   HasRTM = false;
   HasHLE = false;
   HasERI = false;
   HasCDI = false;

test/CodeGen/X86/lzcnt-zext-cmp.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; Test patterns which generate lzcnt instructions.
				; Eg: zext(setcc(cmp)) -> shr(lzcnt)
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mcpu=btver2 | FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mattr=+lzcnt -mcpu=haswell | FileCheck --check-prefix=NOFASTLZCNT %s

				define i32 @foo1(i32 %a) {
				; CHECK-LABEL: foo1:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: foo1:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i32 %a, 0
				%conv = zext i1 %cmp to i32
				ret i32 %conv

				}

				define i64 @foo2(i32 %a) {
				; CHECK-LABEL: foo2:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: foo2:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i32 %a, 0
				%conv1 = zext i1 %cmp to i64
				ret i64 %conv1
				}

				define i64 @foo3(i64 %a) {
				; CHECK-LABEL: foo3:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntq %rdi, %rax
				; CHECK-NEXT: shrq $6, %rax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: foo3:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i64 %a, 0
				%conv1 = zext i1 %cmp to i64
				ret i64 %conv1
				}

				define i32 @foo4(i64 %a) {
				; CHECK-LABEL: foo4:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntq %rdi, %rax
				; CHECK-NEXT: shrq $6, %rax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: foo4:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i64 %a, 0
				%conv1 = zext i1 %cmp to i32
				ret i32 %conv1
				}

				define i16 @foo5(i16 %a) {
				; CHECK-LABEL: foo5:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntw %di, %ax
				; CHECK-NEXT: shrw $4, %ax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: foo5:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testw %di, %di
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i16 %a, 0
				%conv = zext i1 %cmp to i16
				ret i16 %conv
				}

				define i32 @bar1(i32 %a, i32 %b) {
				; CHECK-LABEL: bar1:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntl %esi, %ecx
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar1:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i32
				ret i32 %lor.ext
				}

				define i64 @bar2(i64 %a, i64 %b) {
				; CHECK-LABEL: bar2:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntq %rsi, %rcx
				; CHECK-NEXT: lzcntq %rdi, %rax
				; CHECK-NEXT: orq %rcx, %rax
				; CHECK-NEXT: shrq $6, %rax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar2:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i64 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i64
				ret i64 %lor.ext
				}

				define i16 @bar3(i16 %a, i16 %b) {
				; CHECK-LABEL: bar3:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntw %si, %cx
				; CHECK-NEXT: lzcntw %di, %ax
				; CHECK-NEXT: orw %cx, %ax
				; CHECK-NEXT: shrw $4, %ax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar3:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testw %di, %di
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testw %si, %si
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i16 %a, 0
				%cmp1 = icmp eq i16 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i16
				ret i16 %lor.ext
				}