This is an archive of the discontinued LLVM Phabricator instance.


Differential D130956

[X86][MC] Always emit `rep` prefix for `bsf`
ClosedPublic

Authored by pengfei on Aug 1 2022, 7:02 PM.


Details

Reviewers

nikic
RKSimon
craig.topper
skan

Commits

rG7f648d27a85a: Reland "[X86][MC] Always emit `rep` prefix for `bsf`"
rGc2066d19cda2: [X86][MC] Always emit `rep` prefix for `bsf`

Summary

The BMI instruction tzcnt performs better than bsf on newer processors. Its
encoding is the bsf encoding plus a mandatory 0xf3 prefix, which is the rep
prefix. Processors without BMI ignore the prefix and still execute bsf. If we
always emit the rep prefix for bsf, the same code gains performance when it
runs on newer processors.

GCC already does this: https://c.godbolt.org/z/6xere6fs1

Fixes #34191
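
To make the encoding relationship in the summary concrete, here is a minimal C++ sketch. It is an illustration only and is not part of the patch: the `Encoding` struct and the `withRepPrefix` helper are hypothetical, and `bsfl %edi, %eax` (bytes 0F BC C7) is an operand pair chosen here as an assumed example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical container for a short instruction encoding (at most 4 bytes).
struct Encoding {
  std::uint8_t Bytes[4];
  std::size_t Size;
};

// The REP prefix byte. TZCNT uses the BSF opcode with this byte in front.
constexpr std::uint8_t kRepPrefix = 0xF3;

// `bsfl %edi, %eax` in AT&T syntax: opcode 0F BC, ModRM C7.
constexpr Encoding kBsfEdiToEax = {{0x0F, 0xBC, 0xC7, 0x00}, 3};

// Prepend the REP prefix. This sketch only handles inputs of up to 3 bytes.
constexpr Encoding withRepPrefix(const Encoding &E) {
  return Encoding{{kRepPrefix, E.Bytes[0], E.Bytes[1], E.Bytes[2]},
                  E.Size + 1};
}
```

The prefixed form is one byte longer, which fits the later decision in this review to skip the prefix for size-optimized functions.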

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pengfei created this revision. Aug 1 2022, 7:02 PM

Herald added a project: Restricted Project. Aug 1 2022, 7:02 PM

Herald added subscribers: StephenFan, hiraditya.

pengfei requested review of this revision. Aug 1 2022, 7:02 PM

Herald added a project: Restricted Project. Aug 1 2022, 7:02 PM

Herald added a subscriber: llvm-commits.

The assembler should be faithful to the original source. This needs to be done at codegen only, I would think.

We also need to print the rep in the assembly listing so that -fno-integrated-as works.

In D130956#3692469, @craig.topper wrote:

The assembler should be faithful to the original source. This needs to be done at codegen only, I would think.

I see. How about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

I believe Linus's suggestion is to add a prefix to BSF rather than to change the encoding.
Maybe you can consider implementing it in X86MCInstLower::Lower by using the Flags of MCInst.

In D130956#3692473, @pengfei wrote:

In D130956#3692469, @craig.topper wrote:

The assembler should be faithful to the original source. This needs to be done at codegen only, I would think.

I see. How about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

No, I don't think that is a good way. It breaks support for the original bsf.

This basically needs to be done with a CodeGenOnly pseudo instruction. We can't change assembler or disassembler behavior.

In D130956#3692473, @pengfei wrote:

In D130956#3692469, @craig.topper wrote:

The assembler should be faithful to the original source. This needs to be done at codegen only, I would think.

I see. How about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

Oh, we can't do it, considering the decoding. I'll reconsider the solution. Thanks!

Maybe you can consider implementing it in X86MCInstLower::Lower by using the Flags of MCInst.

Thanks for the suggestion, let me give it a try.

Harbormaster completed remote builds in B178674: Diff 449166. Aug 1 2022, 7:41 PM

Having compared both methods, adding the flag in X86MCInstLower::Lower seems more convenient. Thanks Craig and Shengchen!

nikic added inline comments. Aug 2 2022, 12:26 AM

llvm/lib/Target/X86/X86MCInstLower.cpp
986	Add a comment to explain why we do this?

Harbormaster completed remote builds in B178698: Diff 449198. Aug 2 2022, 12:42 AM

Add comment. Thanks @nikic

skan added inline comments. Aug 2 2022, 12:54 AM

llvm/lib/Target/X86/X86MCInstLower.cpp
987	Do we need to check any target feature here?

craig.topper added inline comments. Aug 2 2022, 1:01 AM

llvm/test/CodeGen/X86/clz.ll
49–50	We need to promote bsfw to bsfl. tzcntl is faster than tzcntw. tzcntw has a false dependency to preserve the upper 48 bits of the result register on Intel CPUs. But separate patch please.

craig.topper added inline comments. Aug 2 2022, 1:05 AM

llvm/lib/Target/X86/X86MCInstLower.cpp
987	I don’t think so. Probably should skip for minsize though?

skan added inline comments. Aug 2 2022, 1:09 AM

llvm/lib/Target/X86/X86MCInstLower.cpp
987	I think it's a good suggestion.

Add check for minsize.

llvm/lib/Target/X86/X86MCInstLower.cpp
987	Agreed. And good point! Added check for minsize.

By the way, the original bug for this is #34191.

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	Goes a bit too far I think. We can turn a generic builtin into `rep; bsf`, but if the inline assembly explicitly asks for `bsf` I think we should emit that.

pengfei added inline comments. Aug 2 2022, 1:32 AM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	You are right. So this is to check that no REP prefix was generated. :)

Harbormaster completed remote builds in B178710: Diff 449214. Aug 2 2022, 1:58 AM

pengfei added inline comments. Aug 2 2022, 2:00 AM

llvm/test/CodeGen/X86/clz.ll
49–50	Are we allowed to always do that? The result differs for tzcnt if we care about the case where the src is 0: https://godbolt.org/z/enahabbM7

pengfei edited the summary of this revision. Aug 2 2022, 2:26 AM

craig.topper added inline comments. Aug 2 2022, 2:34 AM

llvm/test/CodeGen/X86/clz.ll
49–50	I meant in SelectionDAG.
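
The concern about a zero source can be illustrated with a small software model. This is a sketch under assumptions, not code from this review: `tzcntModel` is a hypothetical helper that mimics the documented TZCNT result (the operand width for a zero input), and `tzcnt16Promoted` shows one possible way a promotion could keep the 16-bit result by setting bit 16 first. Whether the follow-up patch does exactly that is not stated here.

```cpp
#include <cstdint>

// Software model of TZCNT: count trailing zero bits and return the operand
// width when the input is zero.
constexpr unsigned tzcntModel(std::uint64_t Value, unsigned Width) {
  unsigned Count = 0;
  while (Count < Width && ((Value >> Count) & 1u) == 0)
    ++Count;
  return Count;
}

// 16-bit count using the 16-bit form of the instruction.
constexpr unsigned tzcnt16(std::uint16_t X) { return tzcntModel(X, 16); }

// 16-bit count done naively with the 32-bit form on the zero-extended value.
constexpr unsigned tzcnt16ViaTzcnt32(std::uint16_t X) {
  return tzcntModel(X, 32);
}

// Assumed fix-up: set bit 16 so that a zero input still yields 16.
constexpr unsigned tzcnt16Promoted(std::uint16_t X) {
  return tzcntModel(static_cast<std::uint32_t>(X) | 0x10000u, 32);
}
```

For any nonzero input the three functions agree. Only the zero input separates them: `tzcnt16(0)` is 16 while `tzcnt16ViaTzcnt32(0)` is 32.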

RKSimon added inline comments. Aug 2 2022, 3:08 AM

llvm/test/CodeGen/X86/clz.ll
1128	Can you please add a minsize test as well?

Add minsize test.

xbolva00 added a subscriber: xbolva00. Aug 2 2022, 3:41 AM

Harbormaster completed remote builds in B178733: Diff 449244. Aug 2 2022, 4:52 AM

craig.topper mentioned this in D130995: [X86] Promote i16 CTTZ/CTTZ_ZERO_UNDEF always. Aug 2 2022, 9:25 AM

nickdesaulniers added a subscriber: nickdesaulniers. Aug 2 2022, 11:45 AM

aaronpuchert added inline comments. Aug 2 2022, 3:32 PM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	Sorry, I missed the `-NOT`. But is that needed? After all the `-NEXT` shouldn't match if there is a `rep` in between. If there is a reason to keep this, the second occurrence should likely be `CHECK64` instead of `CHECK32`.

skan added inline comments. Aug 2 2022, 6:40 PM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	I think the two `CHECK-NOT` are redundant here b/c `CHECK-NEXT` can guarantee `#APP` is followed by `bsfl`.

craig.topper added inline comments. Aug 2 2022, 7:24 PM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	Doesn't the CHECK-NEXT only guarantee that the next line contains bsfl? It doesn't rule out any text before the bsfl.

skan added inline comments. Aug 2 2022, 7:45 PM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	I think if there is a `rep` between `#APP` and `#NO_APP`, the following check will fail:
; CHECK32-NEXT: #APP
; CHECK32-NEXT: bsfl %edx, %edx
; CHECK32-NEXT: #NO_APP

craig.topper added inline comments. Aug 2 2022, 8:01 PM

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll

376

The test still passes with this change

diff --git a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
index f3d4b6221d08..4c7094d5c5f0 100644
--- a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
+++ b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
@@ -371,7 +371,7 @@ define i1 @asm_clobbering_flags(ptr %mem) nounwind {
 entry:
   %val = load i32, ptr %mem, align 4
   %cmp = icmp sgt i32 %val, 0
-  %res = tail call i32 asm "bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
+  %res = tail call i32 asm "rep bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
   store i32 %res, ptr %mem, align 4
   ret i1 %cmp

That puts rep on the same line before bsfl in the output. Every test change in this review shows rep on the same line as the bsfl.
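
The point about `CHECK-NEXT` can be sketched as a substring match. The model below is a simplification and an assumption about the relevant behavior: it treats a check pattern as a fixed string that may match anywhere on a line, and it ignores FileCheck's regex and whitespace handling. The function names are hypothetical.

```cpp
// Returns true if Text begins with Pattern.
constexpr bool startsWith(const char *Text, const char *Pattern) {
  while (*Pattern != '\0') {
    if (*Text != *Pattern)
      return false;
    ++Text;
    ++Pattern;
  }
  return true;
}

// Simplified check: the pattern may match anywhere on the line.
constexpr bool lineMatches(const char *Line, const char *Pattern) {
  while (true) {
    if (startsWith(Line, Pattern))
      return true;
    if (*Line == '\0')
      return false;
    ++Line;
  }
}

// Simplified pairing of a NOT check with the following NEXT check.
constexpr bool nextWithNotRep(const char *Line, const char *Pattern) {
  return !lineMatches(Line, "rep") && lineMatches(Line, Pattern);
}
```

Under this model, `lineMatches("rep bsfl %edx, %edx", "bsfl %edx, %edx")` is true, so a plain next-line check does not notice the prefix, while `nextWithNotRep` rejects the same line.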

Fix the check prefix error.

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376	Craig is correct. We'd better add an explicit check to avoid a false negative, though I think regenerating the checks with the tool will emit the `rep` in the opposite case.

Harbormaster completed remote builds in B178935: Diff 449525. Aug 2 2022, 9:10 PM

LGTM

This revision is now accepted and ready to land. Aug 2 2022, 11:08 PM

This revision was landed with ongoing or failed builds. Aug 3 2022, 2:09 AM

Closed by commit rGc2066d19cda2: [X86][MC] Always emit `rep` prefix for `bsf` (authored by pengfei).

This revision was automatically updated to reflect the committed changes.

pengfei added a commit: rGc2066d19cda2: [X86][MC] Always emit `rep` prefix for `bsf`.

craig.topper mentioned this in rGff91b2d9df80: [X86] Promote i16 CTTZ/CTTZ_ZERO_UNDEF always. Aug 3 2022, 1:12 PM

This seems related to an error we see on our sanitizer bots: https://lab.llvm.org/buildbot/#/builders/37/builds/15446/steps/33/logs/stdio

FAIL: Builtins-i386-linux :: ffssi2_test.c (2860 of 8387)
******************** TEST 'Builtins-i386-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m32 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-i386.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.
FAIL: Builtins-x86_64-linux :: ffssi2_test.c (3084 of 8387)
******************** TEST 'Builtins-x86_64-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m64 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-x86_64.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..

In D130956#3697844, @fmayer wrote:

This seems related to an error we see on our sanitizer bots: https://lab.llvm.org/buildbot/#/builders/37/builds/15446/steps/33/logs/stdio


Random guess, we forgot to take into account that tzcnt and bsf don't set the flags the same way. If the flags from the bsf are being used we can't put an f3 in front of it.

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

In D130956#3697869, @craig.topper wrote:

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

Thanks! I can also verify that a revert actually fixes this, but judging from the Changes in the first build that introduced the failure, I think it's very likely that it was this.

craig.topper added a reverting change: rG84e91948289c: Revert "[X86][MC] Always emit `rep` prefix for `bsf`". Aug 3 2022, 2:52 PM

craig.topper reopened this revision. Aug 3 2022, 2:52 PM

This revision is now accepted and ready to land. Aug 3 2022, 2:52 PM

In D130956#3697870, @fmayer wrote:

In D130956#3697869, @craig.topper wrote:

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

Thanks! I can also verify that a revert actually fixes this, but judging from the Changes in the first build that introduced the failure, I think it's very likely that it was this.

It was definitely this. The ffs in the name of the failing function is 'find first set' bit. That's what the bsf and tzcnt instructions do.
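
The failure can be reproduced with a small model of the two instructions. This is a sketch under stated assumptions, not the compiler-rt source or the generated code: it assumes the failing `__ffssi2` was compiled to a bit scan followed by a select on ZF, it models the destination of `bsf` as unchanged for a zero source (architecturally it is undefined), and all names are hypothetical. The flag behavior follows the instruction documentation: `bsf` sets ZF when the source is zero, while `tzcnt` sets CF when the source is zero and ZF when the result is zero.

```cpp
#include <cstdint>

struct BitScanResult {
  std::uint32_t Dest; // destination register after the instruction
  bool ZF;            // zero flag
  bool CF;            // carry flag (undefined for BSF, modelled as false)
};

// Index of the lowest set bit, or 32 when X is zero.
constexpr std::uint32_t trailingZeros32(std::uint32_t X) {
  std::uint32_t Count = 0;
  while (Count < 32 && ((X >> Count) & 1u) == 0)
    ++Count;
  return Count;
}

// BSF model: ZF reports a zero source; the destination is left as OldDest.
constexpr BitScanResult bsf32(std::uint32_t Src, std::uint32_t OldDest) {
  return Src == 0 ? BitScanResult{OldDest, true, false}
                  : BitScanResult{trailingZeros32(Src), false, false};
}

// TZCNT model: CF reports a zero source; ZF reports a zero result.
constexpr BitScanResult tzcnt32(std::uint32_t Src) {
  return BitScanResult{trailingZeros32(Src), trailingZeros32(Src) == 0,
                       Src == 0};
}

// ffs as assumed to be compiled for BSF: select 0 when ZF is set.
constexpr int ffsUsingZF(BitScanResult R) {
  return R.ZF ? 0 : static_cast<int>(R.Dest) + 1;
}

// What a TZCNT-aware select would have to test instead: CF.
constexpr int ffsUsingCF(BitScanResult R) {
  return R.CF ? 0 : static_cast<int>(R.Dest) + 1;
}
```

With these definitions, `ffsUsingZF(bsf32(0, 0))` is 0, while `ffsUsingZF(tzcnt32(0))` is 33, which matches the value in the bot log. Since the compiler cannot know which instruction the CPU will execute, neither select is safe once the prefix is added and the flags are consumed.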

craig.topper added inline comments. Aug 3 2022, 2:57 PM

llvm/test/CodeGen/X86/dagcombine-select.ll
310	(On Diff #449601)	This is incorrect. If the instruction is detected as tzcnt, the cmov would need to be cmovaeq, not cmovneq. Since we can't know at compile time which CPU we'll be running on, we can't do this transform if the flags are used.

craig.topper requested changes to this revision. Aug 3 2022, 2:57 PM

This revision now requires changes to proceed. Aug 3 2022, 2:57 PM

Make sure EFLAGS is not used.

I didn't notice the difference in EFLAGS. Thanks @fmayer and @craig.topper!

Harbormaster completed remote builds in B179206: Diff 449881. Aug 3 2022, 11:34 PM

LGTM with the comment typo fixed

llvm/lib/Target/X86/X86MCInstLower.cpp
989	latter -> later

This revision is now accepted and ready to land. Aug 4 2022, 9:38 AM

This revision was landed with ongoing or failed builds. Aug 4 2022, 7:26 PM

Closed by commit rG7f648d27a85a: Reland "[X86][MC] Always emit `rep` prefix for `bsf`" (authored by pengfei).

This revision was automatically updated to reflect the committed changes.

pengfei added a commit: rG7f648d27a85a: Reland "[X86][MC] Always emit `rep` prefix for `bsf`".

Thanks @craig.topper!

Revision Contents

Path	Size
llvm/lib/Target/X86/X86MCInstLower.cpp	9 lines
llvm/test/CodeGen/X86/clz.ll	107 lines
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll	2 lines
llvm/test/CodeGen/X86/stack-folding-x86_64.ll	4 lines

Diff 450206

llvm/lib/Target/X86/X86MCInstLower.cpp

Show First 20 Lines • Show All 976 Lines • ▼ Show 20 Lines	if (MI->getDesc().isCommutable() &&
         (TSFlags & X86II::OpMapMask) == X86II::TB &&
         (TSFlags & X86II::FormMask) == X86II::MRMSrcReg &&
         !(TSFlags & X86II::VEX_W) && (TSFlags & X86II::VEX_4V) &&
         OutMI.getNumOperands() == 3) {
       if (!X86II::isX86_64ExtendedReg(OutMI.getOperand(1).getReg()) &&
           X86II::isX86_64ExtendedReg(OutMI.getOperand(2).getReg()))
         std::swap(OutMI.getOperand(1), OutMI.getOperand(2));
     }
+    // Add an REP prefix to BSF instructions so that new processors can
+    // recognize as TZCNT, which has better performance than BSF.
+    if (X86::isBSF(OutMI.getOpcode()) && !MF.getFunction().hasOptSize()) {
+      // BSF and TZCNT have different interpretations on ZF bit. So make sure
+      // it won't be used later.
+      const MachineOperand *FlagDef = MI->findRegisterDefOperand(X86::EFLAGS);
+      if (FlagDef && FlagDef->isDead())
+        OutMI.setFlags(X86::IP_HAS_REPEAT);
+    }
     break;
   }
   }
 }

 void X86AsmPrinter::LowerTlsAddr(X86MCInstLower &MCInstLowering,
                                  const MachineInstr &MI) {
   NoAutoPaddingScope NoPadScope(*OutStreamer);
▲ Show 20 Lines • Show All 1,710 Lines • Show Last 20 Lines
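
The guard added to X86MCInstLower.cpp can be summarized as a standalone predicate. This is a sketch with hypothetical names (`BsfLoweringInfo`, `shouldEmitRepPrefix`); the real code queries LLVM's `MachineInstr` and `Function` objects, which are replaced here by plain booleans.

```cpp
// Hypothetical stand-ins for the facts the lowering code queries.
struct BsfLoweringInfo {
  bool IsBsf;              // result of X86::isBSF(opcode)
  bool FunctionHasOptSize; // result of MF.getFunction().hasOptSize()
  bool DefinesEflags;      // an EFLAGS def operand was found
  bool EflagsDefIsDead;    // that def is marked dead (flags never read)
};

// The prefix is only added when it cannot change observable behavior:
// the instruction is a BSF, the function is not optimized for size,
// and the flags written by the instruction are never consumed.
constexpr bool shouldEmitRepPrefix(const BsfLoweringInfo &Info) {
  return Info.IsBsf && !Info.FunctionHasOptSize && Info.DefinesEflags &&
         Info.EflagsDefIsDead;
}
```

The dead-flags condition is the part added by the reland: when a later instruction reads the flags, the prefix is omitted and plain `bsf` is emitted.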

llvm/test/CodeGen/X86/clz.ll

Show All 12 Lines
 declare i8 @llvm.ctlz.i8(i8, i1)
 declare i16 @llvm.ctlz.i16(i16, i1)
 declare i32 @llvm.ctlz.i32(i32, i1)
 declare i64 @llvm.ctlz.i64(i64, i1)

 define i8 @cttz_i8(i8 %x) {
 ; X86-LABEL: cttz_i8:
 ; X86: # %bb.0:
-; X86-NEXT: bsfl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: rep bsfl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: # kill: def $al killed $al killed $eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i8:
 ; X64: # %bb.0:
-; X64-NEXT: bsfl %edi, %eax
+; X64-NEXT: rep bsfl %edi, %eax
 ; X64-NEXT: # kill: def $al killed $al killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i8:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: # kill: def $al killed $al killed $eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: cttz_i8:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: tzcntl %edi, %eax
 ; X64-CLZ-NEXT: # kill: def $al killed $al killed $eax
 ; X64-CLZ-NEXT: retq
   %tmp = call i8 @llvm.cttz.i8( i8 %x, i1 true )
   ret i8 %tmp
 }

 define i16 @cttz_i16(i16 %x) {
 ; X86-LABEL: cttz_i16:
 ; X86: # %bb.0:
-; X86-NEXT: bsfl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: rep bsfl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: # kill: def $ax killed $ax killed $eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i16:
 ; X64: # %bb.0:
-; X64-NEXT: bsfl %edi, %eax
+; X64-NEXT: rep bsfl %edi, %eax
 ; X64-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i16:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: # kill: def $ax killed $ax killed $eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: cttz_i16:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: tzcntl %edi, %eax
 ; X64-CLZ-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-CLZ-NEXT: retq
   %tmp = call i16 @llvm.cttz.i16( i16 %x, i1 true )
   ret i16 %tmp
 }

 define i32 @cttz_i32(i32 %x) {
 ; X86-LABEL: cttz_i32:
 ; X86: # %bb.0:
-; X86-NEXT: bsfl {{[0-9]+}}(%esp), %eax
+; X86-NEXT: rep bsfl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i32:
 ; X64: # %bb.0:
-; X64-NEXT: bsfl %edi, %eax
+; X64-NEXT: rep bsfl %edi, %eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i32:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: cttz_i32:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: tzcntl %edi, %eax
 ; X64-CLZ-NEXT: retq
   %tmp = call i32 @llvm.cttz.i32( i32 %x, i1 true )
   ret i32 %tmp
 }

 define i64 @cttz_i64(i64 %x) {
 ; X86-NOCMOV-LABEL: cttz_i64:
 ; X86-NOCMOV: # %bb.0:
 ; X86-NOCMOV-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NOCMOV-NEXT: testl %eax, %eax
 ; X86-NOCMOV-NEXT: jne .LBB3_1
 ; X86-NOCMOV-NEXT: # %bb.2:
-; X86-NOCMOV-NEXT: bsfl {{[0-9]+}}(%esp), %eax
+; X86-NOCMOV-NEXT: rep bsfl {{[0-9]+}}(%esp), %eax
 ; X86-NOCMOV-NEXT: addl $32, %eax
 ; X86-NOCMOV-NEXT: xorl %edx, %edx
 ; X86-NOCMOV-NEXT: retl
 ; X86-NOCMOV-NEXT: .LBB3_1:
-; X86-NOCMOV-NEXT: bsfl %eax, %eax
+; X86-NOCMOV-NEXT: rep bsfl %eax, %eax
 ; X86-NOCMOV-NEXT: xorl %edx, %edx
 ; X86-NOCMOV-NEXT: retl
 ;
 ; X86-CMOV-LABEL: cttz_i64:
 ; X86-CMOV: # %bb.0:
 ; X86-CMOV-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-CMOV-NEXT: bsfl %ecx, %edx
-; X86-CMOV-NEXT: bsfl {{[0-9]+}}(%esp), %eax
+; X86-CMOV-NEXT: rep bsfl %ecx, %edx
+; X86-CMOV-NEXT: rep bsfl {{[0-9]+}}(%esp), %eax
 ; X86-CMOV-NEXT: addl $32, %eax
 ; X86-CMOV-NEXT: testl %ecx, %ecx
 ; X86-CMOV-NEXT: cmovnel %edx, %eax
 ; X86-CMOV-NEXT: xorl %edx, %edx
 ; X86-CMOV-NEXT: retl
 ;
 ; X64-LABEL: cttz_i64:
 ; X64: # %bb.0:
-; X64-NEXT: bsfq %rdi, %rax
+; X64-NEXT: rep bsfq %rdi, %rax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i64:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: testl %eax, %eax
 ; X86-CLZ-NEXT: jne .LBB3_1
 ; X86-CLZ-NEXT: # %bb.2:
▲ Show 20 Lines • Show All 378 Lines • ▼ Show 20 Lines
 define i8 @cttz_i8_zero_test(i8 %n) {
 ; X86-LABEL: cttz_i8_zero_test:
 ; X86: # %bb.0:
 ; X86-NEXT: movzbl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: testb %al, %al
 ; X86-NEXT: je .LBB12_1
 ; X86-NEXT: # %bb.2: # %cond.false
 ; X86-NEXT: movzbl %al, %eax
-; X86-NEXT: bsfl %eax, %eax
+; X86-NEXT: rep bsfl %eax, %eax
 ; X86-NEXT: # kill: def $al killed $al killed $eax
 ; X86-NEXT: retl
 ; X86-NEXT: .LBB12_1:
 ; X86-NEXT: movb $8, %al
 ; X86-NEXT: # kill: def $al killed $al killed $eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i8_zero_test:
 ; X64: # %bb.0:
 ; X64-NEXT: testb %dil, %dil
 ; X64-NEXT: je .LBB12_1
 ; X64-NEXT: # %bb.2: # %cond.false
 ; X64-NEXT: movzbl %dil, %eax
-; X64-NEXT: bsfl %eax, %eax
+; X64-NEXT: rep bsfl %eax, %eax
 ; X64-NEXT: # kill: def $al killed $al killed $eax
 ; X64-NEXT: retq
 ; X64-NEXT: .LBB12_1:
 ; X64-NEXT: movb $8, %al
 ; X64-NEXT: # kill: def $al killed $al killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i8_zero_test:
Show All 17 Lines
 ; Generate a test and branch to handle zero inputs because bsr/bsf are very slow.
 define i16 @cttz_i16_zero_test(i16 %n) {
 ; X86-LABEL: cttz_i16_zero_test:
 ; X86: # %bb.0:
 ; X86-NEXT: movzwl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: testw %ax, %ax
 ; X86-NEXT: je .LBB13_1
 ; X86-NEXT: # %bb.2: # %cond.false
-; X86-NEXT: bsfl %eax, %eax
+; X86-NEXT: rep bsfl %eax, %eax
 ; X86-NEXT: # kill: def $ax killed $ax killed $eax
 ; X86-NEXT: retl
 ; X86-NEXT: .LBB13_1:
 ; X86-NEXT: movw $16, %ax
 ; X86-NEXT: # kill: def $ax killed $ax killed $eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i16_zero_test:
 ; X64: # %bb.0:
 ; X64-NEXT: testw %di, %di
 ; X64-NEXT: je .LBB13_1
 ; X64-NEXT: # %bb.2: # %cond.false
-; X64-NEXT: bsfl %edi, %eax
+; X64-NEXT: rep bsfl %edi, %eax
 ; X64-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-NEXT: retq
 ; X64-NEXT: .LBB13_1:
 ; X64-NEXT: movw $16, %ax
 ; X64-NEXT: # kill: def $ax killed $ax killed $eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i16_zero_test:
Show All 17 Lines
 ; Generate a test and branch to handle zero inputs because bsr/bsf are very slow.
 define i32 @cttz_i32_zero_test(i32 %n) {
 ; X86-LABEL: cttz_i32_zero_test:
 ; X86: # %bb.0:
 ; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
 ; X86-NEXT: testl %eax, %eax
 ; X86-NEXT: je .LBB14_1
 ; X86-NEXT: # %bb.2: # %cond.false
-; X86-NEXT: bsfl %eax, %eax
+; X86-NEXT: rep bsfl %eax, %eax
 ; X86-NEXT: retl
 ; X86-NEXT: .LBB14_1:
 ; X86-NEXT: movl $32, %eax
 ; X86-NEXT: retl
 ;
 ; X64-LABEL: cttz_i32_zero_test:
 ; X64: # %bb.0:
 ; X64-NEXT: testl %edi, %edi
 ; X64-NEXT: je .LBB14_1
 ; X64-NEXT: # %bb.2: # %cond.false
-; X64-NEXT: bsfl %edi, %eax
+; X64-NEXT: rep bsfl %edi, %eax
 ; X64-NEXT: retq
 ; X64-NEXT: .LBB14_1:
 ; X64-NEXT: movl $32, %eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i32_zero_test:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
 ; X86-CLZ-NEXT: retl
 ;
 ; X64-CLZ-LABEL: cttz_i32_zero_test:
 ; X64-CLZ: # %bb.0:
 ; X64-CLZ-NEXT: tzcntl %edi, %eax
 ; X64-CLZ-NEXT: retq
   %tmp1 = call i32 @llvm.cttz.i32(i32 %n, i1 false)
   ret i32 %tmp1
 }

 ; Generate a test and branch to handle zero inputs because bsr/bsf are very slow.
 define i64 @cttz_i64_zero_test(i64 %n) {
 ; X86-NOCMOV-LABEL: cttz_i64_zero_test:
 ; X86-NOCMOV: # %bb.0:
 ; X86-NOCMOV-NEXT: movl {{[0-9]+}}(%esp), %ecx
+; X86-NOCMOV-NOT: rep
 ; X86-NOCMOV-NEXT: bsfl {{[0-9]+}}(%esp), %edx
 ; X86-NOCMOV-NEXT: movl $32, %eax
 ; X86-NOCMOV-NEXT: je .LBB15_2
 ; X86-NOCMOV-NEXT: # %bb.1:
 ; X86-NOCMOV-NEXT: movl %edx, %eax
 ; X86-NOCMOV-NEXT: .LBB15_2:
 ; X86-NOCMOV-NEXT: testl %ecx, %ecx
 ; X86-NOCMOV-NEXT: jne .LBB15_3
 ; X86-NOCMOV-NEXT: # %bb.4:
 ; X86-NOCMOV-NEXT: addl $32, %eax
 ; X86-NOCMOV-NEXT: xorl %edx, %edx
 ; X86-NOCMOV-NEXT: retl
 ; X86-NOCMOV-NEXT: .LBB15_3:
-; X86-NOCMOV-NEXT: bsfl %ecx, %eax
+; X86-NOCMOV-NEXT: rep bsfl %ecx, %eax
 ; X86-NOCMOV-NEXT: xorl %edx, %edx
 ; X86-NOCMOV-NEXT: retl
 ;
 ; X86-CMOV-LABEL: cttz_i64_zero_test:
 ; X86-CMOV: # %bb.0:
 ; X86-CMOV-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-CMOV-NOT: rep
 ; X86-CMOV-NEXT: bsfl {{[0-9]+}}(%esp), %ecx
 ; X86-CMOV-NEXT: movl $32, %edx
 ; X86-CMOV-NEXT: cmovnel %ecx, %edx
 ; X86-CMOV-NEXT: addl $32, %edx
+; X86-CMOV-NOT: rep
 ; X86-CMOV-NEXT: bsfl %eax, %eax
 ; X86-CMOV-NEXT: cmovel %edx, %eax
 ; X86-CMOV-NEXT: xorl %edx, %edx
 ; X86-CMOV-NEXT: retl
 ;
 ; X64-LABEL: cttz_i64_zero_test:
 ; X64: # %bb.0:
 ; X64-NEXT: testq %rdi, %rdi
 ; X64-NEXT: je .LBB15_1
 ; X64-NEXT: # %bb.2: # %cond.false
-; X64-NEXT: bsfq %rdi, %rax
+; X64-NEXT: rep bsfq %rdi, %rax
 ; X64-NEXT: retq
 ; X64-NEXT: .LBB15_1:
 ; X64-NEXT: movl $64, %eax
 ; X64-NEXT: retq
 ;
 ; X86-CLZ-LABEL: cttz_i64_zero_test:
 ; X86-CLZ: # %bb.0:
 ; X86-CLZ-NEXT: movl {{[0-9]+}}(%esp), %eax
... (133 lines not shown) ...
}		}

define i8 @cttz_i8_knownbits(i8 %x) {		define i8 @cttz_i8_knownbits(i8 %x) {
; X86-LABEL: cttz_i8_knownbits:		; X86-LABEL: cttz_i8_knownbits:
; X86: # %bb.0:		; X86: # %bb.0:
; X86-NEXT: movzbl {{[0-9]+}}(%esp), %eax		; X86-NEXT: movzbl {{[0-9]+}}(%esp), %eax
; X86-NEXT: orb $2, %al		; X86-NEXT: orb $2, %al
; X86-NEXT: movzbl %al, %eax		; X86-NEXT: movzbl %al, %eax
; X86-NEXT: bsfl %eax, %eax		; X86-NEXT: rep bsfl %eax, %eax
; X86-NEXT: # kill: def $al killed $al killed $eax		; X86-NEXT: # kill: def $al killed $al killed $eax
; X86-NEXT: retl		; X86-NEXT: retl
;		;
; X64-LABEL: cttz_i8_knownbits:		; X64-LABEL: cttz_i8_knownbits:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: orb $2, %dil		; X64-NEXT: orb $2, %dil
; X64-NEXT: movzbl %dil, %eax		; X64-NEXT: movzbl %dil, %eax
; X64-NEXT: bsfl %eax, %eax		; X64-NEXT: rep bsfl %eax, %eax
; X64-NEXT: # kill: def $al killed $al killed $eax		; X64-NEXT: # kill: def $al killed $al killed $eax
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X86-CLZ-LABEL: cttz_i8_knownbits:		; X86-CLZ-LABEL: cttz_i8_knownbits:
; X86-CLZ: # %bb.0:		; X86-CLZ: # %bb.0:
; X86-CLZ-NEXT: movzbl {{[0-9]+}}(%esp), %eax		; X86-CLZ-NEXT: movzbl {{[0-9]+}}(%esp), %eax
; X86-CLZ-NEXT: orb $2, %al		; X86-CLZ-NEXT: orb $2, %al
; X86-CLZ-NEXT: movzbl %al, %eax		; X86-CLZ-NEXT: movzbl %al, %eax
... (136 lines not shown) ...
; X86-NOCMOV-LABEL: cttz_i64_zero_test_knownneverzero:		; X86-NOCMOV-LABEL: cttz_i64_zero_test_knownneverzero:
; X86-NOCMOV: # %bb.0:		; X86-NOCMOV: # %bb.0:
; X86-NOCMOV-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-NOCMOV-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-NOCMOV-NEXT: testl %eax, %eax		; X86-NOCMOV-NEXT: testl %eax, %eax
; X86-NOCMOV-NEXT: jne .LBB22_1		; X86-NOCMOV-NEXT: jne .LBB22_1
; X86-NOCMOV-NEXT: # %bb.2:		; X86-NOCMOV-NEXT: # %bb.2:
; X86-NOCMOV-NEXT: movl $-2147483648, %eax # imm = 0x80000000		; X86-NOCMOV-NEXT: movl $-2147483648, %eax # imm = 0x80000000
; X86-NOCMOV-NEXT: orl {{[0-9]+}}(%esp), %eax		; X86-NOCMOV-NEXT: orl {{[0-9]+}}(%esp), %eax
; X86-NOCMOV-NEXT: bsfl %eax, %eax		; X86-NOCMOV-NEXT: rep bsfl %eax, %eax
; X86-NOCMOV-NEXT: orl $32, %eax		; X86-NOCMOV-NEXT: orl $32, %eax
; X86-NOCMOV-NEXT: xorl %edx, %edx		; X86-NOCMOV-NEXT: xorl %edx, %edx
; X86-NOCMOV-NEXT: retl		; X86-NOCMOV-NEXT: retl
; X86-NOCMOV-NEXT: .LBB22_1:		; X86-NOCMOV-NEXT: .LBB22_1:
; X86-NOCMOV-NEXT: bsfl %eax, %eax		; X86-NOCMOV-NEXT: rep bsfl %eax, %eax
; X86-NOCMOV-NEXT: xorl %edx, %edx		; X86-NOCMOV-NEXT: xorl %edx, %edx
; X86-NOCMOV-NEXT: retl		; X86-NOCMOV-NEXT: retl
;		;
; X86-CMOV-LABEL: cttz_i64_zero_test_knownneverzero:		; X86-CMOV-LABEL: cttz_i64_zero_test_knownneverzero:
; X86-CMOV: # %bb.0:		; X86-CMOV: # %bb.0:
; X86-CMOV-NEXT: movl {{[0-9]+}}(%esp), %ecx		; X86-CMOV-NEXT: movl {{[0-9]+}}(%esp), %ecx
; X86-CMOV-NEXT: movl $-2147483648, %eax # imm = 0x80000000		; X86-CMOV-NEXT: movl $-2147483648, %eax # imm = 0x80000000
; X86-CMOV-NEXT: orl {{[0-9]+}}(%esp), %eax		; X86-CMOV-NEXT: orl {{[0-9]+}}(%esp), %eax
; X86-CMOV-NEXT: bsfl %ecx, %edx		; X86-CMOV-NEXT: rep bsfl %ecx, %edx
; X86-CMOV-NEXT: bsfl %eax, %eax		; X86-CMOV-NEXT: rep bsfl %eax, %eax
; X86-CMOV-NEXT: orl $32, %eax		; X86-CMOV-NEXT: orl $32, %eax
; X86-CMOV-NEXT: testl %ecx, %ecx		; X86-CMOV-NEXT: testl %ecx, %ecx
; X86-CMOV-NEXT: cmovnel %edx, %eax		; X86-CMOV-NEXT: cmovnel %edx, %eax
; X86-CMOV-NEXT: xorl %edx, %edx		; X86-CMOV-NEXT: xorl %edx, %edx
; X86-CMOV-NEXT: retl		; X86-CMOV-NEXT: retl
;		;
; X64-LABEL: cttz_i64_zero_test_knownneverzero:		; X64-LABEL: cttz_i64_zero_test_knownneverzero:
; X64: # %bb.0:		; X64: # %bb.0:
; X64-NEXT: movabsq $-9223372036854775808, %rax # imm = 0x8000000000000000		; X64-NEXT: movabsq $-9223372036854775808, %rax # imm = 0x8000000000000000
; X64-NEXT: orq %rdi, %rax		; X64-NEXT: orq %rdi, %rax
; X64-NEXT: bsfq %rax, %rax		; X64-NEXT: rep bsfq %rax, %rax
; X64-NEXT: retq		; X64-NEXT: retq
;		;
; X86-CLZ-LABEL: cttz_i64_zero_test_knownneverzero:		; X86-CLZ-LABEL: cttz_i64_zero_test_knownneverzero:
; X86-CLZ: # %bb.0:		; X86-CLZ: # %bb.0:
; X86-CLZ-NEXT: movl {{[0-9]+}}(%esp), %eax		; X86-CLZ-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-CLZ-NEXT: testl %eax, %eax		; X86-CLZ-NEXT: testl %eax, %eax
; X86-CLZ-NEXT: jne .LBB22_1		; X86-CLZ-NEXT: jne .LBB22_1
; X86-CLZ-NEXT: # %bb.2:		; X86-CLZ-NEXT: # %bb.2:
... (84 lines not shown) ...	; X64-CLZ-NEXT: retq
%ctlz = tail call i32 @llvm.ctlz.i32(i32 %a0, i1 true)		%ctlz = tail call i32 @llvm.ctlz.i32(i32 %a0, i1 true)
%xor = xor i32 %ctlz, 31		%xor = xor i32 %ctlz, 31
%zext = zext i32 %xor to i64		%zext = zext i32 %xor to i64
%gep = getelementptr inbounds [32 x i8], ptr %a1, i64 0, i64 %zext		%gep = getelementptr inbounds [32 x i8], ptr %a1, i64 0, i64 %zext
%load = load i8, ptr %gep, align 1		%load = load i8, ptr %gep, align 1
%sext = sext i8 %load to i32		%sext = sext i8 %load to i32
ret i32 %sext		ret i32 %sext
}		}

		define i32 @cttz_i32_osize(i32 %x) optsize {
		RKSimon: Please can you add a minsize test as well?
		; X86-LABEL: cttz_i32_osize:
		; X86: # %bb.0:
		; X86-NOT: rep
		; X86-NEXT: bsfl {{[0-9]+}}(%esp), %eax
		; X86-NEXT: retl
		;
		; X64-LABEL: cttz_i32_osize:
		; X64: # %bb.0:
		; X64-NOT: rep
		; X64-NEXT: bsfl %edi, %eax
		; X64-NEXT: retq
		;
		; X86-CLZ-LABEL: cttz_i32_osize:
		; X86-CLZ: # %bb.0:
		; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
		; X86-CLZ-NEXT: retl
		;
		; X64-CLZ-LABEL: cttz_i32_osize:
		; X64-CLZ: # %bb.0:
		; X64-CLZ-NEXT: tzcntl %edi, %eax
		; X64-CLZ-NEXT: retq
		%tmp = call i32 @llvm.cttz.i32( i32 %x, i1 true)
		ret i32 %tmp
		}

		define i32 @cttz_i32_msize(i32 %x) minsize {
		; X86-LABEL: cttz_i32_msize:
		; X86: # %bb.0:
		; X86-NOT: rep
		; X86-NEXT: bsfl {{[0-9]+}}(%esp), %eax
		; X86-NEXT: retl
		;
		; X64-LABEL: cttz_i32_msize:
		; X64: # %bb.0:
		; X64-NOT: rep
		; X64-NEXT: bsfl %edi, %eax
		; X64-NEXT: retq
		;
		; X86-CLZ-LABEL: cttz_i32_msize:
		; X86-CLZ: # %bb.0:
		; X86-CLZ-NEXT: tzcntl {{[0-9]+}}(%esp), %eax
		; X86-CLZ-NEXT: retl
		;
		; X64-CLZ-LABEL: cttz_i32_msize:
		; X64-CLZ: # %bb.0:
		; X64-CLZ-NEXT: tzcntl %edi, %eax
		; X64-CLZ-NEXT: retq
		%tmp = call i32 @llvm.cttz.i32( i32 %x, i1 true)
		ret i32 %tmp
		}
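
Read together, the check lines above suggest when the prefix is added. Plain `bsf` is kept under `optsize`/`minsize`, where the prefix would cost one byte. It is also kept where a following `je` or `cmov` consumes the flags written by `bsf`, since `tzcnt` defines ZF and CF differently. The Python sketch below only models that inference from the tests. `BsfSite` and `wants_rep_prefix` are hypothetical names, not code from X86MCInstLower.cpp, and the inline-assembly case reflects the review discussion rather than a known implementation detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BsfSite:
    """Hypothetical description of one bsf occurrence (names are assumptions)."""
    optimize_for_size: bool  # function is marked optsize or minsize
    eflags_dead: bool        # no later instruction reads the flags bsf writes
    from_inline_asm: bool    # text was written verbatim in an asm statement


def wants_rep_prefix(site: BsfSite) -> bool:
    """Model of the behaviour the updated CHECK lines expect."""
    if site.from_inline_asm:
        # The user asked for bsf explicitly; the tests expect no rep here.
        return False
    if site.optimize_for_size:
        # The prefix costs one extra byte, so size-optimized code keeps bsf.
        return False
    # bsf and tzcnt disagree on ZF/CF, so the prefix is only safe when
    # nothing consumes the flags afterwards.
    return site.eflags_dead


# cttz_i32_osize / cttz_i32_msize: no prefix expected.
assert not wants_rep_prefix(BsfSite(True, True, False))
# bsfl followed by cmovnel in cttz_i64_zero_test: flags are live, no prefix.
assert not wants_rep_prefix(BsfSite(False, False, False))
# cttz_i8_knownbits: flags unused, not size-optimized, prefix expected.
assert wants_rep_prefix(BsfSite(False, True, False))
```

Functions compiled with BMI (the `-CLZ` prefixes) select `tzcnt` directly, so they never reach this decision.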

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll

	... (347 lines not shown) ...
	define i1 @asm_clobbering_flags(ptr %mem) nounwind {			define i1 @asm_clobbering_flags(ptr %mem) nounwind {
	; CHECK32-LABEL: asm_clobbering_flags:			; CHECK32-LABEL: asm_clobbering_flags:
	; CHECK32: # %bb.0: # %entry			; CHECK32: # %bb.0: # %entry
	; CHECK32-NEXT: movl {{[0-9]+}}(%esp), %ecx			; CHECK32-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; CHECK32-NEXT: movl (%ecx), %edx			; CHECK32-NEXT: movl (%ecx), %edx
	; CHECK32-NEXT: testl %edx, %edx			; CHECK32-NEXT: testl %edx, %edx
	; CHECK32-NEXT: setg %al			; CHECK32-NEXT: setg %al
	; CHECK32-NEXT: #APP			; CHECK32-NEXT: #APP
				; CHECK32-NOT: rep
	; CHECK32-NEXT: bsfl %edx, %edx			; CHECK32-NEXT: bsfl %edx, %edx
	; CHECK32-NEXT: #NO_APP			; CHECK32-NEXT: #NO_APP
	; CHECK32-NEXT: movl %edx, (%ecx)			; CHECK32-NEXT: movl %edx, (%ecx)
	; CHECK32-NEXT: retl			; CHECK32-NEXT: retl
	;			;
	; CHECK64-LABEL: asm_clobbering_flags:			; CHECK64-LABEL: asm_clobbering_flags:
	; CHECK64: # %bb.0: # %entry			; CHECK64: # %bb.0: # %entry
	; CHECK64-NEXT: movl (%rdi), %ecx			; CHECK64-NEXT: movl (%rdi), %ecx
	; CHECK64-NEXT: testl %ecx, %ecx			; CHECK64-NEXT: testl %ecx, %ecx
	; CHECK64-NEXT: setg %al			; CHECK64-NEXT: setg %al
	; CHECK64-NEXT: #APP			; CHECK64-NEXT: #APP
				; CHECK64-NOT: rep
	; CHECK64-NEXT: bsfl %ecx, %ecx			; CHECK64-NEXT: bsfl %ecx, %ecx
	; CHECK64-NEXT: #NO_APP			; CHECK64-NEXT: #NO_APP
	; CHECK64-NEXT: movl %ecx, (%rdi)			; CHECK64-NEXT: movl %ecx, (%rdi)
	; CHECK64-NEXT: retq			; CHECK64-NEXT: retq
	entry:			entry:
	%val = load i32, ptr %mem, align 4			%val = load i32, ptr %mem, align 4
	%cmp = icmp sgt i32 %val, 0			%cmp = icmp sgt i32 %val, 0
	%res = tail call i32 asm "bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)			%res = tail call i32 asm "bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
				aaronpuchert: This goes a bit too far, I think. We can turn a generic builtin into `rep; bsf`, but if the inline assembly explicitly asks for `bsf`, I think we should emit that.
				pengfei: You are right. So this is to check that no REP prefix was generated. :)
				aaronpuchert: Sorry, I missed the `-NOT`. But is that needed? After all, the `-NEXT` shouldn't match if there is a `rep` in between. If there is a reason to keep this, the second occurrence should likely be `CHECK64` instead of `CHECK32`.
				skan: I think the two `CHECK-NOT` are redundant here because `CHECK-NEXT` can guarantee that `#APP` is followed by `bsfl`.
				craig.topper: Doesn't the CHECK-NEXT only guarantee that the next line contains bsfl? It doesn't rule out any text before the bsfl.
				skan: I think if there is a `rep` between `#APP` and `#NO_APP`, the following check will fail:
					; CHECK32-NEXT: #APP
					; CHECK32-NEXT: bsfl %edx, %edx
					; CHECK32-NEXT: #NO_APP
				craig.topper: The test still passes with this change:
					diff --git a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
					index f3d4b6221d08..4c7094d5c5f0 100644
					--- a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
					+++ b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
					@@ -371,7 +371,7 @@ define i1 @asm_clobbering_flags(ptr %mem) nounwind {
					 entry:
					   %val = load i32, ptr %mem, align 4
					   %cmp = icmp sgt i32 %val, 0
					-  %res = tail call i32 asm "bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
					+  %res = tail call i32 asm "rep bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
					   store i32 %res, ptr %mem, align 4
					   ret i1 %cmp
				That puts rep on the same line before bsfl in the output. Every test change in this review shows rep on the same line as the bsfl.
				pengfei: Craig is correct. We had better add an explicit check to avoid a false negative, though I think regenerating from the tool will emit the `rep` in the opposite case.
	store i32 %res, ptr %mem, align 4			store i32 %res, ptr %mem, align 4
	ret i1 %cmp			ret i1 %cmp
	}			}
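
The disagreement in the inline comments on this function turns on how `CHECK-NEXT` matches: the pattern only has to occur somewhere on the next line, so `bsfl %edx, %edx` also matches a line reading `rep bsfl %edx, %edx`. The Python sketch below is a deliberately simplified model of that behaviour, not FileCheck itself. It uses plain substring search, ignores regex blocks and whitespace canonicalization, and `run_checks` is a hypothetical helper name.

```python
from typing import Optional


def run_checks(output: str, first: str, forbidden: Optional[str], nxt: str) -> bool:
    """Model of: CHECK: first / [CHECK-NOT: forbidden] / CHECK-NEXT: nxt."""
    start = output.find(first)
    if start < 0:
        return False
    after_first = start + len(first)

    pos = output.find(nxt, after_first)
    if pos < 0:
        return False

    # Text between the end of the first match and the start of the next one.
    gap = output[after_first:pos]

    # CHECK-NEXT: the second match must sit on the line right after the first.
    if gap.count("\n") != 1:
        return False

    # CHECK-NOT: the forbidden text must not appear between the two matches.
    if forbidden is not None and forbidden in gap:
        return False
    return True


with_rep = "#APP\n\trep bsfl %edx, %edx\n#NO_APP\n"
without_rep = "#APP\n\tbsfl %edx, %edx\n#NO_APP\n"

# Without CHECK-NOT, the prefixed output still passes, as craig.topper observed.
assert run_checks(with_rep, "#APP", None, "bsfl %edx, %edx")
# With CHECK-NOT: rep, the prefixed output is rejected and the plain one passes.
assert not run_checks(with_rep, "#APP", "rep", "bsfl %edx, %edx")
assert run_checks(without_rep, "#APP", "rep", "bsfl %edx, %edx")
```

In this model the `-NOT` line is what catches a prefix that lands on the same line as `bsfl`, which is the case every test change in this review produces.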

llvm/test/CodeGen/X86/stack-folding-x86_64.ll

	... (31 lines not shown) ...
	; CHECK-NEXT: .cfi_offset %r13, -40			; CHECK-NEXT: .cfi_offset %r13, -40
	; CHECK-NEXT: .cfi_offset %r14, -32			; CHECK-NEXT: .cfi_offset %r14, -32
	; CHECK-NEXT: .cfi_offset %r15, -24			; CHECK-NEXT: .cfi_offset %r15, -24
	; CHECK-NEXT: .cfi_offset %rbp, -16			; CHECK-NEXT: .cfi_offset %rbp, -16
	; CHECK-NEXT: movl %edi, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill			; CHECK-NEXT: movl %edi, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
	; CHECK-NEXT: #APP			; CHECK-NEXT: #APP
	; CHECK-NEXT: nop			; CHECK-NEXT: nop
	; CHECK-NEXT: #NO_APP			; CHECK-NEXT: #NO_APP
	; CHECK-NEXT: bsfl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Folded Reload			; CHECK-NEXT: rep bsfl {{[-0-9]+}}(%r{{[sb]}}p), %eax # 4-byte Folded Reload
	; CHECK-NEXT: popq %rbx			; CHECK-NEXT: popq %rbx
	; CHECK-NEXT: .cfi_def_cfa_offset 48			; CHECK-NEXT: .cfi_def_cfa_offset 48
	; CHECK-NEXT: popq %r12			; CHECK-NEXT: popq %r12
	; CHECK-NEXT: .cfi_def_cfa_offset 40			; CHECK-NEXT: .cfi_def_cfa_offset 40
	; CHECK-NEXT: popq %r13			; CHECK-NEXT: popq %r13
	; CHECK-NEXT: .cfi_def_cfa_offset 32			; CHECK-NEXT: .cfi_def_cfa_offset 32
	; CHECK-NEXT: popq %r14			; CHECK-NEXT: popq %r14
	; CHECK-NEXT: .cfi_def_cfa_offset 24			; CHECK-NEXT: .cfi_def_cfa_offset 24
	... (28 lines not shown) ...
	; CHECK-NEXT: .cfi_offset %r13, -40			; CHECK-NEXT: .cfi_offset %r13, -40
	; CHECK-NEXT: .cfi_offset %r14, -32			; CHECK-NEXT: .cfi_offset %r14, -32
	; CHECK-NEXT: .cfi_offset %r15, -24			; CHECK-NEXT: .cfi_offset %r15, -24
	; CHECK-NEXT: .cfi_offset %rbp, -16			; CHECK-NEXT: .cfi_offset %rbp, -16
	; CHECK-NEXT: movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill			; CHECK-NEXT: movq %rdi, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
	; CHECK-NEXT: #APP			; CHECK-NEXT: #APP
	; CHECK-NEXT: nop			; CHECK-NEXT: nop
	; CHECK-NEXT: #NO_APP			; CHECK-NEXT: #NO_APP
	; CHECK-NEXT: bsfq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded Reload			; CHECK-NEXT: rep bsfq {{[-0-9]+}}(%r{{[sb]}}p), %rax # 8-byte Folded Reload
	; CHECK-NEXT: popq %rbx			; CHECK-NEXT: popq %rbx
	; CHECK-NEXT: .cfi_def_cfa_offset 48			; CHECK-NEXT: .cfi_def_cfa_offset 48
	; CHECK-NEXT: popq %r12			; CHECK-NEXT: popq %r12
	; CHECK-NEXT: .cfi_def_cfa_offset 40			; CHECK-NEXT: .cfi_def_cfa_offset 40
	; CHECK-NEXT: popq %r13			; CHECK-NEXT: popq %r13
	; CHECK-NEXT: .cfi_def_cfa_offset 32			; CHECK-NEXT: .cfi_def_cfa_offset 32
	; CHECK-NEXT: popq %r14			; CHECK-NEXT: popq %r14
	; CHECK-NEXT: .cfi_def_cfa_offset 24			; CHECK-NEXT: .cfi_def_cfa_offset 24
	... (105 lines not shown) ...

Diff 450206
