This is an archive of the discontinued LLVM Phabricator instance.

[X86][MC] Always emit `rep` prefix for `bsf`
ClosedPublic

Authored by pengfei on Aug 1 2022, 7:02 PM.

Details

Summary

BMI new instruction tzcnt has better performance than bsf on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
bsf. If we force emit rep prefix for bsf, we will gain better
performance when the same code run on new processors.

GCC has already done this way: https://c.godbolt.org/z/6xere6fs1

Fixes #34191

Diff Detail

Event Timeline

pengfei created this revision.Aug 1 2022, 7:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2022, 7:02 PM
pengfei requested review of this revision.Aug 1 2022, 7:02 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 1 2022, 7:02 PM

Assembler should faithful to the original source. This needs to be done at codegen only I would think.

Also need to print the rep in the assembly listing so -fno-integrated-as works

Assembler should faithful to the original source. This needs to be done at codegen only I would think.

I see, how about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

skan added a comment.Aug 1 2022, 7:13 PM

I believe Linus's suggestion is adding a prefix to BSF rather than changing the encoding.
Maybe you can consider implementing it in X86MCInstLower::Lower by using Flags of MCInst.

skan added a comment.Aug 1 2022, 7:15 PM

Assembler should faithful to the original source. This needs to be done at codegen only I would think.

I see, how about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

No, I don't think it is a good way. It breaks the support of original bsf.

This basically need to be done with a CodeGenOnly pseudo instruction. We can’t change assembler or disassembler behavior

Assembler should faithful to the original source. This needs to be done at codegen only I would think.

I see, how about changing the assembly string to "rep bsf{w}\t{$src, $dst|$dst, $src}"?

Oh, we can't do it consider the decoding. I'll reconsider the solution. Thanks!

Maybe you can consider implementing it in X86MCInstLower::Lower by using Flags of MCInst.

Thanks for the suggestion, let me have a try.

pengfei updated this revision to Diff 449198.Aug 2 2022, 12:08 AM

Compared both methods, adding flag in X86MCInstLower::Lower seems convenient. Thanks Craig and Shengchen!

nikic added inline comments.Aug 2 2022, 12:26 AM
llvm/lib/Target/X86/X86MCInstLower.cpp
986 ↗(On Diff #449198)

Add a comment to explain why we do this?

pengfei updated this revision to Diff 449203.Aug 2 2022, 12:49 AM
pengfei marked an inline comment as done.

Add comment. Thanks @nikic

skan added inline comments.Aug 2 2022, 12:54 AM
llvm/lib/Target/X86/X86MCInstLower.cpp
987 ↗(On Diff #449203)

Do we need to check any target feature here?

craig.topper added inline comments.Aug 2 2022, 1:01 AM
llvm/test/CodeGen/X86/clz.ll
49 ↗(On Diff #449203)

We need to promote bsrw to bsrl. tzcntl is faster than tzcntw. tzcntw has a false dependency to preserve the upper 48 bits of the result register in Intel CPUs. But separate patch please.

craig.topper added inline comments.Aug 2 2022, 1:05 AM
llvm/lib/Target/X86/X86MCInstLower.cpp
987 ↗(On Diff #449203)

I don’t think so.

Probably should skip for minsize though?

skan added inline comments.Aug 2 2022, 1:09 AM
llvm/lib/Target/X86/X86MCInstLower.cpp
987 ↗(On Diff #449203)

I think it's a good suggestion.

pengfei updated this revision to Diff 449214.Aug 2 2022, 1:22 AM

Add check for minsize.

llvm/lib/Target/X86/X86MCInstLower.cpp
987 ↗(On Diff #449203)

Agreed. And good point! Added check for minsize.

By the way, the original bug for this is #34191.

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

Goes a bit too far I think. We can turn a generic builtin into rep; bsf, but if the inline assembly explicitly asks for bsf I think we should emit that.

pengfei added inline comments.Aug 2 2022, 1:32 AM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

You are right. So this is to check *no* REP prefix was generated. :)

pengfei added inline comments.Aug 2 2022, 2:00 AM
llvm/test/CodeGen/X86/clz.ll
49 ↗(On Diff #449203)

Can we allow to always do that? The result is not equal for tzcnt if we care the src is 0: https://godbolt.org/z/enahabbM7

pengfei edited the summary of this revision. (Show Details)Aug 2 2022, 2:26 AM
craig.topper added inline comments.Aug 2 2022, 2:34 AM
llvm/test/CodeGen/X86/clz.ll
49 ↗(On Diff #449203)

I meant in SelectionDAG.

RKSimon added inline comments.Aug 2 2022, 3:08 AM
llvm/test/CodeGen/X86/clz.ll
1114 ↗(On Diff #449214)

please can you add a minsize test as well ?

pengfei updated this revision to Diff 449244.Aug 2 2022, 3:27 AM
pengfei marked an inline comment as done.

Add minsize test.

aaronpuchert added inline comments.Aug 2 2022, 3:32 PM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

Sorry, I missed the -NOT. But is that needed? After all the -NEXT shouldn't match if there is a rep in between.

If there is a reason to keep this, the second occurrence should likely be CHECK64 instead of CHECK32.

skan added inline comments.Aug 2 2022, 6:40 PM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

I think the two CHECK-NOT are redundant here b/c CHECK-NEXT can gurantee #APP is followed by bsfl.

craig.topper added inline comments.Aug 2 2022, 7:24 PM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

Doesn't the CHECK-NEXT only guarantee that the next line contains bsfl. It doesn't rule out any text before the bsfl.

skan added inline comments.Aug 2 2022, 7:45 PM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

I think if there is rep between #APP and #NO_APP, the following check will fail.

; CHECK32-NEXT:    #APP
; CHECK32-NEXT:    bsfl %edx, %edx
; CHECK32-NEXT:    #NO_APP
craig.topper added inline comments.Aug 2 2022, 8:01 PM
llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

The test still passes with this change

diff --git a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
index f3d4b6221d08..4c7094d5c5f0 100644
--- a/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
+++ b/llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
@@ -371,7 +371,7 @@ define i1 @asm_clobbering_flags(ptr %mem) nounwind {
 entry:
   %val = load i32, ptr %mem, align 4
   %cmp = icmp sgt i32 %val, 0
-  %res = tail call i32 asm "bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
+  %res = tail call i32 asm "rep bsfl $1,$0", "=r,r,~{cc},~{dirflag},~{fpsr},~{flags}"(i32 %val)
   store i32 %res, ptr %mem, align 4
   ret i1 %cmp

That puts rep on the same line before bsfl in the output. Every test change in this review shows rep on the same line as the bsfl.

pengfei updated this revision to Diff 449525.Aug 2 2022, 8:25 PM

Fix the check prefix error.

llvm/test/CodeGen/X86/peephole-na-phys-copy-folding.ll
376 ↗(On Diff #449214)

Craig is correct. We'd better to add explicit check to avoid false negative, though I think re-generate from tool will emit the rep for the opposite case.

skan accepted this revision.Aug 2 2022, 11:08 PM

LGTM

This revision is now accepted and ready to land.Aug 2 2022, 11:08 PM
This revision was landed with ongoing or failed builds.Aug 3 2022, 2:09 AM
This revision was automatically updated to reflect the committed changes.
fmayer added a subscriber: fmayer.Aug 3 2022, 2:39 PM

This seems related to an error we see on our sanitizer bots: https://lab.llvm.org/buildbot/#/builders/37/builds/15446/steps/33/logs/stdio

FAIL: Builtins-i386-linux :: ffssi2_test.c (2860 of 8387)
******************** TEST 'Builtins-i386-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m32 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-i386.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.
FAIL: Builtins-x86_64-linux :: ffssi2_test.c (3084 of 8387)
******************** TEST 'Builtins-x86_64-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m64 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-x86_64.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..

This seems related to an error we see on our sanitizer bots: https://lab.llvm.org/buildbot/#/builders/37/builds/15446/steps/33/logs/stdio

FAIL: Builtins-i386-linux :: ffssi2_test.c (2860 of 8387)
******************** TEST 'Builtins-i386-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m32 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-i386.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/I386LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.
FAIL: Builtins-x86_64-linux :: ffssi2_test.c (3084 of 8387)
******************** TEST 'Builtins-x86_64-linux :: ffssi2_test.c' FAILED ********************
Script:
--
: 'RUN: at line 1';       /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang   -gline-tables-only  -m64 -DCOMPILER_RT_HAS_FLOAT16  -fno-builtin -I /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/builtins -nodefaultlibs /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/builtins/Unit/ffssi2_test.c /b/sanitizer-x86_64-linux/build/compiler_rt_build/lib/linux/libclang_rt.builtins-x86_64.a -lc -lm -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp &&  /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/builtins/Unit/X86_64LinuxConfig/Output/ffssi2_test.c.tmp
--
Exit Code: 1
Command Output (stdout):
--
error in __ffssi2(0x0) = 33, expected 0
--
********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..

Random guess, we forgot to take into account that tzcnt and bsf don't set the flags the same way. If the flags from the bsf are being used we can't put an f3 in front of it.

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

fmayer added a comment.Aug 3 2022, 2:50 PM

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

Thanks! I can also verify that a revert actually fixes this, but judging from the Changes in the first build that introduced the failure, I think it's very likely that it was this.

craig.topper reopened this revision.Aug 3 2022, 2:52 PM
This revision is now accepted and ready to land.Aug 3 2022, 2:52 PM

I'll revert, I'll have to do a fixup for a dependent patch that I committed after this.

Thanks! I can also verify that a revert actually fixes this, but judging from the Changes in the first build that introduced the failure, I think it's very likely that it was this.

It was definitely this. The ffs in the name of the failing function is 'find first set' bit. That's what the bsf and tzcnt instructions do.

craig.topper added inline comments.Aug 3 2022, 2:57 PM
llvm/test/CodeGen/X86/dagcombine-select.ll
310 ↗(On Diff #449601)

This is incorrect. If the instruction is detected as tzcnt, the cmov would need to be cmovaeq not cmovneq. Since we can't know at compile time which CPU we'll be running on. We can't do this transform if the flags are used.

craig.topper requested changes to this revision.Aug 3 2022, 2:57 PM
This revision now requires changes to proceed.Aug 3 2022, 2:57 PM
pengfei updated this revision to Diff 449881.Aug 3 2022, 11:10 PM

Make sure EFLAGS is not used.

Didn't notice the differences on the EFLAG. Thanks @fmayer and @craig.topper!

craig.topper accepted this revision.Aug 4 2022, 9:38 AM

LGTM with the comment typo fixed

llvm/lib/Target/X86/X86MCInstLower.cpp
989 ↗(On Diff #449881)

latter -> later

This revision is now accepted and ready to land.Aug 4 2022, 9:38 AM
This revision was landed with ongoing or failed builds.Aug 4 2022, 7:26 PM
This revision was automatically updated to reflect the committed changes.