Shift count is masked by hardware so these peepholes just extend
common patterns for NOT to the lower bits of shift count.
As well (32/64 ^ X) is masked off by the shift so can be safely
ignored.
Differential D140087
[X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count goldstein.w.n on Dec 14 2022, 8:11 PM. Authored by
Details Shift count is masked by hardware so these peepholes just extend As well (32/64 ^ X) is masked off by the shift so can be safely
Diff Detail
Event TimelineComment Actions Pre-commit the test case and rebase to show the diff.
Comment Actions Assume this is what you mean so made the revision two commits. Accidentally made two new revisions in the process (D140314/D140316). Comment Actions I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087 Comment Actions I see. I don't have commit access. Here is the revision with just the test: Once that goes up I'll rebase this change and update? Comment Actions @pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is. I was looking to add a peephole to change something like: ptr[x / 32] |= (1 << (x % 32)) Currently codegen is something like: mov $0x1,%gpr1 shlx %cnt,%gpr1,%mask shr $0x5,%cnt or %mask, (%ptr, %cnt, 4) And it could be as simple as: bts %cnt, (%ptr) (other pattern with bt{s|r|c} could also be improved) I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend Any chance you could direct me as where I should look at add this type of Comment Actions bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code. Comment Actions I think that translates to worse throughput (so worse in a tight loop iff no carried
on ICX: 3,782,331 port0 3,207,023 port1 1,001,220 port23 3,216,022 port5 4,940,975 port6 11,575,101 port49 Same loop using btr 2,055,213 port0 1,298,859 port1 1,000,372 port23 1,505,077 port5 3,261,176 port6 1,088,049 port49 The loop: .global _start .p2align 6 .text _start: movl $1, %eax movl $123, %ecx leaq (buf_start)(%rip), %rdi movl $1000000, %edx loop: #if 0 btr %rcx, (%rdi) #else shlx %ecx, %eax, %ebx movl %ecx, %esi shr $5, %esi andl %ebx, (%rdi, %rsi, 4) #endif decl %edx jnz loop movl $60, %eax xorl %edi, %edi syscall .section .data .balign 4096 buf_start: .space 4096 buf_end: Comment Actions Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version. Comment Actions
I'm having trouble accounting for these numbers. As far as I know so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing? You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops. Comment Actions Also note that microcoded instructions (with more than 2 uops), Comment Actions No (was suprised too! thats why I added the code), although looking a bit more if you add some arbitrary padding the perf numbers changed (and the p49 for the shlx version If I had to guess something is going awry with uop replay. Changing the benchmark: .global _start .p2align 6 .text _start: movl $1, %eax movl $128, %ecx leaq (buf_start)(%rip), %rdi xorl %ebp, %ebp movl $1000000, %edx loop: #if 1 btr %rcx, (%rdi) #else shlx %ecx, %eax, %ebx movl %ecx, %esi shr $5, %esi andl %ebx, (%rdi, %rsi, 4) #endif nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop nop decl %edx jnz loop movl $60, %eax xorl %edi, %edi syscall .section .data .balign 4096 buf_start: .space 4096 buf_end: The shlx version performs much better. 1,224,172 p0 272,089 p1 1,000,431 p23 527,889 p5 2,210,403 p6 1,000,316 p78 1,217,652 p49 Versus the btr version: 2,300,942 p0 1,001,087 p1 1,000,150 p23 1,000,933 p5 4,001,759 p6 1,000,100 p78 1,300,149 p49 Which for some reason ends up bottenecking on Think I was wrong. Comment Actions See my other comment, but I think the benchmark was misleading and the numbers Think you and @lebedev.ri are right and unless there is intense register pressure Comment Actions If its all the same, would still like guidance about how to implement it. Think it may Comment Actions AtomicExpansionKind::BitTestIntrinsic is different, it is intended to replace the expensive lock cmpxchg. It may not be beneficial in non atomic cases. Comment Actions Yeah, I realized I needed new definitions for int_x86_atomic_bts that take gpr arguments. Comment Actions $> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/not-shift.ll -- Testing: 1 tests, 1 workers -- PASS: LLVM :: CodeGen/X86/not-shift.ll (1 of 1) Testing Time: 0.51s Passed: 1 $> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/legalize-shift-64.ll -- Testing: 1 tests, 1 workers -- PASS: LLVM :: CodeGen/X86/legalize-shift-64.ll (1 of 1) Testing Time: 0.08s Passed: 1 Not sure if you meant something else (new the the tool sorry), if so let me know. Note the AMX tests fail with alive2. I checked and they fail before the patch Comment Actions Alive2 doesn't deal with assembly/dag, only ir. Comment Actions I see. Sorry about that. How about the following: in.ll: define i32 @foo32_(i32 %val, i32 %cnt) { %adjcnt = sub i32 31, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } define i32 @foo32_2(i32 %val, i32 %cnt) { %adjcnt = xor i32 31, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } define i32 @foo32_3(i32 %val, i32 %cnt) { %adjcnt = xor i32 32, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } define i64 @foo64_(i64 %val, i64 %cnt) { %adjcnt = sub i64 63, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } define i64 @foo64_2(i64 %val, i64 %cnt) { %adjcnt = xor i64 63, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } define i64 @foo64_3(i64 %val, i64 %cnt) { %adjcnt = xor i64 64, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } out.ll define i32 @foo32_(i32 %val, i32 %cnt) { %adjcnt = xor i32 -1, %cnt %shiftcnt = and i32 31, %adjcnt %result = shl i32 %val, %shiftcnt ret i32 %result } define i32 @foo32_2(i32 %val, i32 %cnt) { %adjcnt = xor i32 -1, %cnt %shiftcnt = and i32 31, %adjcnt %result = shl i32 %val, %shiftcnt ret i32 %result } define i32 @foo32_3(i32 %val, i32 %cnt) { %shiftcnt = and i32 31, %cnt %result = shl i32 %val, %shiftcnt ret i32 %result } define i64 @foo64_(i64 %val, i64 %cnt) { %adjcnt = xor i64 -1, %cnt %shiftcnt = and i64 63, %adjcnt %result = shl i64 %val, %shiftcnt ret i64 %result } define i64 @foo64_2(i64 %val, i64 %cnt) { %adjcnt = xor i64 -1, %cnt %shiftcnt = and i64 63, %adjcnt %result = shl i64 %val, %shiftcnt ret i64 %result } define i64 @foo64_3(i64 %val, i64 %cnt) { %shiftcnt = and i64 63, %cnt %result = shl i64 %val, %shiftcnt ret i64 %result } Running: $> /home/noah/programs/opensource/llvm-dev/src/alive2/build/alive-tv in.ll out.ll ---------------------------------------- define i32 @foo32_(i32 %val, i32 %cnt) { %0: %adjcnt = sub i32 31, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } => define i32 @foo32_(i32 %val, i32 %cnt) { %0: %adjcnt = xor i32 4294967295, %cnt %shiftcnt = and i32 31, %adjcnt %result = shl i32 %val, %shiftcnt ret i32 %result } Transformation seems to be correct! ---------------------------------------- define i32 @foo32_2(i32 %val, i32 %cnt) { %0: %adjcnt = xor i32 31, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } => define i32 @foo32_2(i32 %val, i32 %cnt) { %0: %adjcnt = xor i32 4294967295, %cnt %shiftcnt = and i32 31, %adjcnt %result = shl i32 %val, %shiftcnt ret i32 %result } Transformation seems to be correct! ---------------------------------------- define i32 @foo32_3(i32 %val, i32 %cnt) { %0: %adjcnt = xor i32 32, %cnt %result = shl i32 %val, %adjcnt ret i32 %result } => define i32 @foo32_3(i32 %val, i32 %cnt) { %0: %shiftcnt = and i32 31, %cnt %result = shl i32 %val, %shiftcnt ret i32 %result } Transformation seems to be correct! ---------------------------------------- define i64 @foo64_(i64 %val, i64 %cnt) { %0: %adjcnt = sub i64 63, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } => define i64 @foo64_(i64 %val, i64 %cnt) { %0: %adjcnt = xor i64 -1, %cnt %shiftcnt = and i64 63, %adjcnt %result = shl i64 %val, %shiftcnt ret i64 %result } Transformation seems to be correct! ---------------------------------------- define i64 @foo64_2(i64 %val, i64 %cnt) { %0: %adjcnt = xor i64 63, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } => define i64 @foo64_2(i64 %val, i64 %cnt) { %0: %adjcnt = xor i64 -1, %cnt %shiftcnt = and i64 63, %adjcnt %result = shl i64 %val, %shiftcnt ret i64 %result } Transformation seems to be correct! ---------------------------------------- define i64 @foo64_3(i64 %val, i64 %cnt) { %0: %adjcnt = xor i64 64, %cnt %result = shl i64 %val, %adjcnt ret i64 %result } => define i64 @foo64_3(i64 %val, i64 %cnt) { %0: %shiftcnt = and i64 63, %cnt %result = shl i64 %val, %shiftcnt ret i64 %result } Transformation seems to be correct! Summary: 6 correct transformations 0 incorrect transformations 0 failed-to-prove transformations 0 Alive2 errors
Comment Actions Please be sure to precommit the test changes before committing the change itself.
Comment Actions You want the new tests invalid_add31 / invalid_sub31 split into a seperate commit?
Comment Actions In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
Comment Actions Done: https://reviews.llvm.org/D141076 Sorry for long delay, I was sitting on my last round of comments for a week but forgot to hit submit!
Comment Actions The test dependency have landed https://github.com/llvm/llvm-project/commit/a698790c51ec2804c3a7ba4c59438e7816690ea2
Comment Actions If you need someone to commit this for you, please ask for that directly. (i'll commit in +~12h) |
add https://alive2.llvm.org/ce/z/gFh16W
sub of constant https://alive2.llvm.org/ce/z/umLun5
xor https://alive2.llvm.org/ce/z/ewQ7Bd