This is an archive of the discontinued LLVM Phabricator instance.

[X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count
ClosedPublic

Authored by goldstein.w.n on Dec 14 2022, 8:11 PM.

Details

Summary

Shift count is masked by hardware so these peepholes just extend
common patterns for NOT to the lower bits of shift count.

As well (32/64 ^ X) is masked off by the shift so can be safely
ignored.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Dec 14 2022, 8:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2022, 8:11 PM
goldstein.w.n requested review of this revision.Dec 14 2022, 8:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2022, 8:11 PM

Fix broken test

Pre-commit the test case and rebase to show the diff.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4009–4011

The comments inside a different block look missleading. Should be moved inside the condition.

4015

Remove it. This cannot be false according to the if condition.

llvm/test/CodeGen/X86/not-shift.ll
2

What's tbm used for. Seems it doesn't affect the result?

Move comment, remove TBM, remove unneeded assert

goldstein.w.n marked 3 inline comments as done.Dec 19 2022, 9:07 AM
goldstein.w.n added inline comments.
llvm/test/CodeGen/X86/not-shift.ll
2

Removed in V3. Was unneeded. Had copied the command from another test.

goldstein.w.n marked an inline comment as done.Dec 19 2022, 9:09 AM

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Add commit with not-shift.ll and no peephole.

Add missing earlier commit (I think)

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087

goldstein.w.n added a comment.EditedDec 19 2022, 6:42 PM

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087

I see. I don't have commit access. Here is the revision with just the test:
https://reviews.llvm.org/D140362

Once that goes up I'll rebase this change and update?

@pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(other pattern with bt{s|r|c} could also be improved)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole s.t addr is a function of the inputs and not just one of the inputs.

Any chance you could direct me as where I should look at add this type of
peephole?

craig.topper added a subscriber: craig.topper.EditedDec 20 2022, 10:23 AM

@pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(other pattern with bt{s|r|c} could also be improved)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole s.t addr is a function of the inputs and not just one of the inputs.

Any chance you could direct me as where I should look at add this type of
peephole?

bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

goldstein.w.n added a comment.EditedDec 20 2022, 11:05 AM

@pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(other pattern with bt{s|r|c} could also be improved)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole s.t addr is a function of the inputs and not just one of the inputs.

Any chance you could direct me as where I should look at add this type of
peephole?

bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

I think that translates to worse throughput (so worse in a tight loop iff no carried
dependency (better latency so if carried dependency still preferable)) but outside
of that once case have to imagine its a win.

  1. Better latency.
  2. Less register pressure
  3. Less code size.
  4. Less Backend resources(unless this is some bizarre program thats retirement bound)

on ICX:
Loop using shlx method with hoisted movl $1, %gpr. 1,000,000 iterations (with a decl; jne for loop impl)

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

Same loop using btr

2,055,213      port0                                                          
1,298,859      port1                                                          
1,000,372      port23                                                         
1,505,077      port5                                                          
3,261,176      port6                                                          
1,088,049      port49

The loop:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$123, %ecx
	leaq	(buf_start)(%rip), %rdi

	movl	$1000000, %edx

loop:
#if 0
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

@pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(other pattern with bt{s|r|c} could also be improved)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole s.t addr is a function of the inputs and not just one of the inputs.

Any chance you could direct me as where I should look at add this type of
peephole?

bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

I think that translates to worse throughput (so worse in a tight loop iff no carried
dependency (better latency so if carried dependency still preferable)) but outside
of that once case have to imagine its a win.

  1. Better latency.
  2. Less register pressure
  3. Less code size.
  4. Less Backend resources(unless this is some bizarre program thats retirement bound)

on ICX:
Loop using shlx method with hoisted movl $1, %gpr. 1,000,000 iterations (with a decl; jne for loop impl)

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

Same loop using btr

2,055,213      port0                                                          
1,298,859      port1                                                          
1,000,372      port23                                                         
1,505,077      port5                                                          
3,261,176      port6                                                          
1,088,049      port49

The loop:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$123, %ecx
	leaq	(buf_start)(%rip), %rdi

	movl	$1000000, %edx

loop:
#if 0
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

craig.topper added a comment.EditedDec 20 2022, 11:31 AM
 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
bnz is 1uop and could possibly be macrofused with the dec.

so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

Also note that microcoded instructions (with more than 2 uops),
at least on older AMD CPU's, used to significantly affect decoder throughput,
which was shared by threads of a core. I'm also not quite sure bts would be a win.

@pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(other pattern with bt{s|r|c} could also be improved)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole s.t addr is a function of the inputs and not just one of the inputs.

Any chance you could direct me as where I should look at add this type of
peephole?

bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

I think that translates to worse throughput (so worse in a tight loop iff no carried
dependency (better latency so if carried dependency still preferable)) but outside
of that once case have to imagine its a win.

  1. Better latency.
  2. Less register pressure
  3. Less code size.
  4. Less Backend resources(unless this is some bizarre program thats retirement bound)

on ICX:
Loop using shlx method with hoisted movl $1, %gpr. 1,000,000 iterations (with a decl; jne for loop impl)

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

Same loop using btr

2,055,213      port0                                                          
1,298,859      port1                                                          
1,000,372      port23                                                         
1,505,077      port5                                                          
3,261,176      port6                                                          
1,088,049      port49

The loop:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$123, %ecx
	leaq	(buf_start)(%rip), %rdi

	movl	$1000000, %edx

loop:
#if 0
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

No (was suprised too! thats why I added the code), although looking a bit more
into it I think the benchmark probably isn't so good and is misleading.

if you add some arbitrary padding the perf numbers changed (and the p49 for the shlx version
goes down 1/cycle as expected).

If I had to guess something is going awry with uop replay.

Changing the benchmark:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$128, %ecx
	leaq	(buf_start)(%rip), %rdi
	xorl	%ebp, %ebp
	movl	$1000000, %edx

loop:
#if 1
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop
	nop

	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

The shlx version performs much better.

1,224,172      p0                                                          
  272,089      p1                                                          
1,000,431      p23                                                         
  527,889      p5                                                          
2,210,403      p6                                                          
1,000,316      p78                                                         
1,217,652      p49

Versus the btr version:

2,300,942      p0                                                          
1,001,087      p1                                                          
1,000,150      p23                                                         
1,000,933      p5                                                          
4,001,759      p6                                                          
1,000,100      p78                                                         
1,300,149      p49

Which for some reason ends up bottenecking on
p6 uops although should be able to schedule
on p0 too.

Think I was wrong.

goldstein.w.n added a comment.EditedDec 20 2022, 11:42 AM
 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
bnz is 1uop and could possibly be macrofused with the dec.

so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

See my other comment, but I think the benchmark was misleading and the numbers
where dramatically skewed by uop replay (maybe something else, but thats the only
thing I can think of at the moment).

Think you and @lebedev.ri are right and unless there is intense register pressure
or -Os it doesn't win out.

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
bnz is 1uop and could possibly be macrofused with the dec.

so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

See my other comment, but I think the benchmark was misleading and the numbers
where dramatically skewed by uop replay (maybe something else, but thats the only
thing I can think of at the moment).

Think you and @lebedev.ri are right and unless there is intense register pressure
or -Os it doesn't win out.

If its all the same, would still like guidance about how to implement it. Think it may
be useful at the very least for AtomicExpansionKind::BitTestIntrinsic.

Rebased after tests landed in master

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087

Saw that the not-shift.ll test was pushed. Have rebased this PR.

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
bnz is 1uop and could possibly be macrofused with the dec.

so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

See my other comment, but I think the benchmark was misleading and the numbers
where dramatically skewed by uop replay (maybe something else, but thats the only
thing I can think of at the moment).

Think you and @lebedev.ri are right and unless there is intense register pressure
or -Os it doesn't win out.

If its all the same, would still like guidance about how to implement it. Think it may
be useful at the very least for AtomicExpansionKind::BitTestIntrinsic.

AtomicExpansionKind::BitTestIntrinsic is different, it is intended to replace the expensive lock cmpxchg. It may not be beneficial in non atomic cases.

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
bnz is 1uop and could possibly be macrofused with the dec.

so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

See my other comment, but I think the benchmark was misleading and the numbers
where dramatically skewed by uop replay (maybe something else, but thats the only
thing I can think of at the moment).

Think you and @lebedev.ri are right and unless there is intense register pressure
or -Os it doesn't win out.

If its all the same, would still like guidance about how to implement it. Think it may
be useful at the very least for AtomicExpansionKind::BitTestIntrinsic.

AtomicExpansionKind::BitTestIntrinsic is different, it is intended to replace the expensive lock cmpxchg. It may not be beneficial in non atomic cases.

Yeah, I realized I needed new definitions for int_x86_atomic_bts that take gpr arguments.

Can you please add some alive2 proof links into the patch description?

Can you please add some alive2 proof links into the patch description?

$> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/not-shift.ll
-- Testing: 1 tests, 1 workers --
PASS: LLVM :: CodeGen/X86/not-shift.ll (1 of 1)

Testing Time: 0.51s
  Passed: 1
$> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/legalize-shift-64.ll
-- Testing: 1 tests, 1 workers --
PASS: LLVM :: CodeGen/X86/legalize-shift-64.ll (1 of 1)

Testing Time: 0.08s
  Passed: 1

Not sure if you meant something else (new the the tool sorry), if so let me know.

Note the AMX tests fail with alive2. I checked and they fail before the patch
was added as well and the patch doesn't change any of codegen in those tests
so have to imagine its unrelated.

lebedev.ri requested changes to this revision.Dec 29 2022, 5:33 AM

Alive2 doesn't deal with assembly/dag, only ir.
I'm asking you to write the proofs for these changes via an IR tests,
explicitly modelling the implicit modulo of the shift amount, which isn't a thing in IR.

This revision now requires changes to proceed.Dec 29 2022, 5:33 AM

Alive2 doesn't deal with assembly/dag, only ir.
I'm asking you to write the proofs for these changes via an IR tests,
explicitly modelling the implicit modulo of the shift amount, which isn't a thing in IR.

I see. Sorry about that. How about the following:

in.ll:

define i32 @foo32_(i32 %val, i32 %cnt) {
  %adjcnt = sub i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

define i32 @foo32_2(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

    
define i32 @foo32_3(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 32, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

define i64 @foo64_(i64 %val, i64 %cnt) {
  %adjcnt = sub i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

define i64 @foo64_2(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

    
define i64 @foo64_3(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 64, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

out.ll

define i32 @foo32_(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 -1, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}

define i32 @foo32_2(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 -1, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}

define i32 @foo32_3(i32 %val, i32 %cnt) {
  %shiftcnt = and i32 31, %cnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}


define i64 @foo64_(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

define i64 @foo64_2(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

define i64 @foo64_3(i64 %val, i64 %cnt) {
  %shiftcnt = and i64 63, %cnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

Running:

$> /home/noah/programs/opensource/llvm-dev/src/alive2/build/alive-tv in.ll out.ll 
----------------------------------------
define i32 @foo32_(i32 %val, i32 %cnt) {
%0:
  %adjcnt = sub i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 4294967295, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i32 @foo32_2(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_2(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 4294967295, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i32 @foo32_3(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 32, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_3(i32 %val, i32 %cnt) {
%0:
  %shiftcnt = and i32 31, %cnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_(i64 %val, i64 %cnt) {
%0:
  %adjcnt = sub i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_2(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_2(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_3(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 64, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_3(i64 %val, i64 %cnt) {
%0:
  %shiftcnt = and i64 63, %cnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!

Summary:
  6 correct transformations
  0 incorrect transformations
  0 failed-to-prove transformations
  0 Alive2 errors
goldstein.w.n added inline comments.Dec 29 2022, 9:11 AM
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4009–4010

Ah, good catch.

Fixing and will add test case in not-shift.ll.

Sorry for missing that.

goldstein.w.n marked an inline comment as done.

Fix bug + tests

goldstein.w.n marked an inline comment as done.Dec 29 2022, 9:39 AM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4009–4010

Fixed. The Guard the condition against add and only transform sub if its (Size - 1) - X, not X - (Size - 1)

Added some tests for it.

lebedev.ri accepted this revision.Dec 29 2022, 11:46 AM

Please be sure to precommit the test changes before committing the change itself.
Looks good to me now. Thanks!

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4023–4025

Since we are already here, why not just do it ourselves?
That would be less LOC even.

4027
4028

Remove newline

llvm/test/CodeGen/X86/not-shift.ll
2–11

Run lines are still wrong, please deduplicate them.
There should be only 4 i think?

This revision is now accepted and ready to land.Dec 29 2022, 11:46 AM

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a seperate commit?

Looks good to me now. Thanks!

Fix some nits

goldstein.w.n marked 2 inline comments as done.Dec 29 2022, 2:08 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4023–4025

Done.

4028

Done.

goldstein.w.n marked 2 inline comments as done.Jan 5 2023, 7:28 AM

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a seperate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

Looks good to me now. Thanks!

Propegate test changes

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a seperate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

Done: https://reviews.llvm.org/D141076

Sorry for long delay, I was sitting on my last round of comments for a week but forgot to hit submit!

Looks good to me now. Thanks!

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a seperate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

The test dependency have landed https://github.com/llvm/llvm-project/commit/a698790c51ec2804c3a7ba4c59438e7816690ea2
is this good to go?

Looks good to me now. Thanks!

pengfei accepted this revision.Jan 7 2023, 6:03 AM

If you need someone to commit this for you, please ask for that directly. (i'll commit in +~12h)

This revision was landed with ongoing or failed builds.Jan 12 2023, 8:54 PM
This revision was automatically updated to reflect the committed changes.