This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
llvm/test/CodeGen/X86/legalize-shift-64.ll
llvm/test/CodeGen/X86/not-shift.ll

Differential D140087

[X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count
ClosedPublic

Authored by goldstein.w.n on Dec 14 2022, 8:11 PM.

Details

Reviewers

pengfei
RKSimon
lebedev.ri

Commits

rG4916523053d7: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing…

Summary

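x86 shifts mask the count in hardware (to the low 5 bits for 32-bit shifts
and the low 6 bits for 64-bit shifts), so these peepholes extend the common
NOT patterns to the low bits of the shift count: (31/63 - X) and (31/63 ^ X)
match (NOT X) in every bit the shift actually reads.

Likewise, (32/64 ^ X) only flips a bit that the shift masks off, so the XOR
can be safely ignored.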
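
A minimal sketch of why these rewrites are safe, assuming the hardware reads only the low 5 (32-bit) or 6 (64-bit) bits of the count. The helper names `maskCount` and `peepholesAgree` are hypothetical and not part of the patch, and the loop brute-forces small count values rather than proving the identities.

```cpp
#include <cassert>
#include <cstdint>

// Model of the count masking: a 32-bit shift keeps the low 5 bits of the
// count, a 64-bit shift keeps the low 6 bits.
inline uint32_t maskCount(uint32_t count, unsigned bits) {
  assert(bits == 32 || bits == 64);
  return count & (bits - 1);
}

// Returns true if, for every count in [0, 256), the patterns from the
// summary agree once the count is masked.
inline bool peepholesAgree(unsigned bits) {
  const uint32_t bias = bits - 1; // 31 or 63
  for (uint32_t x = 0; x < 256; ++x) {
    const uint32_t notX = maskCount(~x, bits);
    if (maskCount(bias - x, bits) != notX)
      return false; // (31/63 - X) versus (NOT X)
    if (maskCount(bias ^ x, bits) != notX)
      return false; // (31/63 ^ X) versus (NOT X)
    if (maskCount(bits ^ x, bits) != maskCount(x, bits))
      return false; // (32/64 ^ X) versus X
  }
  return true;
}
```

Calling `peepholesAgree(32)` and `peepholesAgree(64)` covers the two operand sizes named in the title.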

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
60,070 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

goldstein.w.n created this revision. Dec 14 2022, 8:11 PM

Herald added a project: Restricted Project. Dec 14 2022, 8:11 PM

Herald added subscribers: pengfei, hiraditya.

goldstein.w.n requested review of this revision. Dec 14 2022, 8:11 PM

Herald added a subscriber: llvm-commits.

goldstein.w.n added reviewers: pengfei, RKSimon. Dec 14 2022, 8:14 PM

Fix broken test

Harbormaster completed remote builds in B203277: Diff 483077. Dec 14 2022, 10:14 PM

Pre-commit the test case and rebase to show the diff.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4005–4007	The comments inside a different block look misleading. They should be moved inside the condition.
4011	Remove it. This cannot be false according to the if condition.
llvm/test/CodeGen/X86/not-shift.ll
2	What is `tbm` used for? It does not seem to affect the result.

Move comment, remove TBM, remove unneeded assert

goldstein.w.n marked 3 inline comments as done. Dec 19 2022, 9:07 AM

goldstein.w.n added inline comments.

llvm/test/CodeGen/X86/not-shift.ll
2	Removed in V3. Was unneeded. Had copied the command from another test.

In D140087#3997239, @pengfei wrote:

Pre-commit the test case and rebase to show the diff.

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Add commit with not-shift.ll and no peephole.

Harbormaster completed remote builds in B203936: Diff 483985. Dec 19 2022, 9:20 AM

Add missing earlier commit (I think)

In D140087#4005193, @goldstein.w.n wrote:

What do you mean? Add a commit of not-shift.ll
w.o the peephole added?

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

Harbormaster completed remote builds in B203939: Diff 483988. Dec 19 2022, 10:44 AM

In D140087#4005300, @goldstein.w.n wrote:

Assume this is what you mean so made the revision two commits.
Think I must have messed something up though. Can you let me
know if this is what you mean and if not what I should do?

Accidentally made two new revisions in the process (D140314/D140316).
I didn't see a way to delete them so I changed their visibility to no-one.
Sorry for the spam.

I mean commit the test case to LLVM trunk directly if you have permission. Otherwise, you can put the test in a separate review and update this one. Tip: you can update this one with your new local commit provided you preserve the line Differential Revision: https://reviews.llvm.org/D140087 in the commit message.

In D140087#4006734, @pengfei wrote:

I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087

I see. I don't have commit access. Here is the revision with just the test:
https://reviews.llvm.org/D140362

Once that goes up I'll rebase this change and update?

@pengfei This is somewhat unrelated, so if this is not the right place to ask, can you let me know where is?

I was looking to add a peephole to change something like:

ptr[x / 32] |= (1 << (x % 32))

Currently codegen is something like:

mov    $0x1,%gpr1
shlx   %cnt,%gpr1,%mask
shr    $0x5,%cnt
or  %mask, (%ptr, %cnt, 4)

And it could be as simple as:

bts %cnt, (%ptr)

(Other patterns with bt{s|r|c} could also be improved.)

I saw one_bit_patterns in X86InstrCompiler but don't see a way to extend
the peephole such that the address is a function of the inputs and not just one of the inputs.

Any chance you could direct me to where I should look to add this type of
peephole?
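
A rough C++ model of the two forms discussed above, to show that they address the same bit for non-negative offsets. `setBitWordwise` is the source pattern; `setBitInBitString` is a hypothetical byte-level stand-in for the memory-operand `bts`, which treats memory as one long bit string. The sketch assumes a little-endian host and ignores that the register form of `bts` takes a signed bit offset.

```cpp
#include <cassert>
#include <cstdint>

// The source-level pattern: pick the 32-bit word, then OR in a one-bit mask.
inline void setBitWordwise(uint32_t *ptr, uint32_t x) {
  assert(ptr != nullptr);
  ptr[x / 32] |= (uint32_t{1} << (x % 32));
}

// Byte-granular model of a bit-string store: bit x lives in byte x / 8 at
// position x % 8. On a little-endian host this is the same bit that the
// word-wise form sets.
inline void setBitInBitString(void *ptr, uint32_t x) {
  assert(ptr != nullptr);
  auto *bytes = static_cast<unsigned char *>(ptr);
  bytes[x / 8] |= static_cast<unsigned char>(1u << (x % 8));
}
```

The equivalence only says the two forms compute the same result; it says nothing about which one is faster.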

In D140087#4008447, @goldstein.w.n wrote:
And it could be as simple as:
bts %cnt, (%ptr)

bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

In D140087#4008536, @craig.topper wrote:
bts %cnt, (%ptr) is a 10 or 11 uop instruction. It might not be better than current code.

I think that translates to worse throughput, so it is worse in a tight loop if there is
no carried dependency (with a carried dependency the better latency still makes it
preferable), but outside of that one case I have to imagine it's a win:

Better latency.
Less register pressure.
Smaller code size.
Fewer backend resources (unless this is some bizarre program that's retirement bound).

On ICX:
Loop using the shlx method with hoisted movl $1, %gpr. 1,000,000 iterations (with a decl; jne for the loop implementation):

 3,782,331      port0
 3,207,023      port1
 1,001,220      port23
 3,216,022      port5
 4,940,975      port6
11,575,101      port49

Same loop using btr:

2,055,213      port0
1,298,859      port1
1,000,372      port23
1,505,077      port5
3,261,176      port6
1,088,049      port49

The loop:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$123, %ecx
	leaq	(buf_start)(%rip), %rdi

	movl	$1000000, %edx

loop:
#if 0
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

In D140087#4008678, @goldstein.w.n wrote:
on ICX:
Loop using shlx method with hoisted movl $1, %gpr. 1,000,000 iterations (with a decl; jne for loop impl)
 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

 3,782,331      port0                                                          
 3,207,023      port1                                                          
 1,001,220      port23                                                         
 3,216,022      port5                                                          
 4,940,975      port6                                                          
11,575,101      port49

I'm having trouble accounting for these numbers. As far as I know:
shlx is 1 uop
mov is 1 uop
shr is 1 uop
and with load+store is 4 uops
dec is 1 uop
jnz is 1 uop and could possibly be macrofused with the dec.

So that's 9 uops per iteration, or maybe 8 with macrofusion. What am I missing?

You're also missing counts on ports 7 and 8, which is where the store AGU uops should go. Ports 4 and 9 would be the store data uops.
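
The tally above can be written out as a small sketch and set against the reported counters. The function names are hypothetical, and the comparison assumes each counter counts uops dispatched to that port group over the 1,000,000 iterations.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Expected uops per iteration of the shlx loop, following the tally above.
inline int expectedUopsPerIteration(bool macroFused) {
  const int shlx = 1, mov = 1, shr = 1, andLoadStore = 4, dec = 1, jnz = 1;
  const int total = shlx + mov + shr + andLoadStore + dec + jnz;
  return macroFused ? total - 1 : total; // dec + jnz may fuse into one uop
}

// Sum of the per-port counters reported for the shlx run.
inline int64_t reportedShlxPortUops() {
  const std::vector<int64_t> ports = {3782331, 3207023, 1001220,
                                      3216022, 4940975, 11575101};
  return std::accumulate(ports.begin(), ports.end(), int64_t{0});
}

// Average number of port uops per loop iteration.
inline double uopsPerIteration(int64_t totalUops, int64_t iterations) {
  assert(iterations > 0);
  return static_cast<double>(totalUops) / static_cast<double>(iterations);
}
```

With the figures quoted above the reported total is 27,722,672, so the listed ports alone average well above the expected 8 or 9 uops per iteration, which is the gap being asked about.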

Also note that microcoded instructions (with more than 2 uops),
at least on older AMD CPUs, used to significantly affect decoder throughput,
which was shared by the threads of a core. I'm also not quite sure bts would be a win.

In D140087#4008693, @craig.topper wrote:
Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

No (I was surprised too, that's why I added the code), although looking into it
a bit more I think the benchmark probably isn't very good and is misleading.

If you add some arbitrary padding the perf numbers change (and the p49 count for the shlx version
goes down to 1/cycle as expected).

If I had to guess, something is going awry with uop replay.

Changing the benchmark:

	.global	_start
	.p2align 6
	.text
_start:
	movl	$1, %eax
	movl	$128, %ecx
	leaq	(buf_start)(%rip), %rdi
	xorl	%ebp, %ebp
	movl	$1000000, %edx

loop:
#if 1
	btr	%rcx, (%rdi)
#else
	shlx	%ecx, %eax, %ebx
	movl	%ecx, %esi
	shr	$5, %esi
	andl	%ebx, (%rdi, %rsi, 4)
#endif
	.rept	20
	nop
	.endr

	decl	%edx
	jnz	loop

	movl	$60, %eax
	xorl	%edi, %edi
	syscall

	.section .data
	.balign	4096
buf_start:	.space 4096
buf_end:

The shlx version performs much better.

1,224,172      p0
  272,089      p1
1,000,431      p23
  527,889      p5
2,210,403      p6
1,000,316      p78
1,217,652      p49

Versus the btr version:

2,300,942      p0
1,001,087      p1
1,000,150      p23
1,000,933      p5
4,001,759      p6
1,000,100      p78
1,300,149      p49

Which for some reason ends up bottlenecking on
p6 uops, although they should be able to schedule
on p0 too.

I think I was wrong.
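
As a rough cross-check of the padded run, the per-port counters above can be summed per variant. This is only a sketch: the helper names are hypothetical, and it assumes each counter counts uops dispatched to that port group and that the nop padding does not occupy an execution port.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Adds up a set of per-port uop counters.
inline int64_t totalPortUops(std::initializer_list<int64_t> ports) {
  assert(ports.size() > 0);
  int64_t sum = 0;
  for (int64_t count : ports)
    sum += count;
  return sum;
}

// Counters for the padded shlx run: p0, p1, p23, p5, p6, p78, p49.
inline int64_t shlxPaddedTotal() {
  return totalPortUops(
      {1224172, 272089, 1000431, 527889, 2210403, 1000316, 1217652});
}

// Counters for the padded btr run, in the same port order.
inline int64_t btrPaddedTotal() {
  return totalPortUops(
      {2300942, 1001087, 1000150, 1000933, 4001759, 1000100, 1300149});
}
```

Over 1,000,000 iterations the totals come to 7,452,952 for shlx and 11,605,120 for btr, that is, roughly 7.5 versus 11.6 port uops per iteration including the loop branch.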

In D140087#4008747, @craig.topper wrote:
so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?

You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.

See my other comment, but I think the benchmark was misleading and the numbers
were dramatically skewed by uop replay (maybe something else, but that's the only
thing I can think of at the moment).

I think you and @lebedev.ri are right, and unless there is intense register pressure
or -Os it doesn't win out.

In D140087#4008802, @goldstein.w.n wrote:
See my other comment, but I think the benchmark was misleading and the numbers
where dramatically skewed by uop replay (maybe something else, but thats the only
thing I can think of at the moment).

Think you and @lebedev.ri are right and unless there is intense register pressure
or -Os it doesn't win out.

If it's all the same, I would still like guidance on how to implement it. I think it may
be useful at the very least for AtomicExpansionKind::BitTestIntrinsic.

Rebased after tests landed in master

In D140087#4006734, @pengfei wrote:

I mean commit the test case to llvm trunk directly if you have the permission. Otherwise, you can put the test in a separate review and update this one. Tips, you can update this one with your new local commit iff you preserve the message Differential Revision: https://reviews.llvm.org/D140087

Saw that the not-shift.ll test was pushed. Have rebased this PR.

Harbormaster completed remote builds in B204299: Diff 484448. Dec 20 2022, 7:50 PM

In D140087#4008910, @goldstein.w.n wrote:
If its all the same, would still like guidance about how to implement it. Think it may
be useful at the very least for AtomicExpansionKind::BitTestIntrinsic.

AtomicExpansionKind::BitTestIntrinsic is different: it is intended to replace the expensive lock cmpxchg. It may not be beneficial in non-atomic cases.
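
For context, a sketch of the kind of source pattern where an atomic bit-test instruction can stand in for a compare-exchange loop: an atomic read-modify-write whose old value is only tested against the same single-bit mask. The function name is hypothetical, and whether a particular compiler version actually selects `lock bts` for it is an assumption, not something this thread confirms.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically sets bit (bit % 32) of word and reports whether it was already
// set. Only the masked bit of the old value is inspected.
inline bool atomicTestAndSetBit(std::atomic<uint32_t> &word, unsigned bit) {
  const uint32_t mask = uint32_t{1} << (bit % 32);
  return (word.fetch_or(mask) & mask) != 0;
}

// Small usage check: the first call sets the bit, the second sees it set.
inline void demoAtomicTestAndSetBit() {
  std::atomic<uint32_t> word{0};
  assert(!atomicTestAndSetBit(word, 5));
  assert(atomicTestAndSetBit(word, 5));
}
```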

In D140087#4010506, @pengfei wrote:
AtomicExpansionKind::BitTestIntrinsic is different, it is intended to replace the expensive lock cmpxchg. It may not be beneficial in non atomic cases.

Yeah, I realized I needed new definitions for int_x86_atomic_bts that take gpr arguments.

ping.

Can you please add some alive2 proof links into the patch description?

In D140087#4018605, @lebedev.ri wrote:

Can you please add some alive2 proof links into the patch description?

$> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/not-shift.ll
-- Testing: 1 tests, 1 workers --
PASS: LLVM :: CodeGen/X86/not-shift.ll (1 of 1)

Testing Time: 0.51s
  Passed: 1
$> ./build/bin/llvm-lit -vv -Dopt=/home/noah/programs/opensource/llvm-dev/src/alive2/build/opt-alive.sh llvm/test/CodeGen/X86/legalize-shift-64.ll
-- Testing: 1 tests, 1 workers --
PASS: LLVM :: CodeGen/X86/legalize-shift-64.ll (1 of 1)

Testing Time: 0.08s
  Passed: 1

Not sure if you meant something else (I'm new to the tool, sorry); if so, let me know.

Note that the AMX tests fail with alive2. I checked, and they fail before the patch
was added as well, and the patch doesn't change any of the codegen in those tests,
so I have to imagine it's unrelated.

Alive2 doesn't deal with assembly/DAG, only IR.
I'm asking you to write the proofs for these changes via IR tests,
explicitly modelling the implicit modulo of the shift amount, which isn't a thing in IR.

This revision now requires changes to proceed. Dec 29 2022, 5:33 AM

In D140087#4019429, @lebedev.ri wrote:

Alive2 doesn't deal with assembly/dag, only ir.
I'm asking you to write the proofs for these changes via an IR tests,
explicitly modelling the implicit modulo of the shift amount, which isn't a thing in IR.

I see. Sorry about that. How about the following:

in.ll:

define i32 @foo32_(i32 %val, i32 %cnt) {
  %adjcnt = sub i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

define i32 @foo32_2(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

    
define i32 @foo32_3(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 32, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}

define i64 @foo64_(i64 %val, i64 %cnt) {
  %adjcnt = sub i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

define i64 @foo64_2(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

    
define i64 @foo64_3(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 64, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}

out.ll

define i32 @foo32_(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 -1, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}

define i32 @foo32_2(i32 %val, i32 %cnt) {
  %adjcnt = xor i32 -1, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}

define i32 @foo32_3(i32 %val, i32 %cnt) {
  %shiftcnt = and i32 31, %cnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}


define i64 @foo64_(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

define i64 @foo64_2(i64 %val, i64 %cnt) {
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

define i64 @foo64_3(i64 %val, i64 %cnt) {
  %shiftcnt = and i64 63, %cnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}

Running:

$> /home/noah/programs/opensource/llvm-dev/src/alive2/build/alive-tv in.ll out.ll 
----------------------------------------
define i32 @foo32_(i32 %val, i32 %cnt) {
%0:
  %adjcnt = sub i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 4294967295, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i32 @foo32_2(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 31, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_2(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 4294967295, %cnt
  %shiftcnt = and i32 31, %adjcnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i32 @foo32_3(i32 %val, i32 %cnt) {
%0:
  %adjcnt = xor i32 32, %cnt
  %result = shl i32 %val, %adjcnt
  ret i32 %result
}
=>
define i32 @foo32_3(i32 %val, i32 %cnt) {
%0:
  %shiftcnt = and i32 31, %cnt
  %result = shl i32 %val, %shiftcnt
  ret i32 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_(i64 %val, i64 %cnt) {
%0:
  %adjcnt = sub i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_2(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 63, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_2(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 -1, %cnt
  %shiftcnt = and i64 63, %adjcnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!


----------------------------------------
define i64 @foo64_3(i64 %val, i64 %cnt) {
%0:
  %adjcnt = xor i64 64, %cnt
  %result = shl i64 %val, %adjcnt
  ret i64 %result
}
=>
define i64 @foo64_3(i64 %val, i64 %cnt) {
%0:
  %shiftcnt = and i64 63, %cnt
  %result = shl i64 %val, %shiftcnt
  ret i64 %result
}
Transformation seems to be correct!

Summary:
  6 correct transformations
  0 incorrect transformations
  0 failed-to-prove transformations
  0 Alive2 errors

lebedev.ri added inline comments. Dec 29 2022, 8:40 AM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4000–4004	add: https://alive2.llvm.org/ce/z/gFh16W ; sub of constant: https://alive2.llvm.org/ce/z/umLun5 ; xor: https://alive2.llvm.org/ce/z/ewQ7Bd
4005–4006	xor: https://alive2.llvm.org/ce/z/k_3jok ; sub from bias: https://alive2.llvm.org/ce/z/-BBhDe ; BUT NOT sub of bias: https://alive2.llvm.org/ce/z/AY9Aa7 ; and not add: https://alive2.llvm.org/ce/z/yUyTC7

goldstein.w.n added inline comments. Dec 29 2022, 9:11 AM

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4005–4006	Ah, good catch. Fixing and will add test case in `not-shift.ll`. Sorry for missing that.

Fix bug + tests

goldstein.w.n marked an inline comment as done. Dec 29 2022, 9:39 AM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4005–4006	Fixed. Guard the condition against `add`, and only transform `sub` if it is `(Size - 1) - X`, not `X - (Size - 1)`. Added some tests for it.
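
A small brute-force sketch of why the guard is needed, assuming the shift reads only the low bits of the count. The enum and function names are hypothetical and not taken from the patch; the search returns the first count at which an expression stops matching NOT X after masking.

```cpp
#include <cassert>
#include <cstdint>

// The three count expressions discussed in the inline comments, where the
// bias is Size - 1 (31 or 63).
enum class CountExpr { BiasMinusX, XMinusBias, XPlusBias };

inline uint32_t evalCount(CountExpr kind, uint32_t x, uint32_t bias) {
  switch (kind) {
  case CountExpr::BiasMinusX:
    return bias - x;
  case CountExpr::XMinusBias:
    return x - bias;
  case CountExpr::XPlusBias:
    return x + bias;
  }
  return 0;
}

// Returns the first count in [0, 256) whose masked value differs from the
// masked NOT X, or -1 if the expression matches for every tested count.
inline int firstCounterexample(CountExpr kind, unsigned bits) {
  assert(bits == 32 || bits == 64);
  const uint32_t mask = bits - 1;
  for (uint32_t x = 0; x < 256; ++x)
    if ((evalCount(kind, x, mask) & mask) != (~x & mask))
      return static_cast<int>(x);
  return -1;
}
```

Only `(Size - 1) - X` survives the search; `X - (Size - 1)` already differs at X = 0 and `X + (Size - 1)` at X = 1, which matches the two cases the fix excludes.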

Harbormaster completed remote builds in B205165: Diff 485622. Dec 29 2022, 10:24 AM

Please be sure to precommit the test changes before committing the change itself.
Looks good to me now. Thanks!

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4019–4021	Since we are already here, why not just do it ourselves? That would be less LOC even.
4023
4024	Remove newline
llvm/test/CodeGen/X86/not-shift.ll
2–11	Run lines are still wrong; please deduplicate them. There should be only 4, I think?

This revision is now accepted and ready to land. Dec 29 2022, 11:46 AM

In D140087#4019617, @lebedev.ri wrote:

Please be sure to precommit the test changes before committing the change itself.

Do you want the new tests invalid_add31 / invalid_sub31 split into a separate commit?

Looks good to me now. Thanks!

Fix some nits

goldstein.w.n marked 2 inline comments as done.Dec 29 2022, 2:08 PM

goldstein.w.n added inline comments.

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
4019–4021	Done.
4024	Done.

Harbormaster completed remote builds in B205180: Diff 485644.Dec 29 2022, 2:52 PM

goldstein.w.n marked 2 inline comments as done.Jan 5 2023, 7:28 AM

In D140087#4019772, @goldstein.w.n wrote:

In D140087#4019617, @lebedev.ri wrote:

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a separate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

Looks good to me now. Thanks!

goldstein.w.n added a parent revision: D141076: [X86] Add additional tests to no-shift.ll.Jan 5 2023, 10:45 AM

Propagate test changes

In D140087#4028903, @lebedev.ri wrote:

In D140087#4019772, @goldstein.w.n wrote:

In D140087#4019617, @lebedev.ri wrote:

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a separate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

Done: https://reviews.llvm.org/D141076

Sorry for the long delay, I was sitting on my last round of comments for a week but forgot to hit submit!

Looks good to me now. Thanks!

Harbormaster completed remote builds in B205953: Diff 486630.Jan 5 2023, 1:04 PM

In D140087#4028903, @lebedev.ri wrote:

In D140087#4019772, @goldstein.w.n wrote:

In D140087#4019617, @lebedev.ri wrote:

Please be sure to precommit the test changes before committing the change itself.

You want the new tests invalid_add31 / invalid_sub31 split into a separate commit?

In general, when adding new tests, if the new tests do not crash the opt/llc before the change,
they should be committed first, so the change shows the diff of CHECK lines, not just the new CHECK lines.

The test dependency has landed: https://github.com/llvm/llvm-project/commit/a698790c51ec2804c3a7ba4c59438e7816690ea2
Is this good to go?

Looks good to me now. Thanks!

pengfei accepted this revision.Jan 7 2023, 6:03 AM

Rebase

Herald added a subscriber: StephenFan. · View Herald TranscriptJan 12 2023, 6:13 PM

If you need someone to commit this for you, please ask for that directly. (i'll commit in +~12h)

Harbormaster completed remote builds in B207530: Diff 488838.Jan 12 2023, 7:10 PM

This revision was landed with ongoing or failed builds.Jan 12 2023, 8:54 PM

Closed by commit rG4916523053d7: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing… (authored by goldstein.w.n, committed by pengfei). · Explain Why

This revision was automatically updated to reflect the committed changes.

pengfei added a commit: rG4916523053d7: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing….

vitalybuka mentioned this in rG3e198884642a: [X86] Remove unused variable after D140087.Jan 12 2023, 9:47 PM

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelDAGToDAG.cpp	29 lines
llvm/test/CodeGen/X86/legalize-shift-64.ll	2 lines
llvm/test/CodeGen/X86/not-shift.ll	637 lines

Diff 483077

llvm/lib/Target/X86/X86ISelDAGToDAG.cpp

[3,985 lines not shown] bool X86DAGToDAGISel::tryShiftAmountMod(SDNode *N) {

   // Skip over a truncate of the shift amount.
   if (ShiftAmt->getOpcode() == ISD::TRUNCATE)
     ShiftAmt = ShiftAmt->getOperand(0);
   // This function is called after X86DAGToDAGISel::matchBitExtract(),
   // so we are not afraid that we might mess up BZHI/BEXTR pattern.
   SDValue NewShiftAmt;
-  if (ShiftAmt->getOpcode() == ISD::ADD || ShiftAmt->getOpcode() == ISD::SUB) {
+  if (ShiftAmt->getOpcode() == ISD::ADD || ShiftAmt->getOpcode() == ISD::SUB ||
+      ShiftAmt->getOpcode() == ISD::XOR) {
     SDValue Add0 = ShiftAmt->getOperand(0);
     SDValue Add1 = ShiftAmt->getOperand(1);
     auto *Add0C = dyn_cast<ConstantSDNode>(Add0);
     auto *Add1C = dyn_cast<ConstantSDNode>(Add1);
-    // If we are shifting by X+/-N where N == 0 mod Size, then just shift by X
-    // to avoid the ADD/SUB.
+    // If we are shifting by X+/-/^N where N == 0 mod Size, then just shift by X
+    // to avoid the ADD/SUB/XOR.
     if (Add1C && Add1C->getAPIntValue().urem(Size) == 0) {
       NewShiftAmt = Add0;
-      // If we are shifting by N-X where N == 0 mod Size, then just shift by -X

      lebedev.ri (Done):
        add https://alive2.llvm.org/ce/z/gFh16W
        sub of constant https://alive2.llvm.org/ce/z/umLun5
        xor https://alive2.llvm.org/ce/z/ewQ7Bd

-      // to generate a NEG instead of a SUB of a constant.
+      // If we are doing a NOT on just the lower bits with (Size*N-1) -/^ X
+      // we can replace it with a NOT. In the XOR case it may save some code

      lebedev.ri (Done):
        xor https://alive2.llvm.org/ce/z/k_3jok
        sub from bias https://alive2.llvm.org/ce/z/-BBhDe
        BUT NOT sub of bias: https://alive2.llvm.org/ce/z/AY9Aa7
        and not add: https://alive2.llvm.org/ce/z/yUyTC7

      goldstein.w.n (Author, Done):
        Ah, good catch.
        Fixing and will add test case in not-shift.ll.
        Sorry for missing that.

      goldstein.w.n (Author, Done):
        Fixed. Guarded the condition against add, and only transform sub if it is (Size - 1) - X, not X - (Size - 1).
        Added some tests for it.

+      // size, in the SUB case it also may save a move.

      pengfei (Done):
        The comments inside a different block look misleading. Should be moved inside the condition.

+    } else if ((Add0C && Add0C->getAPIntValue().urem(Size) == Size - 1) ||
+               (Add1C && Add1C->getAPIntValue().urem(Size) == Size - 1)) {
+      assert(Add0C == nullptr || Add1C == nullptr);
+      assert(Add0C != nullptr || Add1C != nullptr);

      pengfei (Done):
        Remove it. This cannot be false according to the if condition.

+      auto *ConstValOp = Add0C == nullptr ? Add1C : Add0C;
+      EVT OpVT = ShiftAmt.getValueType();
+      // ISelLowering will convert this to NOT already.
+      if (ConstValOp->isAllOnes())
+        return false;
+      NewShiftAmt = CurDAG->getNOT(DL, Add0C == nullptr ? Add0 : Add1, OpVT);
+      insertDAGNode(*CurDAG, OrigShiftAmt, NewShiftAmt);

      lebedev.ri (Done):
        Since we are already here, why not just do it ourselves?
        That would be less LOC even.

      goldstein.w.n (Author, Done):
        Done.

+      // If we are shifting by N-X where N == 0 mod Size, then just shift by
+      // -X to generate a NEG instead of a SUB of a constant.

      lebedev.ri (Not Done), suggested edit:
          return false;
        - NewShiftAmt = CurDAG->getNOT(DL, Add0C == nullptr ? Add0 : Add1, OpVT);
        + SDValue NotX = CurDAG->getNOT(DL, Add0C == nullptr ? Add0 : Add1, OpVT);
          insertDAGNode(*CurDAG, OrigShiftAmt, NewShiftAmt);

     } else if (ShiftAmt->getOpcode() == ISD::SUB && Add0C &&

      lebedev.ri (Done):
        Remove newline

      goldstein.w.n (Author, Done):
        Done.

                Add0C->getZExtValue() != 0) {
       EVT SubVT = ShiftAmt.getValueType();
       SDValue X;
       if (Add0C->getZExtValue() % Size == 0)
         X = Add1;
       else if (ShiftAmt.hasOneUse() && Size == 64 &&
                Add0C->getZExtValue() % 32 == 0) {
         // We have a 64-bit shift by (n*32-x), turn it into -(x+n*32).

[2,187 lines not shown]

llvm/test/CodeGen/X86/legalize-shift-64.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=i686-unknown-unknown | FileCheck %s

 define i64 @test1(i32 %xx, i32 %test) nounwind {
 ; CHECK-LABEL: test1:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movl {{[0-9]+}}(%esp), %edx
 ; CHECK-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
 ; CHECK-NEXT: andb $7, %cl
 ; CHECK-NEXT: movl %edx, %eax
 ; CHECK-NEXT: shll %cl, %eax
 ; CHECK-NEXT: shrl %edx
-; CHECK-NEXT: xorb $31, %cl
+; CHECK-NEXT: notb %cl
 ; CHECK-NEXT: shrl %cl, %edx
 ; CHECK-NEXT: retl
 %conv = zext i32 %xx to i64
 %and = and i32 %test, 7
 %sh_prom = zext i32 %and to i64
 %shl = shl i64 %conv, %sh_prom
 ret i64 %shl
 }

[155 lines not shown]

llvm/test/CodeGen/X86/not-shift.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=-bmi,-tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X86-NOBMI2

				pengfei (Done): What's `tbm` used for. Seems it doesn't affect the result?
				goldstein.w.n (Author, Done): Removed in V3. Was unneeded. Had copied the command from another test.

				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=+bmi,-tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X86-NOBMI2
				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=+bmi,+tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X86-NOBMI2
				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=+bmi,+tbm,+bmi2 < %s | FileCheck %s --check-prefixes=X86-BMI2
				; RUN: llc -mtriple=i686-unknown-linux-gnu -mattr=+bmi,-tbm,+bmi2 < %s | FileCheck %s --check-prefixes=X86-BMI2
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=-bmi,-tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X64-NOBMI2
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi,-tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X64-NOBMI2
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi,+tbm,-bmi2 < %s | FileCheck %s --check-prefixes=X64-NOBMI2
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi,+tbm,+bmi2 < %s | FileCheck %s --check-prefixes=X64-BMI2
				; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi,-tbm,+bmi2 < %s | FileCheck %s --check-prefixes=X64-BMI2

				lebedev.ri (Not Done): Run lines are still wrong, please deduplicate them. There should be only 4 i think?



				define i64 @sub63_shiftl64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub63_shiftl64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-NOBMI2-NEXT: movb $63, %cl
				; X86-NOBMI2-NEXT: subb {{[0-9]+}}(%esp), %cl
				; X86-NOBMI2-NEXT: movl %esi, %eax
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: shldl %cl, %esi, %edx
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB0_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %eax, %edx
				; X86-NOBMI2-NEXT: xorl %eax, %eax
				; X86-NOBMI2-NEXT: .LBB0_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub63_shiftl64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movb $63, %cl
				; X86-BMI2-NEXT: subb {{[0-9]+}}(%esp), %cl
				; X86-BMI2-NEXT: shldl %cl, %eax, %edx
				; X86-BMI2-NEXT: shlxl %ecx, %eax, %eax
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB0_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %eax, %edx
				; X86-BMI2-NEXT: xorl %eax, %eax
				; X86-BMI2-NEXT: .LBB0_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub63_shiftl64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shlq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub63_shiftl64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = sub i64 63, %cnt
				%result = shl i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @xor63_shiftr64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor63_shiftr64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: xorb $63, %cl
				; X86-NOBMI2-NEXT: movl %esi, %edx
				; X86-NOBMI2-NEXT: shrl %cl, %edx
				; X86-NOBMI2-NEXT: shrdl %cl, %esi, %eax
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB1_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %edx, %eax
				; X86-NOBMI2-NEXT: xorl %edx, %edx
				; X86-NOBMI2-NEXT: .LBB1_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor63_shiftr64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: xorb $63, %cl
				; X86-BMI2-NEXT: shrdl %cl, %edx, %eax
				; X86-BMI2-NEXT: shrxl %ecx, %edx, %edx
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB1_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %edx, %eax
				; X86-BMI2-NEXT: xorl %edx, %edx
				; X86-BMI2-NEXT: .LBB1_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor63_shiftr64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shrq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor63_shiftr64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i64 %cnt, 63
				%result = lshr i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @sub127_shiftl64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub127_shiftl64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: xorb $127, %cl
				; X86-NOBMI2-NEXT: movl %esi, %eax
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: shldl %cl, %esi, %edx
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB2_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %eax, %edx
				; X86-NOBMI2-NEXT: xorl %eax, %eax
				; X86-NOBMI2-NEXT: .LBB2_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub127_shiftl64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: xorb $127, %cl
				; X86-BMI2-NEXT: shldl %cl, %eax, %edx
				; X86-BMI2-NEXT: shlxl %ecx, %eax, %eax
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB2_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %eax, %edx
				; X86-BMI2-NEXT: xorl %eax, %eax
				; X86-BMI2-NEXT: .LBB2_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub127_shiftl64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shlq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub127_shiftl64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = sub i64 127, %cnt
				%result = shl i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @xor127_shiftr64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor127_shiftr64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: xorb $127, %cl
				; X86-NOBMI2-NEXT: movl %esi, %edx
				; X86-NOBMI2-NEXT: shrl %cl, %edx
				; X86-NOBMI2-NEXT: shrdl %cl, %esi, %eax
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB3_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %edx, %eax
				; X86-NOBMI2-NEXT: xorl %edx, %edx
				; X86-NOBMI2-NEXT: .LBB3_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor127_shiftr64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: xorb $127, %cl
				; X86-BMI2-NEXT: shrdl %cl, %edx, %eax
				; X86-BMI2-NEXT: shrxl %ecx, %edx, %edx
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB3_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %edx, %eax
				; X86-BMI2-NEXT: xorl %edx, %edx
				; X86-BMI2-NEXT: .LBB3_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor127_shiftr64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shrq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor127_shiftr64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i64 %cnt, 127
				%result = lshr i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @xor64_shiftl64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor64_shiftl64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: xorb $64, %cl
				; X86-NOBMI2-NEXT: movl %esi, %eax
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: shldl %cl, %esi, %edx
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB4_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %eax, %edx
				; X86-NOBMI2-NEXT: xorl %eax, %eax
				; X86-NOBMI2-NEXT: .LBB4_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor64_shiftl64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: xorb $64, %cl
				; X86-BMI2-NEXT: shldl %cl, %eax, %edx
				; X86-BMI2-NEXT: shlxl %ecx, %eax, %eax
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB4_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %eax, %edx
				; X86-BMI2-NEXT: xorl %eax, %eax
				; X86-BMI2-NEXT: .LBB4_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor64_shiftl64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shlq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor64_shiftl64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: shlxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i64 %cnt, 64
				%result = shl i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @sub1s_shiftr64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub1s_shiftr64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: movl %esi, %edx
				; X86-NOBMI2-NEXT: shrl %cl, %edx
				; X86-NOBMI2-NEXT: shrdl %cl, %esi, %eax
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB5_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %edx, %eax
				; X86-NOBMI2-NEXT: xorl %edx, %edx
				; X86-NOBMI2-NEXT: .LBB5_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub1s_shiftr64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: notb %cl
				; X86-BMI2-NEXT: shrdl %cl, %edx, %eax
				; X86-BMI2-NEXT: shrxl %ecx, %edx, %edx
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB5_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %edx, %eax
				; X86-BMI2-NEXT: xorl %edx, %edx
				; X86-BMI2-NEXT: .LBB5_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub1s_shiftr64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shrq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub1s_shiftr64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i64 %cnt, -1
				%result = lshr i64 %val, %adjcnt
				ret i64 %result
				}

				define i64 @xor1s_shiftl64(i64 %val, i64 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor1s_shiftl64:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: pushl %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %esi
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: movl %esi, %eax
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: shldl %cl, %esi, %edx
				; X86-NOBMI2-NEXT: testb $32, %cl
				; X86-NOBMI2-NEXT: je .LBB6_2
				; X86-NOBMI2-NEXT: # %bb.1:
				; X86-NOBMI2-NEXT: movl %eax, %edx
				; X86-NOBMI2-NEXT: xorl %eax, %eax
				; X86-NOBMI2-NEXT: .LBB6_2:
				; X86-NOBMI2-NEXT: popl %esi
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor1s_shiftl64:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: movl {{[0-9]+}}(%esp), %edx
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-BMI2-NEXT: notb %cl
				; X86-BMI2-NEXT: shldl %cl, %eax, %edx
				; X86-BMI2-NEXT: shlxl %ecx, %eax, %eax
				; X86-BMI2-NEXT: testb $32, %cl
				; X86-BMI2-NEXT: je .LBB6_2
				; X86-BMI2-NEXT: # %bb.1:
				; X86-BMI2-NEXT: movl %eax, %edx
				; X86-BMI2-NEXT: xorl %eax, %eax
				; X86-BMI2-NEXT: .LBB6_2:
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor1s_shiftl64:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movq %rsi, %rcx
				; X64-NOBMI2-NEXT: movq %rdi, %rax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $rcx
				; X64-NOBMI2-NEXT: shlq %cl, %rax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor1s_shiftl64:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxq %rsi, %rdi, %rax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i64 %cnt, -1
				%result = shl i64 %val, %adjcnt
				ret i64 %result
				}

				define i32 @sub31_shiftr32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub31_shiftr32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shrl %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub31_shiftr32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shrxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub31_shiftr32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shrl %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub31_shiftr32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = sub i32 31, %cnt
				%result = lshr i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @xor31_shiftl32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor31_shiftl32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor31_shiftl32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shlxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor31_shiftl32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shll %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor31_shiftl32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i32 %cnt, 31
				%result = shl i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @sub63_shiftr32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub63_shiftr32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shrl %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub63_shiftr32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shrxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub63_shiftr32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shrl %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub63_shiftr32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = sub i32 63, %cnt
				%result = lshr i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @xor63_shiftl32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor63_shiftl32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor63_shiftl32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shlxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor63_shiftl32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shll %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor63_shiftl32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i32 %cnt, 63
				%result = shl i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @xor32_shiftr32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor32_shiftr32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: shrl %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor32_shiftr32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: shrxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor32_shiftr32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shrl %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor32_shiftr32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: shrxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i32 %cnt, 32
				%result = lshr i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @sub1s_shiftl32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: sub1s_shiftl32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shll %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: sub1s_shiftl32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shlxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: sub1s_shiftl32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shll %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: sub1s_shiftl32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shlxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i32 %cnt, -1
				%result = shl i32 %val, %adjcnt
				ret i32 %result
				}

				define i32 @xor1s_shiftr32(i32 %val, i32 %cnt) nounwind {
				; X86-NOBMI2-LABEL: xor1s_shiftr32:
				; X86-NOBMI2: # %bb.0:
				; X86-NOBMI2-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NOBMI2-NEXT: movzbl {{[0-9]+}}(%esp), %ecx
				; X86-NOBMI2-NEXT: notb %cl
				; X86-NOBMI2-NEXT: shrl %cl, %eax
				; X86-NOBMI2-NEXT: retl
				;
				; X86-BMI2-LABEL: xor1s_shiftr32:
				; X86-BMI2: # %bb.0:
				; X86-BMI2-NEXT: movzbl {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: notb %al
				; X86-BMI2-NEXT: shrxl %eax, {{[0-9]+}}(%esp), %eax
				; X86-BMI2-NEXT: retl
				;
				; X64-NOBMI2-LABEL: xor1s_shiftr32:
				; X64-NOBMI2: # %bb.0:
				; X64-NOBMI2-NEXT: movl %esi, %ecx
				; X64-NOBMI2-NEXT: movl %edi, %eax
				; X64-NOBMI2-NEXT: notb %cl
				; X64-NOBMI2-NEXT: # kill: def $cl killed $cl killed $ecx
				; X64-NOBMI2-NEXT: shrl %cl, %eax
				; X64-NOBMI2-NEXT: retq
				;
				; X64-BMI2-LABEL: xor1s_shiftr32:
				; X64-BMI2: # %bb.0:
				; X64-BMI2-NEXT: notb %sil
				; X64-BMI2-NEXT: shrxl %esi, %edi, %eax
				; X64-BMI2-NEXT: retq
				%adjcnt = xor i32 %cnt, -1
				%result = lshr i32 %val, %adjcnt
				ret i32 %result
				}
