This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
ClosedPublic

Authored by pgousseau on Aug 12 2016, 5:27 AM.

Details

Reviewers

spatel
qcolombet
RKSimon
andreadb

Commits

rGb6d652adb5b1: [X86] Take advantage of the lzcnt instruction on btver2 architectures when…
rL284248: [X86] Take advantage of the lzcnt instruction on btver2 architectures when…

Summary

Following the discussion on D22038, this change enables the setcc to srl(ctlz) transformation on the btver2 architecture.
This optimisation is beneficial on the Jaguar architecture only, where lzcnt has a good reciprocal throughput.
Other architectures such as Intel's Haswell/Broadwell or AMD's Bulldozer/Piledriver do not benefit from it.
For this reason the change also adds a "HasFastLZCNT" feature, which is enabled for Jaguar.

This patch requires D23445
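
For readers following the thread, the identity behind the setcc to srl(ctlz) rewrite can be sketched in standalone C++. This is only an illustration under assumptions: lzcnt32 and isZeroViaLzcnt are hypothetical names that do not appear in the patch, and the loop is a portable stand-in for the hardware lzcnt instruction, which returns the operand width (32) for a zero input.

```cpp
#include <cassert>
#include <cstdint>

// Portable model of a 32-bit lzcnt: number of leading zero bits,
// and 32 when the input is 0. (Hypothetical helper, not from the patch.)
unsigned lzcnt32(std::uint32_t X) {
  unsigned Count = 0;
  for (std::uint32_t Mask = 0x80000000u; Mask != 0 && (X & Mask) == 0;
       Mask >>= 1)
    ++Count;
  return Count;
}

// zext(x == 0) expressed as srl(ctlz(x), log2(32)).
// The count lies in [0, 32], and 32 is the only value with bit 5 set,
// so shifting right by 5 yields 1 exactly when X is 0.
std::uint32_t isZeroViaLzcnt(std::uint32_t X) {
  unsigned Count = lzcnt32(X);
  assert(Count <= 32 && "lzcnt of a 32-bit value is in [0, 32]");
  return Count >> 5;
}
```

The shift amount is log2 of the operand width, which is why a 64-bit operand would use a shift of 6 instead.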

Diff Detail

Repository: rL LLVM

Event Timeline

pgousseau updated this revision to Diff 67824. Aug 12 2016, 5:27 AM

pgousseau retitled this revision to [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.

pgousseau updated this object.

pgousseau added reviewers: qcolombet, andreadb, RKSimon, spatel.

pgousseau added a subscriber: llvm-commits.

Herald added a subscriber: nemanjai. Aug 12 2016, 5:27 AM

Updating patch to reflect changes in D23445

spatel added inline comments. Aug 15 2016, 9:15 AM

lib/Target/X86/X86.td
182–184 ↗	(On Diff #67858)	I would move this down with the other 'fake' features (ie, the other fast/slow attributes). Someday, we may come up with a better way to distinguish performance "features" from architectural ones. It would also be good to explain exactly what we mean by "fast" in this context. Finally, use a hyphen to make this more readable: "fast-lzcnt".
lib/Target/X86/X86ISelLowering.cpp
30780–30796 ↗	(On Diff #67858)	I don't understand the need for this check, so at the least it needs a code comment to explain why it is here. Related: if we're matching the pattern starting from a zext, doesn't that miss the icmp/icmp/or patterns that you were originally hoping to optimize in D22038?
test/CodeGen/X86/lzcnt-zext-cmp.ll
5 ↗	(On Diff #67858)	Instead of specifying a different CPU, this RUN would be better if it also used btver2, but explicitly disabled the 'fast-lzcnt' attribute. That way we verify the codegen with and without the attribute while simultaneously verifying that btver2 has this attribute by default.
7 ↗	(On Diff #67858)	Please give each test a meaningful name and/or add a comment to explain exactly what each test is checking.
81–100 ↗	(On Diff #67858)	There should be at least one test of the 'HasInterestingUses' logic (if that logic really belongs in this patch).

spatel mentioned this in D23445: [x86] Refactor a PowerPC specific ctlz/srl transformation (NFC). Aug 15 2016, 10:39 AM

pgousseau added inline comments. Aug 15 2016, 11:04 AM

lib/Target/X86/X86.td
182–184 ↗	(On Diff #67858)	Sounds good will do.
lib/Target/X86/X86ISelLowering.cpp
30780–30796 ↗	(On Diff #67858)	Sounds good, I will think of a better comment. I added this check to be conservative for now, as I noticed several occurrences of worse codegen (around 50% of matches in openssl). I hope to address this and the icmp/icmp/OR case in another patch.
test/CodeGen/X86/lzcnt-zext-cmp.ll
81–100 ↗	(On Diff #67858)	Yes I meant to add those, will do thanks.

Rebased changes and following Sanjay's comments:

Move down feature declaration, add hyphen, add comment.
Remove hasInterestingUses check.
Use fast-lzcnt in test and add comments.

Looking again at the codegen of openssl without the "hasInterestingUses" constraint, the codegen does not seem worse in terms of speed; only the size is not as good as it could be, but I think that is OK? Something must have gone wrong during my initial testing, I suppose...

Minor tweak request.

lib/Target/X86/X86InstrInfo.td
837 ↗	(On Diff #68209)	Move this down to the other fast/slow definitions.
lib/Target/X86/X86Subtarget.cpp
257 ↗	(On Diff #68209)	Move this down to the other fast/slow variables.
lib/Target/X86/X86Subtarget.h
428 ↗	(On Diff #68209)	Move this down to the other fast/slow functions.

Moving down fast-lzcnt feature changes following Simon's comments.

In D23446#516878, @pgousseau wrote:

I wasn't expecting a size difference. Can you provide more details about the size and perf changes that you see with this change? We may want to gate the transform based on 'optForSize'.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3566–3567 ↗	(On Diff #68336)	Remove "llvm::"
lib/Target/X86/X86.td
265–266 ↗	(On Diff #68336)	The description still seems too vague. The key point is that lzcnt must have the same latency and throughput as test/set, right? How about: "If lzcnt has equivalent latency/throughput to most simple integer ops, it can be used to replace test/set sequences."
test/CodeGen/X86/lzcnt-zext-cmp.ll
85–92 ↗	(On Diff #68336)	lzcnt has a 16-bit variant in the ISA. Is there some reason not to use it here?

Yes sounds like I should disable this if optForSize is enabled.

For example in openssl, out of 89 srl(lzcnt) optimisations, around 30 cases cause larger code, for example I see 25 of those:

3055:	31 ed                	xor    %ebp,%ebp
3057:	ff c2                	inc    %edx
3059:	40 0f 94 c5          	sete   %bpl
305d:	01 e9                	add    %ebp,%ecx

-> 10 bytes

309d:	ff c2                	inc    %edx
309f:	f3 0f bd ea          	lzcnt  %edx,%ebp
30a3:	c1 ed 05             	shr    $0x5,%ebp
30a6:	01 e9                	add    %ebp,%ecx

-> 11 bytes

But this should not affect performance.

Luckily, the total size of openssl's libcrypto remains smaller, as the other remaining matches result in fewer, equally fast instructions.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3566–3567 ↗	(On Diff #68336)	Sounds good, will do.
lib/Target/X86/X86.td
265–266 ↗	(On Diff #68336)	Sounds good thanks, will do.
test/CodeGen/X86/lzcnt-zext-cmp.ll
85–92 ↗	(On Diff #68336)	I disabled the 16-bit case because this seems to lead to bigger code in general, for this test case we would get:

lzcntw %di, %ax
andl $16, %eax
shrl $4, %eax
retq

instead of:

xorl %eax, %eax
testw %di, %di
sete %al
retq

which seems bigger code for the same result.
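
To make the 16-bit discussion concrete, here is a rough C++ model of the lzcntw sequence quoted above. It is a sketch under assumptions: lzcnt16, isZero16ViaLzcnt and the StaleEax parameter are hypothetical and only serve to model the fact that a 16-bit register write leaves the upper half of eax unchanged, which is why the extra masking instruction is needed.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the 16-bit lzcnt: leading zero count,
// and 16 when the input is 0. (Hypothetical helper.)
unsigned lzcnt16(std::uint16_t X) {
  unsigned Count = 0;
  for (unsigned Mask = 0x8000u; Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++Count;
  return Count;
}

// Models: lzcntw %di, %ax ; andl $16, %eax ; shrl $4, %eax
// StaleEax stands for whatever eax held before the sequence.
std::uint32_t isZero16ViaLzcnt(std::uint16_t X, std::uint32_t StaleEax) {
  // lzcntw writes only the low 16 bits of eax.
  std::uint32_t Eax = (StaleEax & 0xFFFF0000u) | lzcnt16(X);
  // andl $16: keep bit 4 (set only when the count is 16) and
  // discard the stale upper bits.
  Eax &= 16u;
  // shrl $4: move bit 4 down to bit 0.
  Eax >>= 4;
  assert(Eax <= 1u && "result is a boolean");
  return Eax;
}
```

Without the masking step, the stale upper bits would survive the shift, so the 16-bit form cannot shrink to the two-instruction lzcnt/shr pair that the 32-bit form allows.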

Following Sanjay's comments:

Remove llvm namespace
Rewrite feature's comment
Disable transform if optForSize is true

Please can you add some tests with optsize enabled?

In D23446#519225, @pgousseau wrote:

Disable transform if optForSize is true

optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".

Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.

Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

test/CodeGen/X86/lzcnt-zext-cmp.ll
86–93 ↗	(On Diff #68506)	Oh...yuck. So we really should've done 'xor %eax, %eax' ahead of the lzcnt in that case? Please add a note about why we don't do 16-bit in the code comments and also here in the test case.

Add tests for optForSize following Simon's comments.

With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text sizes from openssl with and without the change.

12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions are saved:

f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
Each function contains a block of 100 inline assembly sequences corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from there being fewer instructions, all of which have the same latency/throughput (1/0.5).

Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Let me know if more details are required.

test/CodeGen/X86/lzcnt-zext-cmp.ll
87–94 ↗	(On Diff #68577)	Sounds good, will do thanks.

Add comments for the 16-bit case following Sanjay's comment.

This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.

Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?

@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?

Sounds good, will remove the optForSize.

Yes, this benchmark seems to show that kind of improvement; for example, BM_f1_new/8 is ~25% faster than BM_f1_old/8.
However, I have now rerun the SPEC h264ref benchmark and, unfortunately, I am seeing a small (less than 1%) but consistent performance degradation that I wasn't seeing with the initial tablegen patch.
I am not sure why the micro-benchmark does not show this. One hypothesis is that the micro-benchmark only compares the 32-bit "register, register" variant of the change, so I might need to restrict the transformation to this pattern.
Alternatively, the micro-benchmark is not representative of real performance, in which case this change is probably not worth pursuing. I will come back with the findings.
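
The "34%" and "~25%" figures quoted in this exchange describe the same BM_f1 measurement in two different ways. The arithmetic can be sketched as follows; the function names are hypothetical and the only inputs are the timings from the benchmark table posted earlier in the thread.

```cpp
#include <cassert>

// Speed-up as extra work done per unit time: old/new - 1.
double throughputGainPercent(double OldNs, double NewNs) {
  assert(OldNs > 0.0 && NewNs > 0.0 && "timings must be positive");
  return (OldNs / NewNs - 1.0) * 100.0;
}

// Speed-up as time saved per run: 1 - new/old.
double timeReductionPercent(double OldNs, double NewNs) {
  assert(OldNs > 0.0 && NewNs > 0.0 && "timings must be positive");
  return (1.0 - NewNs / OldNs) * 100.0;
}
```

With the BM_f1/8 timings (784 ns old, 585 ns new), the first function gives roughly 34% and the second roughly 25%; with the BM_f2/8/8 timings (13573 ns old, 9126 ns new), the first gives roughly 49%.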

Any update on the performance investigations?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3588 ↗	(On Diff #68669)	These can be replaced with DAG.getZExtOrTrunc(Scc, dl, ExtTy);

Hi Simon/Sanjay,

Sorry for the delayed follow-up!
I have run more tests and it seems the performance regressions I was observing with SPEC's h264 are now within the noise, so I can't tell whether this patch improves or degrades performance in SPEC's h264 benchmark.
I am more confident that the OR case brings better performance because we will be replacing

48 85 FF                test     rdi,rdi
0F 94 C0                sete     al
48 85 F6                test     rsi,rsi
0F 94 C1                sete     cl
08 C1                   or       cl,al
0F B6 C1                movzx    eax,cl
C3                      ret

by this:

F3 0F BD CE             lzcnt    ecx,esi
F3 0F BD C7             lzcnt    eax,edi
09 C8                   or       eax,ecx
C1 E8 05                shr      eax,5
C3                      ret

My plan now is to make the patch to handle the OR case only, what do you guys think?
Would X86ISelLowering still be the best place if only supporting the OR case?
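
As a sanity check on the OR case, the equivalence of the two sequences can be modelled in standalone C++ for 32-bit operands. This is a sketch only: countLeadingZeros32, eitherIsZeroSetcc and eitherIsZeroLzcnt are hypothetical names, and the loop stands in for the lzcnt instruction.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a 32-bit lzcnt (returns 32 for an input of 0).
unsigned countLeadingZeros32(std::uint32_t X) {
  unsigned Count = 0;
  for (std::uint32_t Mask = 0x80000000u; Mask != 0 && (X & Mask) == 0;
       Mask >>= 1)
    ++Count;
  return Count;
}

// Baseline: two compare-with-zero results combined with OR, zero-extended.
std::uint32_t eitherIsZeroSetcc(std::uint32_t A, std::uint32_t B) {
  return (A == 0 || B == 0) ? 1u : 0u;
}

// Rewritten form: OR the two counts, then shift once by log2(32).
// Bit 5 of the merged value is set exactly when at least one count is 32,
// i.e. when at least one input is zero.
std::uint32_t eitherIsZeroLzcnt(std::uint32_t A, std::uint32_t B) {
  std::uint32_t Merged = countLeadingZeros32(A) | countLeadingZeros32(B);
  assert(Merged <= 63u && "OR of two values in [0, 32] fits in 6 bits");
  return Merged >> 5;
}
```

Because the shift is applied after the OR, only one shr is needed regardless of how many counts are merged.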

Thanks!

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3588 ↗	(On Diff #68669)	Will do thanks!

The OR case certainly looks better in isolation (fewer instructions, smaller code size). If you are measuring perf improvements from that alone, I think we can be more confident that the transform to lzcnt is the source of that improvement. It's still not clear to me how the micro-benchmark was improved so much for the simpler case.

To match the OR pattern, I think you would either add some code to visitOr() or add tablegen patterns if it is possible to match the DAG nodes that way.

Hi Simon/Sanjay,

Following up on the previous comments, this patch contains:

Use of DAG.getZExtOrTrunc as per Simon's comment.
Removed support for the simple case.
Added handling of OR based patterns.
Added a case for SRL nodes in 'isTypeDesirableForOp()' so as to favor the 32-bit encoding when targeting X86.
Added support for multiple OR patterns, e.g. (a1||a2||a3||a4||a5)
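
The multiple-OR case listed above generalises naturally: every compare contributes one count, the counts are OR-ed together, and a single shift extracts the result. A minimal sketch for 64-bit inputs follows, assuming hypothetical helper names (ctlz64, anyIsZero64) and modelling only the arithmetic, not the DAG combine itself.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Portable stand-in for a 64-bit lzcnt (returns 64 for an input of 0).
unsigned ctlz64(std::uint64_t X) {
  unsigned Count = 0;
  for (std::uint64_t Mask = std::uint64_t{1} << 63;
       Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++Count;
  return Count;
}

// Models srl(or(ctlz(a1), ..., ctlz(an)), log2(64)).
// Each count is in [0, 64]; bit 6 of the merged value is set exactly
// when some count equals 64, i.e. when some input is zero.
std::uint64_t anyIsZero64(std::initializer_list<std::uint64_t> Values) {
  assert(Values.size() >= 2 && "the OR pattern needs at least two compares");
  std::uint64_t Merged = 0;
  for (std::uint64_t V : Values)
    Merged |= ctlz64(V);
  return Merged >> 6;
}
```

For example, anyIsZero64({1, 2, 0, 4, 5}) evaluates to 1 because the third count is 64, while the same call with no zero operand evaluates to 0.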

Let me know what you think,

Thanks!

Some quick remarks - I'll do more of a review later after others have had a chance to look at it.

lib/Target/X86/X86ISelLowering.cpp
29016 ↗	(On Diff #73497)	Should we return early if !Subtarget.isCtlzFast()?
29022 ↗	(On Diff #73497)	Are we letting vector types through here?
31797 ↗	(On Diff #73497)	Comment this. Should this be part of a separate patch with its own tests? It's fine if it's only exposed by this patch to leave it here, but it should have a comment either way.
test/CodeGen/X86/lzcnt-zext-cmp.ll
7 ↗	(On Diff #73497)	Test single input version to make sure it isn't being used.

pgousseau added inline comments. Oct 5 2016, 10:05 AM

lib/Target/X86/X86ISelLowering.cpp
29016 ↗	(On Diff #73497)	Yes that makes sense.
29022 ↗	(On Diff #73497)	Yes it seems that way, will have to restrict to 32-bit and 64-bit integers, thanks for spotting this.
31797 ↗	(On Diff #73497)	Yes would make sense to have a separate patch, will try to find an associated test case.
test/CodeGen/X86/lzcnt-zext-cmp.ll
7 ↗	(On Diff #73497)	Makes sense.

Following Simon's comments:

Add an early return in case isCtlzFast is false.
Add a test checking that no transformation occurs on single input cases.
Prevent transformation on 128-bit cases: int foo(int128_t a, int128_t b) { return a || b;} as this pessimized the codegen.
- This requires constraining the pattern to the X86 form of an equality comparison with 0.
- This removes the need from the earlier patch to modify isTypeDesirableForOp().

Possibly better test names than barXXXX, other than that I'm happy with this - @spatel ?

spatel added inline comments. Oct 13 2016, 9:01 AM

lib/Target/X86/X86ISelLowering.cpp
29037–29042 ↗	(On Diff #74136)	The pattern must always begin with zext. Is there some reason not to start in combineZext() rather than combineOr()?

Following Simon and Sanjay comments:

Renamed tests to 'test_zext_cmpXX'
Start pattern matching from combineZext instead of combineOr.

Also added missing "hasOneUse" checks to OR nodes.
Removed one unneeded check for "isSetCCCandidate"

Let me know what you guys think,

Thanks,

Pierre

LGTM. See inline for nits.

lib/Target/X86/X86ISelLowering.cpp
4191 ↗	(On Diff #74661)	Could remove or assert hasLZCNT()?
29105–29107 ↗	(On Diff #74661)	No need for braces here.
test/CodeGen/X86/lzcnt-zext-cmp.ll
201 ↗	(On Diff #74661)	Update name: "bar6"

This revision is now accepted and ready to land. Oct 14 2016, 8:04 AM

pgousseau added inline comments. Oct 14 2016, 8:59 AM

lib/Target/X86/X86ISelLowering.cpp
4191 ↗	(On Diff #74661)	Yes makes sense, will remove before commit.
29105–29107 ↗	(On Diff #74661)	Will remove before commit.
test/CodeGen/X86/lzcnt-zext-cmp.ll
201 ↗	(On Diff #74661)	Thanks for spotting this, will fix before commit.

Closed by commit rL284248: [X86] Take advantage of the lzcnt instruction on btver2 architectures when… (authored by pgousseau). Oct 14 2016, 9:50 AM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in D22038: [X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture. Oct 17 2016, 7:48 AM

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/ (6 files)	7 lines, 2 lines, 114 lines, 1 line, 4 lines, 1 line
llvm/trunk/test/CodeGen/X86/lzcnt-zext-cmp.ll	341 lines

Diff 74705

llvm/trunk/lib/Target/X86/X86.td

@@ ... @@
 // But if the code is scalar that probably means that the code has some kind of
 // dependency and we should care more about reducing the latency.
 def FeatureFastScalarFSQRT
   : SubtargetFeature<"fast-scalar-fsqrt", "HasFastScalarFSQRT",
                      "true", "Scalar SQRT is fast (disable Newton-Raphson)">;
 def FeatureFastVectorFSQRT
   : SubtargetFeature<"fast-vector-fsqrt", "HasFastVectorFSQRT",
                      "true", "Vector SQRT is fast (disable Newton-Raphson)">;
+// If lzcnt has equivalent latency/throughput to most simple integer ops, it can
+// be used to replace test/set sequences.
+def FeatureFastLZCNT
+    : SubtargetFeature<
+          "fast-lzcnt", "HasFastLZCNT", "true",
+          "LZCNT instructions are as fast as most simple integer ops">;

 //===----------------------------------------------------------------------===//
 // X86 processors supported.
 //===----------------------------------------------------------------------===//

 include "X86Schedule.td"

 def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
@@ ... @@ def : ProcessorModel<"btver2", BtVer2Model, [
   FeatureCMPXCHG16B,
   FeaturePRFCHW,
   FeatureAES,
   FeaturePCLMUL,
   FeatureBMI,
   FeatureF16C,
   FeatureMOVBE,
   FeatureLZCNT,
+  FeatureFastLZCNT,
   FeaturePOPCNT,
   FeatureXSAVE,
   FeatureXSAVEOPT,
   FeatureSlowSHLD,
   FeatureLAHFSAHF,
   FeatureFastPartialYMMWrite
 ]>;

@@ ... @@

llvm/trunk/lib/Target/X86/X86ISelLowering.h

@@ ... @@ public:

   /// This method returns the name of a target specific DAG node.
   const char *getTargetNodeName(unsigned Opcode) const override;

   bool isCheapToSpeculateCttz() const override;

   bool isCheapToSpeculateCtlz() const override;

+  bool isCtlzFast() const override;
+
   bool hasBitPreservingFPLogic(EVT VT) const override {
     return VT == MVT::f32 || VT == MVT::f64 || VT.isVector();
   }

   bool isMultiStoresCheaperThanBitsMerge(SDValue Lo,
                                          SDValue Hi) const override {
     // If the pair to store is a mixture of float and int values, we will
     // save two bitwise instructions and one float-to-int instruction and
@@ ... @@

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 4,172 Lines • ▼ Show 20 Lines	bool X86TargetLowering::isCheapToSpeculateCttz() const {
return Subtarget.hasBMI();		return Subtarget.hasBMI();
}		}

bool X86TargetLowering::isCheapToSpeculateCtlz() const {		bool X86TargetLowering::isCheapToSpeculateCtlz() const {
// Speculate ctlz only if we can directly use LZCNT.		// Speculate ctlz only if we can directly use LZCNT.
return Subtarget.hasLZCNT();		return Subtarget.hasLZCNT();
}		}

		bool X86TargetLowering::isCtlzFast() const {
		return Subtarget.hasFastLZCNT();
		}

bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {		bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {
if (!Subtarget.hasBMI())		if (!Subtarget.hasBMI())
return false;		return false;

// There are only 32-bit and 64-bit forms for 'andn'.		// There are only 32-bit and 64-bit forms for 'andn'.
EVT VT = Y.getValueType();		EVT VT = Y.getValueType();
if (VT != MVT::i32 && VT != MVT::i64)		if (VT != MVT::i32 && VT != MVT::i64)
return false;		return false;
▲ Show 20 Lines • Show All 24,896 Lines • ▼ Show 20 Lines	static SDValue combineLogicBlendIntoPBLENDV(SDNode *N, SelectionDAG &DAG,

X = DAG.getBitcast(BlendVT, X);		X = DAG.getBitcast(BlendVT, X);
Y = DAG.getBitcast(BlendVT, Y);		Y = DAG.getBitcast(BlendVT, Y);
Mask = DAG.getBitcast(BlendVT, Mask);		Mask = DAG.getBitcast(BlendVT, Mask);
Mask = DAG.getNode(ISD::VSELECT, DL, BlendVT, Mask, Y, X);		Mask = DAG.getNode(ISD::VSELECT, DL, BlendVT, Mask, Y, X);
return DAG.getBitcast(VT, Mask);		return DAG.getBitcast(VT, Mask);
}		}

		// Helper function for combineOrCmpEqZeroToCtlzSrl
		// Transforms:
		// seteq(cmp x, 0)
		// into:
		// srl(ctlz x), log2(bitsize(x))
		// Input pattern is checked by caller.
		SDValue lowerX86CmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy, SelectionDAG &DAG) {
		SDValue Cmp = Op.getOperand(1);
		EVT VT = Cmp.getOperand(0).getValueType();
		unsigned Log2b = Log2_32(VT.getSizeInBits());
		SDLoc dl(Op);
		SDValue Clz = DAG.getNode(ISD::CTLZ, dl, VT, Cmp->getOperand(0));
		// The result of the shift is true or false, and on X86, the 32-bit
		// encoding of shr and lzcnt is more desirable.
		SDValue Trunc = DAG.getZExtOrTrunc(Clz, dl, MVT::i32);
		SDValue Scc = DAG.getNode(ISD::SRL, dl, MVT::i32, Trunc,
		DAG.getConstant(Log2b, dl, VT));
		return DAG.getZExtOrTrunc(Scc, dl, ExtTy);
		}

		// Try to transform:
		// zext(or(setcc(eq, (cmp x, 0)), setcc(eq, (cmp y, 0))))
		// into:
		// srl(or(ctlz(x), ctlz(y)), log2(bitsize(x))
		// Will also attempt to match more generic cases, eg:
		// zext(or(or(setcc(eq, cmp 0), setcc(eq, cmp 0)), setcc(eq, cmp 0)))
		// Only applies if the target supports the FastLZCNT feature.
		static SDValue combineOrCmpEqZeroToCtlzSrl(SDNode *N, SelectionDAG &DAG,
		TargetLowering::DAGCombinerInfo &DCI,
		const X86Subtarget &Subtarget) {
		if (DCI.isBeforeLegalize() \|\| !Subtarget.getTargetLowering()->isCtlzFast())
		return SDValue();

		auto isORCandidate = [](SDValue N) {
		return (N->getOpcode() == ISD::OR && N->hasOneUse());
		};

		// Check the zero extend is extending to 32-bit or more. The code generated by
		// srl(ctlz) for 16-bit or less variants of the pattern would require extra
		// instructions to clear the upper bits.
		if (!N->hasOneUse() \|\| !N->getSimpleValueType(0).bitsGE(MVT::i32) \|\|
		!isORCandidate(N->getOperand(0)))
		return SDValue();

		// Check the node matches: setcc(eq, cmp 0)
		auto isSetCCCandidate = [](SDValue N) {
		return N->getOpcode() == X86ISD::SETCC && N->hasOneUse() &&
		X86::CondCode(N->getConstantOperandVal(0)) == X86::COND_E &&
		N->getOperand(1).getOpcode() == X86ISD::CMP &&
		N->getOperand(1).getConstantOperandVal(1) == 0 &&
		N->getOperand(1).getValueType().bitsGE(MVT::i32);
		};

		SDNode *OR = N->getOperand(0).getNode();
		SDValue LHS = OR->getOperand(0);
		SDValue RHS = OR->getOperand(1);

		// Save nodes matching or(or, setcc(eq, cmp 0)).
		SmallVector<SDNode *, 2> ORNodes;
		while (((isORCandidate(LHS) && isSetCCCandidate(RHS)) ||
		(isORCandidate(RHS) && isSetCCCandidate(LHS)))) {
		ORNodes.push_back(OR);
		OR = (LHS->getOpcode() == ISD::OR) ? LHS.getNode() : RHS.getNode();
		LHS = OR->getOperand(0);
		RHS = OR->getOperand(1);
		}

		// The last OR node should match or(setcc(eq, cmp 0), setcc(eq, cmp 0)).
		if (!(isSetCCCandidate(LHS) && isSetCCCandidate(RHS)) ||
		!isORCandidate(SDValue(OR, 0)))
		return SDValue();

		// We have an or(setcc(eq, cmp 0), setcc(eq, cmp 0)) pattern, try to lower it
		// to or(srl(ctlz), srl(ctlz)).
		// The DAG combiner can then fold it into srl(or(ctlz, ctlz)).
		EVT VT = OR->getValueType(0);
		SDValue NewLHS = lowerX86CmpEqZeroToCtlzSrl(LHS, VT, DAG);
		SDValue Ret, NewRHS;
		if (NewLHS && (NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG)))
		Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, NewLHS, NewRHS);

		if (!Ret)
		return SDValue();

		// Try to lower nodes matching the or(or, setcc(eq, cmp 0)) pattern.
		while (ORNodes.size() > 0) {
		OR = ORNodes.pop_back_val();
		LHS = OR->getOperand(0);
		RHS = OR->getOperand(1);
		// The input may be or(setcc(eq, cmp 0), or); swap so that RHS is the setcc.
		if (RHS->getOpcode() == ISD::OR)
		std::swap(LHS, RHS);
		EVT VT = OR->getValueType(0);
		SDValue NewRHS = lowerX86CmpEqZeroToCtlzSrl(RHS, VT, DAG);
		if (!NewRHS)
		return SDValue();
		Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, Ret, NewRHS);
		}

		if (Ret)
		Ret = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), N->getValueType(0), Ret);

		return Ret;
		}
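The combine above can fold a whole chain of ORed compares into a single shift because each ctlz result is at most 32, so bit 5 of the ORed value is set exactly when at least one input is zero. A small sketch of that reasoning for three 32-bit inputs, assuming a hypothetical software `ctlz32` helper (illustrative names only, not the patch's code):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LZCNT: returns 32 for a zero input.
static unsigned ctlz32(uint32_t X) {
  unsigned Count = 0;
  while (Count < 32 && ((X >> (31 - Count)) & 1u) == 0)
    ++Count;
  return Count;
}

// srl(or(ctlz(a), ctlz(b), ctlz(c)), 5): the OR keeps bit 5 if any operand
// is 32, and the shift discards the lower five bits.
static uint32_t anyZeroViaCtlz(uint32_t A, uint32_t B, uint32_t C) {
  return (ctlz32(A) | ctlz32(B) | ctlz32(C)) >> 5;
}

// Reference form: zext(or(setcc(eq, a, 0), setcc(eq, b, 0), setcc(eq, c, 0))).
static uint32_t anyZeroViaSetCC(uint32_t A, uint32_t B, uint32_t C) {
  return (A == 0 || B == 0 || C == 0) ? 1u : 0u;
}

// Check every combination of a few boundary values.
static void checkOrChain() {
  const uint32_t Samples[] = {0u, 1u, 31u, 32u, 0x80000000u, 0xFFFFFFFFu};
  for (uint32_t A : Samples)
    for (uint32_t B : Samples)
      for (uint32_t C : Samples)
        assert(anyZeroViaCtlz(A, B, C) == anyZeroViaSetCC(A, B, C));
}
```

This mirrors the shape of the expected output for test_zext_cmp6, where three lzcntl results are ORed before a single shrl $5.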

static SDValue combineOr(SDNode *N, SelectionDAG &DAG,		static SDValue combineOr(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalizeOps())		if (DCI.isBeforeLegalizeOps())
return SDValue();		return SDValue();

if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))		if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))
return R;		return R;
[...]
if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))		if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
return R;		return R;

if (SDValue DivRem8 = getDivRem8(N, DAG))		if (SDValue DivRem8 = getDivRem8(N, DAG))
return DivRem8;		return DivRem8;

if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))		if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))
return NewAdd;		return NewAdd;

		if (SDValue R = combineOrCmpEqZeroToCtlzSrl(N, DAG, DCI, Subtarget))
		return R;

return SDValue();		return SDValue();
}		}

/// Optimize x == -y --> x+y == 0		/// Optimize x == -y --> x+y == 0
/// x != -y --> x+y != 0		/// x != -y --> x+y != 0
static SDValue combineSetCC(SDNode *N, SelectionDAG &DAG,		static SDValue combineSetCC(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();		ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();

llvm/trunk/lib/Target/X86/X86InstrInfo.td

	def OptForSize : Predicate<"OptForSize">;			def OptForSize : Predicate<"OptForSize">;
	def OptForMinSize : Predicate<"OptForMinSize">;			def OptForMinSize : Predicate<"OptForMinSize">;
	def OptForSpeed : Predicate<"!OptForSize">;			def OptForSpeed : Predicate<"!OptForSize">;
	def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;			def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;
	def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;			def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;
	def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;			def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;
	def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;			def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;
	def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;			def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;
				def HasFastLZCNT : Predicate<"Subtarget->hasFastLZCNT()">;
	def HasMFence : Predicate<"Subtarget->hasMFence()">;			def HasMFence : Predicate<"Subtarget->hasMFence()">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// X86 Instruction Format Definitions.			// X86 Instruction Format Definitions.
	//			//

	include "X86InstrFormats.td"			include "X86InstrFormats.td"


llvm/trunk/lib/Target/X86/X86Subtarget.h

protected:		protected:
/// True if 8-bit divisions are significantly faster than		/// True if 8-bit divisions are significantly faster than
/// 32-bit divisions and should be used when possible.		/// 32-bit divisions and should be used when possible.
bool HasSlowDivide32;		bool HasSlowDivide32;

/// True if 16-bit divides are significantly faster than		/// True if 16-bit divides are significantly faster than
/// 64-bit divisions and should be used when possible.		/// 64-bit divisions and should be used when possible.
bool HasSlowDivide64;		bool HasSlowDivide64;

		/// True if LZCNT instruction is fast.
		bool HasFastLZCNT;

/// True if the short functions should be padded to prevent		/// True if the short functions should be padded to prevent
/// a stall when returning too early.		/// a stall when returning too early.
bool PadShortFunctions;		bool PadShortFunctions;

/// True if the Calls with memory reference should be converted		/// True if the Calls with memory reference should be converted
/// to a register-based indirect call.		/// to a register-based indirect call.
bool CallRegIndirect;		bool CallRegIndirect;

[...]
public:		public:
bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }		bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }
bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }		bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }
bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }		bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }
bool hasCmpxchg16b() const { return HasCmpxchg16b; }		bool hasCmpxchg16b() const { return HasCmpxchg16b; }
bool useLeaForSP() const { return UseLeaForSP; }		bool useLeaForSP() const { return UseLeaForSP; }
bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }		bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }
bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }		bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }
bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }		bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }
		bool hasFastLZCNT() const { return HasFastLZCNT; }
bool hasSlowDivide32() const { return HasSlowDivide32; }		bool hasSlowDivide32() const { return HasSlowDivide32; }
bool hasSlowDivide64() const { return HasSlowDivide64; }		bool hasSlowDivide64() const { return HasSlowDivide64; }
bool padShortFunctions() const { return PadShortFunctions; }		bool padShortFunctions() const { return PadShortFunctions; }
bool callRegIndirect() const { return CallRegIndirect; }		bool callRegIndirect() const { return CallRegIndirect; }
bool LEAusesAG() const { return LEAUsesAG; }		bool LEAusesAG() const { return LEAUsesAG; }
bool slowLEA() const { return SlowLEA; }		bool slowLEA() const { return SlowLEA; }
bool slowIncDec() const { return SlowIncDec; }		bool slowIncDec() const { return SlowIncDec; }
bool hasCDI() const { return HasCDI; }		bool hasCDI() const { return HasCDI; }

llvm/trunk/lib/Target/X86/X86Subtarget.cpp

void X86Subtarget::initializeEnvironment() {		void X86Subtarget::initializeEnvironment() {
IsUAMem16Slow = false;		IsUAMem16Slow = false;
IsUAMem32Slow = false;		IsUAMem32Slow = false;
HasSSEUnalignedMem = false;		HasSSEUnalignedMem = false;
HasCmpxchg16b = false;		HasCmpxchg16b = false;
UseLeaForSP = false;		UseLeaForSP = false;
HasFastPartialYMMWrite = false;		HasFastPartialYMMWrite = false;
HasFastScalarFSQRT = false;		HasFastScalarFSQRT = false;
HasFastVectorFSQRT = false;		HasFastVectorFSQRT = false;
		HasFastLZCNT = false;
HasSlowDivide32 = false;		HasSlowDivide32 = false;
HasSlowDivide64 = false;		HasSlowDivide64 = false;
PadShortFunctions = false;		PadShortFunctions = false;
CallRegIndirect = false;		CallRegIndirect = false;
LEAUsesAG = false;		LEAUsesAG = false;
SlowLEA = false;		SlowLEA = false;
SlowIncDec = false;		SlowIncDec = false;
stackAlignment = 4;		stackAlignment = 4;

llvm/trunk/test/CodeGen/X86/lzcnt-zext-cmp.ll

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; Test patterns which generate lzcnt instructions.
				; Eg: zext(or(setcc(cmp), setcc(cmp))) -> shr(or(lzcnt, lzcnt))
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mcpu=btver2 | FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mcpu=btver2 -mattr=-fast-lzcnt | FileCheck --check-prefix=NOFASTLZCNT %s

				; Test one 32-bit input, output is 32-bit, no transformations expected.
				define i32 @test_zext_cmp0(i32 %a) {
				; CHECK-LABEL: test_zext_cmp0:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: xorl %eax, %eax
				; CHECK-NEXT: testl %edi, %edi
				; CHECK-NEXT: sete %al
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp0:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: xorl %eax, %eax
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%conv = zext i1 %cmp to i32
				ret i32 %conv
				}

				; Test two 32-bit inputs, output is 32-bit.
				define i32 @test_zext_cmp1(i32 %a, i32 %b) {
				; CHECK-LABEL: test_zext_cmp1:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntl %edi, %ecx
				; CHECK-NEXT: lzcntl %esi, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp1:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i32
				ret i32 %lor.ext
				}

				; Test two 64-bit inputs, output is 64-bit.
				define i64 @test_zext_cmp2(i64 %a, i64 %b) {
				; CHECK-LABEL: test_zext_cmp2:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntq %rdi, %rcx
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp2:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i64 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i64
				ret i64 %lor.ext
				}

				; Test two 16-bit inputs, output is 16-bit.
				; The transform is disabled for the 16-bit case, as we still have to clear the
				; upper 16 bits, adding one more instruction.
				define i16 @test_zext_cmp3(i16 %a, i16 %b) {
				; CHECK-LABEL: test_zext_cmp3:
				; CHECK: # BB#0:
				; CHECK-NEXT: testw %di, %di
				; CHECK-NEXT: sete %al
				; CHECK-NEXT: testw %si, %si
				; CHECK-NEXT: sete %cl
				; CHECK-NEXT: orb %al, %cl
				; CHECK-NEXT: movzbl %cl, %eax
				; CHECK-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp3:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testw %di, %di
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testw %si, %si
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i16 %a, 0
				%cmp1 = icmp eq i16 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i16
				ret i16 %lor.ext
				}

				; Test two 32-bit inputs, output is 64-bit.
				define i64 @test_zext_cmp4(i32 %a, i32 %b) {
				; CHECK-LABEL: test_zext_cmp4:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %ecx
				; CHECK-NEXT: lzcntl %esi, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp4:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%0 = or i1 %cmp, %cmp1
				%conv = zext i1 %0 to i64
				ret i64 %conv
				}

				; Test two 64-bit inputs, output is 32-bit.
				define i32 @test_zext_cmp5(i64 %a, i64 %b) {
				; CHECK-LABEL: test_zext_cmp5:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntq %rdi, %rcx
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp5:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i64 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%0 = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %0 to i32
				ret i32 %lor.ext
				}

				; Test three 32-bit inputs, output is 32-bit.
				define i32 @test_zext_cmp6(i32 %a, i32 %b, i32 %c) {
				; CHECK-LABEL: test_zext_cmp6:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %ecx
				; CHECK-NEXT: orl %eax, %ecx
				; CHECK-NEXT: lzcntl %edx, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp6:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %cl, %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp2 = icmp eq i32 %c, 0
				%.cmp2 = or i1 %or.cond, %cmp2
				%lor.ext = zext i1 %.cmp2 to i32
				ret i32 %lor.ext
				}

				; Test three 32-bit inputs, output is 32-bit, but compared to test_zext_cmp6 test,
				; %.cmp2 inputs' order is inverted.
				define i32 @test_zext_cmp7(i32 %a, i32 %b, i32 %c) {
				; CHECK-LABEL: test_zext_cmp7:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %ecx
				; CHECK-NEXT: orl %eax, %ecx
				; CHECK-NEXT: lzcntl %edx, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp7:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %cl, %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp2 = icmp eq i32 %c, 0
				%.cmp2 = or i1 %cmp2, %or.cond
				%lor.ext = zext i1 %.cmp2 to i32
				ret i32 %lor.ext
				}

				; Test four 32-bit inputs, output is 32-bit.
				define i32 @test_zext_cmp8(i32 %a, i32 %b, i32 %c, i32 %d) {
				; CHECK-LABEL: test_zext_cmp8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %esi
				; CHECK-NEXT: lzcntl %edx, %edx
				; CHECK-NEXT: orl %eax, %esi
				; CHECK-NEXT: lzcntl %ecx, %eax
				; CHECK-NEXT: orl %edx, %eax
				; CHECK-NEXT: orl %esi, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp8:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %dil
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %dil, %al
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %dl
				; NOFASTLZCNT-NEXT: testl %ecx, %ecx
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %dl, %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp3 = icmp eq i32 %c, 0
				%or.cond5 = or i1 %or.cond, %cmp3
				%cmp4 = icmp eq i32 %d, 0
				%.cmp4 = or i1 %or.cond5, %cmp4
				%lor.ext = zext i1 %.cmp4 to i32
				ret i32 %lor.ext
				}

				; Test one 32-bit input, one 64-bit input, output is 32-bit.
				define i32 @test_zext_cmp9(i32 %a, i64 %b) {
				; CHECK-LABEL: test_zext_cmp9:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: lzcntl %edi, %ecx
				; CHECK-NEXT: shrl $5, %ecx
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp9:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%0 = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %0 to i32
				ret i32 %lor.ext
				}
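In test_zext_cmp9 the expected output keeps two separate shifts (shrl $5 and shrl $6) ahead of the orl, unlike the same-width tests. One way to see why the shifts cannot be merged when the input widths differ is the sketch below, which assumes a hypothetical software `ctlz` helper and illustrative function names that do not appear in the patch:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for LZCNT at a given width (32 or 64): returns
// Bits when the low Bits bits of X are all zero.
static unsigned ctlz(uint64_t X, unsigned Bits) {
  unsigned Count = 0;
  while (Count < Bits && ((X >> (Bits - 1 - Count)) & 1u) == 0)
    ++Count;
  return Count;
}

// Shift each count by log2 of its own width, then OR the 0/1 results.
static uint32_t anyZeroMixed(uint32_t A, uint64_t B) {
  return (ctlz(A, 32) >> 5) | (ctlz(B, 64) >> 6);
}

// Deliberately incorrect variant: OR first, then a single shift by 5.
// A nonzero 64-bit input with exactly 32 leading zeros sets bit 5 too.
static uint32_t anyZeroMergedShift(uint32_t A, uint64_t B) {
  return (ctlz(A, 32) | ctlz(B, 64)) >> 5;
}
```

For example, with A = 1 and B = 0x80000000 neither input is zero, so `anyZeroMixed` returns 0, while the merged form returns 1 because the 64-bit count is 32.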

				; Test 2 128-bit inputs, output is 32-bit, no transformations expected.
				define i32 @test_zext_cmp10(i64 %a.coerce0, i64 %a.coerce1, i64 %b.coerce0, i64 %b.coerce1) {
				; CHECK-LABEL: test_zext_cmp10:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: orq %rsi, %rdi
				; CHECK-NEXT: sete %al
				; CHECK-NEXT: orq %rcx, %rdx
				; CHECK-NEXT: sete %cl
				; CHECK-NEXT: orb %al, %cl
				; CHECK-NEXT: movzbl %cl, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: test_zext_cmp10:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: orq %rsi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orq %rcx, %rdx
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%a.sroa.2.0.insert.ext = zext i64 %a.coerce1 to i128
				%a.sroa.2.0.insert.shift = shl nuw i128 %a.sroa.2.0.insert.ext, 64
				%a.sroa.0.0.insert.ext = zext i64 %a.coerce0 to i128
				%a.sroa.0.0.insert.insert = or i128 %a.sroa.2.0.insert.shift, %a.sroa.0.0.insert.ext
				%b.sroa.2.0.insert.ext = zext i64 %b.coerce1 to i128
				%b.sroa.2.0.insert.shift = shl nuw i128 %b.sroa.2.0.insert.ext, 64
				%b.sroa.0.0.insert.ext = zext i64 %b.coerce0 to i128
				%b.sroa.0.0.insert.insert = or i128 %b.sroa.2.0.insert.shift, %b.sroa.0.0.insert.ext
				%cmp = icmp eq i128 %a.sroa.0.0.insert.insert, 0
				%cmp3 = icmp eq i128 %b.sroa.0.0.insert.insert, 0
				%0 = or i1 %cmp, %cmp3
				%lor.ext = zext i1 %0 to i32
				ret i32 %lor.ext
				}