This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
Closed · Public

Authored by pgousseau on Aug 12 2016, 5:27 AM.

Details

Reviewers

spatel
qcolombet
RKSimon
andreadb

Commits

rGb6d652adb5b1: [X86] Take advantage of the lzcnt instruction on btver2 architectures when…
rL284248: [X86] Take advantage of the lzcnt instruction on btver2 architectures when…

Summary

Following the discussion on D22038, this change enables the setcc to srl(ctlz) transformation on the btver2 architecture.
This optimisation is beneficial only on the Jaguar architecture, where lzcnt has a good reciprocal throughput.
Other architectures such as Intel's Haswell/Broadwell or AMD's Bulldozer/Piledriver do not benefit from it.
For this reason the change also adds a "HasFastLZCNT" feature, which gets enabled for Jaguar.
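
As an illustration of the transformation named in the summary, the sketch below checks in plain C++ that a zero-extended equality comparison with zero gives the same value as a leading-zero count shifted right by 5, for 32-bit inputs. The helper names `lzcnt32`, `isZeroViaLzcnt` and `checkSamples` are hypothetical, and the loop is only a portable stand-in for the LZCNT instruction; this is not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for LZCNT on a 32-bit operand: counts leading zero bits
// and returns 32 when the input is zero.
static unsigned lzcnt32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0;
       mask >>= 1)
    ++n;
  return n;
}

// zext(seteq(x, 0)) written as srl(ctlz(x), log2(32)).
// Only a count of exactly 32 has bit 5 set, so the shift yields 1 for x == 0
// and 0 for every other input.
static std::uint32_t isZeroViaLzcnt(std::uint32_t x) {
  return lzcnt32(x) >> 5;
}

// Compares the rewritten form against the plain comparison on a few samples.
static void checkSamples() {
  const std::uint32_t samples[] = {0u, 1u, 2u, 12345u, 0x80000000u,
                                   0xFFFFFFFFu};
  for (std::uint32_t x : samples)
    assert(isZeroViaLzcnt(x) == (x == 0 ? 1u : 0u));
}
```

Calling `checkSamples()` from any test driver exercises the identity; whether the rewritten form is actually faster depends on the target, which is what the "fast lzcnt" feature is meant to capture.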

This patch requires D23445

Diff Detail

Event Timeline

pgousseau updated this revision to Diff 67824. Aug 12 2016, 5:27 AM

pgousseau retitled this revision from to [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures..

pgousseau updated this object.

pgousseau added reviewers: qcolombet, andreadb, RKSimon, spatel.

pgousseau added a subscriber: llvm-commits.

Herald added a subscriber: nemanjai. Aug 12 2016, 5:27 AM

Updating patch to reflect changes in D23445

spatel added inline comments. Aug 15 2016, 9:15 AM

lib/Target/X86/X86.td
182–184	I would move this down with the other 'fake' features (ie, the other fast/slow attributes). Someday, we may come up with a better way to distinguish performance "features" from architectural ones. It would also be good to explain exactly what we mean by "fast" in this context. Finally, use a hyphen to make this more readable: "fast-lzcnt".
lib/Target/X86/X86ISelLowering.cpp
31095–31111	I don't understand the need for this check, so at the least it needs a code comment to explain why it is here. Related: if we're matching the pattern starting from a zext, doesn't that miss the icmp/icmp/or patterns that you were originally hoping to optimize in D22038?
test/CodeGen/X86/lzcnt-zext-cmp.ll
6	Instead of specifying a different CPU, this RUN would be better if it also used btver2, but explicitly disabled the 'fast-lzcnt' attribute. That way we verify the codegen with and without the attribute while simultaneously verifying that btver2 has this attribute by default.
8	Please give each test a meaningful name and/or add a comment to explain exactly what each test is checking.
82–101	There should be at least one test of the 'HasInterestingUses' logic (if that logic really belongs in this patch).

spatel mentioned this in D23445: [x86] Refactor a PowerPC specific ctlz/srl transformation (NFC). Aug 15 2016, 10:39 AM

pgousseau added inline comments. Aug 15 2016, 11:04 AM

lib/Target/X86/X86.td
182–184	Sounds good will do.
lib/Target/X86/X86ISelLowering.cpp
31095–31111	Sounds good, I will think of a better comment. I added this check to be conservative for now, as I noticed several cases of worse codegen (around 50% of matches in openssl). I hope to address this and the icmp/icmp/OR case in another patch.
test/CodeGen/X86/lzcnt-zext-cmp.ll
82–101	Yes I meant to add those, will do thanks.

Rebased changes and following Sanjay's comments:

Move down feature declaration, add hyphen, add comment.
Remove hasInterestingUses check.
Use fast-lzcnt in test and add comments.

Looking again at the codegen of openssl without the "hasInterestingUses" constraint, the codegen does not seem worse in terms of speed; only the size is not as good as it could be, but I think that is ok? Something must have gone wrong during my initial testing, I suppose ...

Minor tweak request.

lib/Target/X86/X86InstrInfo.td
837	Move this down to the other fast/slow definitions.
lib/Target/X86/X86Subtarget.cpp
257	Move this down to the other fast/slow variables.
lib/Target/X86/X86Subtarget.h
428	Move this down to the other fast/slow functions.

Moving down fast-lzcnt feature changes following Simon's comments.

In D23446#516878, @pgousseau wrote:

I wasn't expecting a size difference. Can you provide more details about the size and perf changes that you see with this change? We may want to gate the transform based on 'optForSize'.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3594–3595	Remove "llvm::"
lib/Target/X86/X86.td
265–266	The description still seems too vague. The key point is that lzcnt must have the same latency and throughput as test/set, right? How about: "If lzcnt has equivalent latency/throughput to most simple integer ops, it can be used to replace test/set sequences."
test/CodeGen/X86/lzcnt-zext-cmp.ll
86–93	lzcnt has a 16-bit variant in the ISA. Is there some reason not to use it here?

Yes sounds like I should disable this if optForSize is enabled.

For example in openssl, out of 89 srl(lzcnt) optimisations, around 30 cases cause larger code, for example I see 25 of those:

3055:	31 ed                	xor    %ebp,%ebp
3057:	ff c2                	inc    %edx
3059:	40 0f 94 c5          	sete   %bpl
305d:	01 e9                	add    %ebp,%ecx

-> 10 bytes

309d:	ff c2                	inc    %edx
309f:	f3 0f bd ea          	lzcnt  %edx,%ebp
30a3:	c1 ed 05             	shr    $0x5,%ebp
30a6:	01 e9                	add    %ebp,%ecx

-> 11 bytes

But this should not affect performance.

Luckily openssl's libcrypto total size remains smaller, as the remaining matches result in fewer, equally fast instructions.
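
As a cross-check of the size comparison, the sketch below sums the encoded instruction lengths read off the two hex dumps quoted above. The array and function names are hypothetical, and the lengths are simply the byte counts visible in each listing line.

```cpp
#include <cassert>
#include <cstddef>

// Encoded length in bytes of each instruction, read from the hex dumps above
// (for example "40 0f 94 c5" is 4 bytes).
static const std::size_t kSetccSeq[] = {2, 2, 4, 2}; // xor, inc, sete, add
static const std::size_t kLzcntSeq[] = {2, 4, 3, 2}; // inc, lzcnt, shr, add

// Sums the lengths of one instruction sequence.
template <std::size_t N>
static std::size_t totalBytes(const std::size_t (&lens)[N]) {
  std::size_t sum = 0;
  for (std::size_t len : lens)
    sum += len;
  return sum;
}
```

With the bytes as listed, the xor/inc/sete/add sequence totals 10 bytes and the inc/lzcnt/shr/add sequence totals 11, which also matches the address ranges shown (0x3055 to 0x305f and 0x309d to 0x30a8).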

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3594–3595	Sounds good, will do.
lib/Target/X86/X86.td
265–266	Sounds good thanks, will do.
test/CodeGen/X86/lzcnt-zext-cmp.ll
86–93	I disabled the 16-bit case because this seems to lead to bigger code in general. For this test case we would get:

	lzcntw %di, %ax
	andl $16, %eax
	shrl $4, %eax
	retq

instead of:

	xorl %eax, %eax
	testw %di, %di
	sete %al
	retq

which seems to be bigger code for the same result.
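
To make the 16-bit discussion concrete, the sketch below models the quoted `lzcntw`/`andl $16`/`shrl $4` sequence and, as an alternative, the zero-extend-then-count form used by the generic lowering, and checks that both agree with `x == 0` for every 16-bit input. All helper names are hypothetical and `lzcnt` is a portable stand-in for the instruction. In this model the `& 16` mask is redundant because the count never exceeds 16; it is kept only to mirror the listing.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for LZCNT at a given operand width (16 or 32 bits):
// counts leading zero bits and returns `width` for a zero input.
static unsigned lzcnt(std::uint32_t x, unsigned width) {
  unsigned n = 0;
  for (int bit = static_cast<int>(width) - 1;
       bit >= 0 && ((x >> bit) & 1u) == 0; --bit)
    ++n;
  return n;
}

// 16-bit form from the comment above: lzcntw, andl $16, shrl $4.
static std::uint32_t isZero16Direct(std::uint16_t x) {
  return (lzcnt(x, 16) & 16u) >> 4;
}

// Alternative: zero-extend to 32 bits, then count and shift by log2(32).
static std::uint32_t isZero16ViaZext(std::uint16_t x) {
  return lzcnt(static_cast<std::uint32_t>(x), 32) >> 5;
}

// Exhaustively compares both forms against the plain comparison.
static bool formsAgreeForAll16BitInputs() {
  for (std::uint32_t v = 0; v <= 0xFFFFu; ++v) {
    const std::uint16_t x = static_cast<std::uint16_t>(v);
    const std::uint32_t expected = (x == 0) ? 1u : 0u;
    if (isZero16Direct(x) != expected || isZero16ViaZext(x) != expected)
      return false;
  }
  return true;
}
```

The check says nothing about code size or speed; it only shows that either form computes the same predicate.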

Following Sanjay's comments:

Remove llvm namespace
Rewrite feature's comment
Disable transform if optForSize is true

Please can you add some tests with optsize enabled?

In D23446#519225, @pgousseau wrote:

Disable transform if optForSize is true

optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".

Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.

Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

test/CodeGen/X86/lzcnt-zext-cmp.ll
87–94	Oh...yuck. So we really should've done 'xor %eax, %eax' ahead of the lzcnt in that case? Please add a note about why we don't do 16-bit in the code comments and also here in the test case.

Add tests for optForSize following Simon's comments.

In D23446#519626, @spatel wrote:

In D23446#519225, @pgousseau wrote:

Disable transform if optForSize is true

optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".

Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.

Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text sizes from openssl with and without the change.

12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions are saved:

f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations for which the function is run.
The functions contain a block of 100 inline assembly sequences corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from having fewer instructions, all of which have the same latency/throughput of 1/0.5.

Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Let me know if more details are required.

test/CodeGen/X86/lzcnt-zext-cmp.ll
88–95	Sounds good, will do thanks.

Add comments for the 16-bit case following Sanjay's comment.

In D23446#519879, @pgousseau wrote:
With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text size from openssl with and without the change.
12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled
Here is an example from libcrypto where 2 instructions is saved:
f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
The functions contains a 100 block a inline assembly corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from the presence of less instruction which all have the same latency/throughput 1/0.5.
Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?

@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?
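
The percentages quoted in this exchange depend on which ratio is taken, so the sketch below spells out both. The function names are hypothetical; the inputs are the `BM_f1_*/8` and `BM_f2_*/8/8` timings from the benchmark table.

```cpp
#include <cassert>
#include <cmath>

// Throughput gain: how much more work is done per unit time, old/new - 1.
static double throughputGainPct(double oldNs, double newNs) {
  return (oldNs / newNs - 1.0) * 100.0;
}

// Time reduction: how much less time one run takes, 1 - new/old.
static double timeReductionPct(double oldNs, double newNs) {
  return (1.0 - newNs / oldNs) * 100.0;
}
```

For 784 ns against 585 ns the throughput gain is roughly 34% and the time reduction roughly 25%; for 13573 ns against 9126 ns the throughput gain is roughly 49%. Figures of "34%" and "~25%" for the f1 case therefore describe the same measurement.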

In D23446#520684, @spatel wrote:
In D23446#519879, @pgousseau wrote:
With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text size from openssl with and without the change.
12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled
Here is an example from libcrypto where 2 instructions is saved:
f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d
This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.

Sounds good, will remove the optForSize.

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
The functions contains a 100 block a inline assembly corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from the presence of less instruction which all have the same latency/throughput 1/0.5.
Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1
Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?

Yes, this benchmark seems to show that kind of improvement; for example, BM_f1_new/8 is ~25% faster than BM_f1_old/8.
However, I have now rerun the SPEC h264ref benchmark and, unfortunately, I am seeing a small (less than 1%) but consistent performance degradation that I wasn't seeing with the initial tablegen patch.
I am not sure why the micro-benchmark does not show this. One hypothesis is that the micro-benchmark only compares the 32-bit "register, register" variant of the change, so I may need to restrict the transformation to this pattern.
Alternatively, the micro-benchmark is not representative of real performance, in which case this change is probably not worth pursuing. I will come back with the findings.

@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?

Any update on the performance investigations?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3624	These can be replaced with DAG.getZExtOrTrunc(Scc, dl, ExtTy);

In D23446#542525, @RKSimon wrote:

Any update on the performance investigations?

Hi Simon/Sanjay,

Sorry for the delayed follow-up!
I have run more tests and it seems the performance regressions I was observing with SPEC's h264 are now within the noise, so I can't tell whether this patch improves or degrades performance in SPEC's h264 benchmark.
I am more confident that the OR case brings better performance because we will be replacing

48 85 FF                test     rdi,rdi
0F 94 C0                sete     al
48 85 F6                test     rsi,rsi
0F 94 C1                sete     cl
08 C1                   or       cl,al
0F B6 C1                movzx    eax,cl
C3                      ret

by this:

F3 0F BD CE             lzcnt    ecx,esi
F3 0F BD C7             lzcnt    eax,edi
09 C8                   or       eax,ecx
C1 E8 05                shr      eax,5
C3                      ret

My plan now is to make the patch to handle the OR case only, what do you guys think?
Would X86ISelLowering still be the best place if only supporting the OR case?

Thanks!
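
The OR rewrite above can be sanity-checked in plain C++. The sketch below uses the 32-bit form from the second listing and assumes a portable loop as a stand-in for LZCNT; the names `lzcnt32`, `eitherIsZero` and `anyIsZero` are hypothetical. It relies on each count lying in 0..32, so bit 5 of the ORed counts is set exactly when at least one operand is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for LZCNT on a 32-bit operand (returns 32 for zero).
static unsigned lzcnt32(std::uint32_t x) {
  unsigned n = 0;
  for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0;
       mask >>= 1)
    ++n;
  return n;
}

// (a == 0 || b == 0) written as lzcnt, lzcnt, or, shr 5.
static std::uint32_t eitherIsZero(std::uint32_t a, std::uint32_t b) {
  return (lzcnt32(a) | lzcnt32(b)) >> 5;
}

// The same idea for a chain of operands: OR every count, then shift once.
static std::uint32_t anyIsZero(const std::uint32_t *vals, std::size_t n) {
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < n; ++i)
    acc |= lzcnt32(vals[i]);
  return acc >> 5;
}
```

Because a single shift serves the whole chain, each extra operand costs one lzcnt and one or in this model, compared with a test/sete/or triple in the original sequence.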

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3624	Will do thanks!

In D23446#544797, @pgousseau wrote:
I am more confident the OR case brings better performances because we will be replacing
48 85 FF                test     rdi,rdi
0F 94 C0                sete     al
48 85 F6                test     rsi,rsi
0F 94 C1                sete     cl
08 C1                   or       cl,al
0F B6 C1                movzx    eax,cl
C3                      ret
by this:
F3 0F BD CE             lzcnt    ecx,esi
F3 0F BD C7             lzcnt    eax,edi
09 C8                   or       eax,ecx
C1 E8 05                shr      eax,5
C3                      ret
My plan now is to make the patch to handle the OR case only, what do you guys think?
Would X86ISelLowering still be the best place if only supporting the OR case?

The OR case certainly looks better in isolation (less instructions, less code size). If you are measuring perf improvements from that alone, I think we can be more confident that the transform to lzcnt is the source of that improvement. It's still not clear to me how the micro-benchmark was improved so much for the simpler case.

To match the OR pattern, I think you would either add some code to visitOr() or add tablegen patterns if it is possible to match the DAG nodes that way.

Hi Simon/Sanjay,

Following up on the previous comments, this patch contains:

Use of DAG.getZExtOrTrunc as per Simon's comment.
Removed support for the simple case.
Added handling of OR based patterns.
Added a case for SRL nodes in 'isTypeDesirableForOp()' so as to favor the 32-bit encoding when targeting X86.
Added support for multiple OR patterns, e.g. (a1||a2||a3||a4||a5)

Let me know what you think,

Thanks!

Some quick remarks - I'll do more of a review later after others have had a chance to look at it.

lib/Target/X86/X86ISelLowering.cpp
29016	Should we return early if !Subtarget.isCtlzFast()?
29022	Are we letting vector types through here?
31797	Comment this. Should this be part of a separate patch with its own tests? It's fine to leave it here if it's only exposed by this patch, but it should have a comment either way.
test/CodeGen/X86/lzcnt-zext-cmp.ll
7	Test single input version to make sure it isn't being used.

pgousseau added inline comments. Oct 5 2016, 10:05 AM

lib/Target/X86/X86ISelLowering.cpp
29016	Yes that makes sense.
29022	Yes it seems that way, will have to restrict to 32-bit and 64-bit integers, thanks for spotting this.
31797	Yes would make sense to have a separate patch, will try to find an associated test case.
test/CodeGen/X86/lzcnt-zext-cmp.ll
7	Makes sense.

Following Simon's comments:

Add an early return in case isCtlzFast is false.
Add a test checking that no transformation occurs on single-input cases.
Prevent the transformation on 128-bit cases: int foo(int128_t a, int128_t b) { return a || b;} as this pessimized the codegen.
- This requires constraining the pattern to the X86 form of an equality comparison with 0.
- This removes the need from the earlier patch to modify isTypeDesirableForOp().

Possibly better test names than barXXXX, other than that I'm happy with this - @spatel ?

spatel added inline comments. Oct 13 2016, 9:01 AM

lib/Target/X86/X86ISelLowering.cpp
29037–29042	The pattern must always begin with zext. Is there some reason not to start in combineZext() rather than combineOr()?

Following Simon and Sanjay comments:

Renamed tests to 'test_zext_cmpXX'
Start pattern matching from combineZext instead of combineOR.

Also added missing "hasOneUse" checks to OR nodes.
Removed one unneeded check for "isSetCCCandidate"

Let me know what you guys think,

Thanks,

Pierre

LGTM. See inline for nits.

lib/Target/X86/X86ISelLowering.cpp
4191	Could remove or assert hasLZCNT()?
29105–29107	No need for braces here.
test/CodeGen/X86/lzcnt-zext-cmp.ll
202	Update name: "bar6"

This revision is now accepted and ready to land. Oct 14 2016, 8:04 AM

pgousseau added inline comments. Oct 14 2016, 8:59 AM

lib/Target/X86/X86ISelLowering.cpp
4191	Yes makes sense, will remove before commit.
29105–29107	Will remove before commit.
test/CodeGen/X86/lzcnt-zext-cmp.ll
202	Thanks for spotting this, will fix before commit.

Closed by commit rL284248: [X86] Take advantage of the lzcnt instruction on btver2 architectures when… (authored by pgousseau). Oct 14 2016, 9:50 AM

This revision was automatically updated to reflect the committed changes.

RKSimon mentioned this in D22038: [X86] Transform zext+seteq+cmp into shr+lzcnt on btver2 architecture. Oct 17 2016, 7:48 AM

Revision Contents

Path	Size
include/llvm/Target/TargetLowering.h	3 lines
lib/CodeGen/SelectionDAG/TargetLowering.cpp	15 lines
lib/Target/PowerPC/PPCISelLowering.cpp	2 lines
lib/Target/X86/ (six files, names not preserved in this archive)	7 lines, 2 lines, 86 lines, 1 line, 4 lines, 1 line
test/CodeGen/X86/lzcnt-zext-cmp.ll	283 lines

Diff 73497

include/llvm/Target/TargetLowering.h

 public:
 [...]
   /// Lower TLS global address SDNode for target independent emulated TLS model.
   virtual SDValue LowerToTLSEmulatedModel(const GlobalAddressSDNode *GA,
                                           SelectionDAG &DAG) const;

   // seteq(x, 0) -> truncate(srl(ctlz(zext(x)), log2(#bits)))
   // If we're comparing for equality to zero and isCtlzFast is true, expose the
   // fact that this can be implemented as a ctlz/srl pair, so that the dag
   // combiner can fold the new nodes.
-  SDValue lowerCmpEqZeroToCtlzSrl(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerCmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy,
+                                  SelectionDAG &DAG) const;

 private:
   SDValue simplifySetCCWithAnd(EVT VT, SDValue N0, SDValue N1,
                                ISD::CondCode Cond, DAGCombinerInfo &DCI,
                                const SDLoc &DL) const;
 };

 /// Given an LLVM IR type and return type attributes, compute the return value
 [...]

lib/CodeGen/SelectionDAG/TargetLowering.cpp

 SDValue TargetLowering::LowerToTLSEmulatedModel(const GlobalAddressSDNode *GA,
 [...]
   MFI.setAdjustsStack(true); // Is this only for X86 target?
   MFI.setHasCalls(true);

   assert((GA->getOffset() == 0) &&
          "Emulated TLS must have zero offset in GlobalAddressSDNode");
   return CallResult.first;
 }

-SDValue TargetLowering::lowerCmpEqZeroToCtlzSrl(SDValue Op,
+SDValue TargetLowering::lowerCmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy,
                                                 SelectionDAG &DAG) const {
   assert((Op->getOpcode() == ISD::SETCC) && "Input has to be a SETCC node.");
   if (!isCtlzFast())
     return SDValue();
   ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(2))->get();
   SDLoc dl(Op);
   if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
     if (C->isNullValue() && CC == ISD::SETEQ) {
       EVT VT = Op.getOperand(0).getValueType();
       SDValue Zext = Op.getOperand(0);
       if (VT.bitsLT(MVT::i32)) {
         VT = MVT::i32;
         Zext = DAG.getNode(ISD::ZERO_EXTEND, dl, VT, Op.getOperand(0));
       }
       unsigned Log2b = Log2_32(VT.getSizeInBits());
       SDValue Clz = DAG.getNode(ISD::CTLZ, dl, VT, Zext);
-      SDValue Scc = DAG.getNode(ISD::SRL, dl, VT, Clz,
+      // The result of the shift is true or false, and on X86, the 32-bit
+      // encoding of shr and lzcnt is more desirable.
+      EVT SccTy = VT;
+      SDValue Trunc = Clz;
+      if (!isTypeDesirableForOp(ISD::SRL, VT) &&
+          isTypeDesirableForOp(ISD::SRL, MVT::i32)) {
+        SccTy = MVT::i32;
+        Trunc = DAG.getNode(ISD::TRUNCATE, dl, SccTy, Clz);
+      }
+      SDValue Scc = DAG.getNode(ISD::SRL, dl, SccTy, Trunc,
                                 DAG.getConstant(Log2b, dl, MVT::i32));
-      return DAG.getNode(ISD::TRUNCATE, dl, MVT::i32, Scc);
+      return DAG.getZExtOrTrunc(Scc, dl, ExtTy);
     }
   }
   return SDValue();
 }

lib/Target/PowerPC/PPCISelLowering.cpp

   if (Op.getValueType() == MVT::v2i64) {
   [...]
     // We handle most of these in the usual way.
     return Op;
   }

   // If we're comparing for equality to zero, expose the fact that this is
   // implemented as a ctlz/srl pair on ppc, so that the dag combiner can
   // fold the new nodes.
-  if (SDValue V = lowerCmpEqZeroToCtlzSrl(Op, DAG))
+  if (SDValue V = lowerCmpEqZeroToCtlzSrl(Op, MVT::i32, DAG))
     return V;

   if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op.getOperand(1))) {
     // Leave comparisons against 0 and -1 alone for now, since they're usually
     // optimized. FIXME: revisit this when we can custom lower all setcc
     // optimizations.
     if (C->isAllOnesValue() || C->isNullValue())
       return SDValue();
 [...]

lib/Target/X86/X86.td

Show First 20 Lines • Show All 173 Lines • ▼ Show 20 Lines	def FeatureRDRAND : SubtargetFeature<"rdrnd", "HasRDRAND", "true",
"Support RDRAND instruction">;		"Support RDRAND instruction">;
def FeatureF16C : SubtargetFeature<"f16c", "HasF16C", "true",		def FeatureF16C : SubtargetFeature<"f16c", "HasF16C", "true",
"Support 16-bit floating point conversion instructions",		"Support 16-bit floating point conversion instructions",
[FeatureAVX]>;		[FeatureAVX]>;
def FeatureFSGSBase : SubtargetFeature<"fsgsbase", "HasFSGSBase", "true",		def FeatureFSGSBase : SubtargetFeature<"fsgsbase", "HasFSGSBase", "true",
"Support FS/GS Base instructions">;		"Support FS/GS Base instructions">;
def FeatureLZCNT : SubtargetFeature<"lzcnt", "HasLZCNT", "true",		def FeatureLZCNT : SubtargetFeature<"lzcnt", "HasLZCNT", "true",
"Support LZCNT instruction">;		"Support LZCNT instruction">;
def FeatureBMI : SubtargetFeature<"bmi", "HasBMI", "true",		def FeatureBMI : SubtargetFeature<"bmi", "HasBMI", "true",
"Support BMI instructions">;		"Support BMI instructions">;
def FeatureBMI2 : SubtargetFeature<"bmi2", "HasBMI2", "true",		def FeatureBMI2 : SubtargetFeature<"bmi2", "HasBMI2", "true",
		spatelUnsubmitted Not Done Reply Inline Actions I would move this down with the other 'fake' features (ie, the other fast/slow attributes). Someday, we may come up with a better way to distinguish performance "features" from architectural ones. It would also be good to explain exactly what we mean by "fast" in this context. Finally, use a hyphen to make this more readable: "fast-lzcnt". spatel: I would move this down with the other 'fake' features (ie, the other fast/slow attributes).
		pgousseauAuthorUnsubmitted Not Done Reply Inline Actions Sounds good will do. pgousseau: Sounds good will do.
"Support BMI2 instructions">;		"Support BMI2 instructions">;
def FeatureRTM : SubtargetFeature<"rtm", "HasRTM", "true",		def FeatureRTM : SubtargetFeature<"rtm", "HasRTM", "true",
"Support RTM instructions">;		"Support RTM instructions">;
def FeatureHLE : SubtargetFeature<"hle", "HasHLE", "true",		def FeatureHLE : SubtargetFeature<"hle", "HasHLE", "true",
"Support HLE">;		"Support HLE">;
def FeatureADX : SubtargetFeature<"adx", "HasADX", "true",		def FeatureADX : SubtargetFeature<"adx", "HasADX", "true",
"Support ADX instructions">;		"Support ADX instructions">;
def FeatureSHA : SubtargetFeature<"sha", "HasSHA", "true",		def FeatureSHA : SubtargetFeature<"sha", "HasSHA", "true",
▲ Show 20 Lines • Show All 64 Lines • ▼ Show 20 Lines
// But if the code is scalar that probably means that the code has some kind of		// But if the code is scalar that probably means that the code has some kind of
// dependency and we should care more about reducing the latency.		// dependency and we should care more about reducing the latency.
def FeatureFastScalarFSQRT		def FeatureFastScalarFSQRT
: SubtargetFeature<"fast-scalar-fsqrt", "HasFastScalarFSQRT",		: SubtargetFeature<"fast-scalar-fsqrt", "HasFastScalarFSQRT",
"true", "Scalar SQRT is fast (disable Newton-Raphson)">;		"true", "Scalar SQRT is fast (disable Newton-Raphson)">;
def FeatureFastVectorFSQRT		def FeatureFastVectorFSQRT
: SubtargetFeature<"fast-vector-fsqrt", "HasFastVectorFSQRT",		: SubtargetFeature<"fast-vector-fsqrt", "HasFastVectorFSQRT",
"true", "Vector SQRT is fast (disable Newton-Raphson)">;		"true", "Vector SQRT is fast (disable Newton-Raphson)">;
		// If lzcnt has equivalent latency/throughput to most simple integer ops, it can
		// be used to replace test/set sequences.
		spatel: The description still seems too vague. The key point is that lzcnt must have the same latency and throughput as test/set, right? How about: "If lzcnt has equivalent latency/throughput to most simple integer ops, it can be used to replace test/set sequences."
		pgousseau: Sounds good, thanks, will do.
		def FeatureFastLZCNT
		: SubtargetFeature<
		"fast-lzcnt", "HasFastLZCNT", "true",
		"LZCNT instructions are as fast as most simple integer ops">;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// X86 processors supported.		// X86 processors supported.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

include "X86Schedule.td"		include "X86Schedule.td"

def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",		def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
▲ Show 20 Lines • Show All 368 Lines • ▼ Show 20 Lines	def : ProcessorModel<"btver2", BtVer2Model, [
FeatureCMPXCHG16B,		FeatureCMPXCHG16B,
FeaturePRFCHW,		FeaturePRFCHW,
FeatureAES,		FeatureAES,
FeaturePCLMUL,		FeaturePCLMUL,
FeatureBMI,		FeatureBMI,
FeatureF16C,		FeatureF16C,
FeatureMOVBE,		FeatureMOVBE,
FeatureLZCNT,		FeatureLZCNT,
		FeatureFastLZCNT,
FeaturePOPCNT,		FeaturePOPCNT,
FeatureXSAVE,		FeatureXSAVE,
FeatureXSAVEOPT,		FeatureXSAVEOPT,
FeatureSlowSHLD,		FeatureSlowSHLD,
FeatureLAHFSAHF,		FeatureLAHFSAHF,
FeatureFastPartialYMMWrite		FeatureFastPartialYMMWrite
]>;		]>;

▲ Show 20 Lines • Show All 190 Lines • Show Last 20 Lines

lib/Target/X86/X86ISelLowering.h

Show First 20 Lines • Show All 754 Lines • ▼ Show 20 Lines	public:

/// This method returns the name of a target specific DAG node.		/// This method returns the name of a target specific DAG node.
const char *getTargetNodeName(unsigned Opcode) const override;		const char *getTargetNodeName(unsigned Opcode) const override;

bool isCheapToSpeculateCttz() const override;		bool isCheapToSpeculateCttz() const override;

bool isCheapToSpeculateCtlz() const override;		bool isCheapToSpeculateCtlz() const override;

		bool isCtlzFast() const override;

bool hasBitPreservingFPLogic(EVT VT) const override {		bool hasBitPreservingFPLogic(EVT VT) const override {
return VT == MVT::f32 || VT == MVT::f64 || VT.isVector();		return VT == MVT::f32 || VT == MVT::f64 || VT.isVector();
}		}

bool isMultiStoresCheaperThanBitsMerge(SDValue Lo,		bool isMultiStoresCheaperThanBitsMerge(SDValue Lo,
SDValue Hi) const override {		SDValue Hi) const override {
// If the pair to store is a mixture of float and int values, we will		// If the pair to store is a mixture of float and int values, we will
// save two bitwise instructions and one float-to-int instruction and		// save two bitwise instructions and one float-to-int instruction and
▲ Show 20 Lines • Show All 494 Lines • Show Last 20 Lines

lib/Target/X86/X86ISelLowering.cpp


Show First 20 Lines • Show All 4,181 Lines • ▼ Show 20 Lines	bool X86TargetLowering::isCheapToSpeculateCttz() const {
return Subtarget.hasBMI();		return Subtarget.hasBMI();
}		}

bool X86TargetLowering::isCheapToSpeculateCtlz() const {		bool X86TargetLowering::isCheapToSpeculateCtlz() const {
// Speculate ctlz only if we can directly use LZCNT.		// Speculate ctlz only if we can directly use LZCNT.
return Subtarget.hasLZCNT();		return Subtarget.hasLZCNT();
}		}

		bool X86TargetLowering::isCtlzFast() const {
		return Subtarget.hasLZCNT() && Subtarget.hasFastLZCNT();
		spatel: Could remove or assert hasLZCNT()?
		pgousseau: Yes, makes sense, will remove before commit.
		}

bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {		bool X86TargetLowering::hasAndNotCompare(SDValue Y) const {
if (!Subtarget.hasBMI())		if (!Subtarget.hasBMI())
return false;		return false;

// There are only 32-bit and 64-bit forms for 'andn'.		// There are only 32-bit and 64-bit forms for 'andn'.
EVT VT = Y.getValueType();		EVT VT = Y.getValueType();
if (VT != MVT::i32 && VT != MVT::i64)		if (VT != MVT::i32 && VT != MVT::i64)
return false;		return false;
▲ Show 20 Lines • Show All 24,794 Lines • ▼ Show 20 Lines	static SDValue combineLogicBlendIntoPBLENDV(SDNode *N, SelectionDAG &DAG,

X = DAG.getBitcast(BlendVT, X);		X = DAG.getBitcast(BlendVT, X);
Y = DAG.getBitcast(BlendVT, Y);		Y = DAG.getBitcast(BlendVT, Y);
Mask = DAG.getBitcast(BlendVT, Mask);		Mask = DAG.getBitcast(BlendVT, Mask);
Mask = DAG.getNode(ISD::VSELECT, DL, BlendVT, Mask, Y, X);		Mask = DAG.getNode(ISD::VSELECT, DL, BlendVT, Mask, Y, X);
return DAG.getBitcast(VT, Mask);		return DAG.getBitcast(VT, Mask);
}		}

		// Try to transform:
		// zext(or(setcc (x, 0, eq), setcc (y, 0, eq))
		// into:
		// srl(or(ctlz(x), ctlz(y)), log2(bitsize(x))
		// Will also attempt to match more generic cases, eg:
		// zext(or(or(setcc, setcc), setcc))
		// Only applies if the target supports the FastLZCNT feature.
		static SDValue combineOrCmpEqZeroToCtlzSrl(SDNode *N, SelectionDAG &DAG,
		TargetLowering::DAGCombinerInfo &DCI,
		const X86Subtarget &Subtarget) {
		if (DCI.isBeforeLegalize())
		return SDValue();

		RKSimon: Should we return early if !Subtarget.isCtlzFast()?
		pgousseau: Yes, that makes sense.
		// Check the OR user is a zero extend and that it is extending to 32-bit or
		// more. The code generated by srl(ctlz) for 16-bit or less variants of the
		// pattern would require extra instructions to clear the upper bits.
		if (!N->hasOneUse() || !(N->use_begin()->getOpcode() == ISD::ZERO_EXTEND) ||
		!N->use_begin()->getSimpleValueType(0).bitsGE(MVT::i32))
		return SDValue();
		RKSimon: Are we letting vector types through here?
		pgousseau: Yes, it seems that way. I will have to restrict this to 32-bit and 64-bit integers; thanks for spotting this.

		auto isSetCCCandidate = [](SDValue N) {
		return N->getOpcode() == ISD::SETCC && N->hasOneUse() &&
		N->getOperand(0).getValueType().bitsGE(MVT::i32);
		};

		SDNode *OR = N;
		SDValue LHS = OR->getOperand(0);
		SDValue RHS = OR->getOperand(1);

		// Save nodes matching or(or, setcc).
		SmallVector<SDNode *, 2> ORNodes;
		while (((LHS->getOpcode() == ISD::OR && isSetCCCandidate(RHS)) ||
		(RHS.getOpcode() == ISD::OR && isSetCCCandidate(LHS)))) {
		ORNodes.push_back(OR);
		OR = (LHS->getOpcode() == ISD::OR) ? LHS.getNode() : RHS.getNode();
		LHS = OR->getOperand(0);
		RHS = OR->getOperand(1);
		}

		spatel: The pattern must always begin with zext. Is there some reason not to start in combineZext() rather than combineOr()?
		// The last OR node should match or(setcc, setcc).
		if (!(isSetCCCandidate(LHS) && isSetCCCandidate(RHS)) ||
		OR->getOpcode() != ISD::OR)
		return SDValue();

		// We have a or(setcc, setcc) pattern, try to lower it to
		// or(srl(ctlz),srl(ctlz)). The dag combiner can then fold it into:
		// srl(or(ctlz, ctlz)).
		EVT VT = N->getValueType(0);
		SDValue NewLHS =
		Subtarget.getTargetLowering()->lowerCmpEqZeroToCtlzSrl(LHS, VT, DAG);
		SDValue Ret, NewRHS;
		if (NewLHS && (NewRHS = Subtarget.getTargetLowering()->lowerCmpEqZeroToCtlzSrl(
		RHS, VT, DAG)))
		Ret = DAG.getNode(ISD::OR, SDLoc(OR), VT, NewLHS, NewRHS);

		if (!Ret)
		return SDValue();

		// Try to lower nodes matching the or(or, setcc) pattern.
		while (ORNodes.size() > 0) {
		OR = ORNodes.pop_back_val();
		LHS = OR->getOperand(0);
		RHS = OR->getOperand(1);
		// Swap rhs with lhs to match or(setcc, or).
		if (RHS->getOpcode() == ISD::OR && isSetCCCandidate(LHS))
		std::swap(LHS, RHS);
		EVT VT = OR->getValueType(0);
		SDValue NewRHS =
		Subtarget.getTargetLowering()->lowerCmpEqZeroToCtlzSrl(RHS, VT, DAG);
		if (!NewRHS)
		return SDValue();
		Ret = DAG.getNode(ISD::OR, SDLoc(N), VT, Ret, NewRHS);
		}

		return Ret;
		}
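The comment on combineOrCmpEqZeroToCtlzSrl relies on an arithmetic identity that can be checked outside LLVM. What follows is a minimal standalone C++ sketch, not code from the patch: the helper names (ctlz32, ctlz64, orCmpEqZero32, orCtlzSrl32, and so on) are hypothetical, and the portable loops stand in for the LZCNT instruction, assuming its behaviour of returning the operand width for a zero input.

```cpp
#include <cassert>
#include <cstdint>

// Count leading zeros with LZCNT-like semantics: a zero input yields the
// operand width (32 or 64). Hypothetical helpers, for illustration only.
constexpr uint32_t ctlz32(uint32_t X) {
  uint32_t N = 0;
  for (uint32_t Bit = 1u << 31; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

constexpr uint64_t ctlz64(uint64_t X) {
  uint64_t N = 0;
  for (uint64_t Bit = 1ull << 63; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// Reference form: zext(or(setcc(x, 0, eq), setcc(y, 0, eq))).
constexpr uint32_t orCmpEqZero32(uint32_t X, uint32_t Y) {
  return (X == 0 || Y == 0) ? 1u : 0u;
}
constexpr uint64_t orCmpEqZero64(uint64_t X, uint64_t Y) {
  return (X == 0 || Y == 0) ? 1ull : 0ull;
}

// Transformed form: srl(or(ctlz(x), ctlz(y)), log2(bitsize(x))).
// ctlz returns the bit width only for a zero input, so bit 5 (or bit 6 for
// 64-bit operands) of the OR is set exactly when at least one input is zero.
constexpr uint32_t orCtlzSrl32(uint32_t X, uint32_t Y) {
  return (ctlz32(X) | ctlz32(Y)) >> 5;
}
constexpr uint64_t orCtlzSrl64(uint64_t X, uint64_t Y) {
  return (ctlz64(X) | ctlz64(Y)) >> 6;
}

// Compare both forms on a handful of boundary values.
inline void checkIdentity() {
  const uint32_t Samples32[] = {0u,          1u,          2u,         0x8000u,
                                0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
  for (uint32_t X : Samples32)
    for (uint32_t Y : Samples32)
      assert(orCmpEqZero32(X, Y) == orCtlzSrl32(X, Y));

  const uint64_t Samples64[] = {0ull, 1ull, 0x100000000ull,
                                0x8000000000000000ull, ~0ull};
  for (uint64_t X : Samples64)
    for (uint64_t Y : Samples64)
      assert(orCmpEqZero64(X, Y) == orCtlzSrl64(X, Y));
}
```

The shift amounts 5 and 6 in this sketch correspond to the `shrl $5` and `shrl $6` instructions in the expected output of the test file.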

static SDValue combineOr(SDNode *N, SelectionDAG &DAG,		static SDValue combineOr(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
		if (SDValue R = combineOrCmpEqZeroToCtlzSrl(N, DAG, DCI, Subtarget))
		return R;

if (DCI.isBeforeLegalizeOps())		if (DCI.isBeforeLegalizeOps())
return SDValue();		return SDValue();

if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))		if (SDValue R = combineCompareEqual(N, DAG, DCI, Subtarget))
return R;		return R;

if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, Subtarget))		if (SDValue FPLogic = convertIntLogicToFPLogic(N, DAG, Subtarget))
return FPLogic;		return FPLogic;

if (SDValue R = combineLogicBlendIntoPBLENDV(N, DAG, Subtarget))		if (SDValue R = combineLogicBlendIntoPBLENDV(N, DAG, Subtarget))
return R;		return R;

SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)		if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)
return SDValue();		return SDValue();

// fold (or (x << c) | (y >> (64 - c))) ==> (shld64 x, y, c)		// fold (or (x << c) | (y >> (64 - c))) ==> (shld64 x, y, c)
bool OptForSize = DAG.getMachineFunction().getFunction()->optForSize();		bool OptForSize = DAG.getMachineFunction().getFunction()->optForSize();
		spatel: No need for braces here.
		pgousseau: Will remove before commit.

// SHLD/SHRD instructions have lower register pressure, but on some		// SHLD/SHRD instructions have lower register pressure, but on some
// platforms they have higher latency than the equivalent		// platforms they have higher latency than the equivalent
// series of shifts/or that would otherwise be generated.		// series of shifts/or that would otherwise be generated.
// Don't fold (or (x << c) | (y >> (64 - c))) if SHLD/SHRD instructions		// Don't fold (or (x << c) | (y >> (64 - c))) if SHLD/SHRD instructions
// have higher latencies and we are not optimizing for size.		// have higher latencies and we are not optimizing for size.
if (!OptForSize && Subtarget.isSHLDSlow())		if (!OptForSize && Subtarget.isSHLDSlow())
return SDValue();		return SDValue();
▲ Show 20 Lines • Show All 1,971 Lines • ▼ Show 20 Lines	static SDValue combineZext(SDNode *N, SelectionDAG &DAG,

if (VT.is256BitVector())		if (VT.is256BitVector())
if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))		if (SDValue R = WidenMaskArithmetic(N, DAG, DCI, Subtarget))
return R;		return R;

if (SDValue DivRem8 = getDivRem8(N, DAG))		if (SDValue DivRem8 = getDivRem8(N, DAG))
return DivRem8;		return DivRem8;

if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))		if (SDValue NewAdd = promoteExtBeforeAdd(N, DAG, Subtarget))
return NewAdd;		return NewAdd;

return SDValue();		return SDValue();
}		}

/// Optimize x == -y --> x+y == 0		/// Optimize x == -y --> x+y == 0
/// x != -y --> x+y != 0		/// x != -y --> x+y != 0
static SDValue combineSetCC(SDNode *N, SelectionDAG &DAG,		static SDValue combineSetCC(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();		ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();
SDValue LHS = N->getOperand(0);		SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);		SDValue RHS = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDLoc DL(N);		SDLoc DL(N);

if ((CC == ISD::SETNE || CC == ISD::SETEQ) && LHS.getOpcode() == ISD::SUB)		if ((CC == ISD::SETNE || CC == ISD::SETEQ) && LHS.getOpcode() == ISD::SUB)
		spatel: I don't understand the need for this check, so at the least it needs a code comment to explain why it is here. Related: if we're matching the pattern starting from a zext, doesn't that miss the icmp/icmp/or patterns that you were originally hoping to optimize in D22038?
		pgousseau: Sounds good, I will think of a better comment. I added this check to be conservative for now, as I noticed several cases of worse codegen (around 50% of matches in openssl). I hope to address this and the icmp/icmp/or case in another patch.
if (isNullConstant(LHS.getOperand(0)) && LHS.hasOneUse()) {		if (isNullConstant(LHS.getOperand(0)) && LHS.hasOneUse()) {
SDValue addV = DAG.getNode(ISD::ADD, DL, LHS.getValueType(), RHS,		SDValue addV = DAG.getNode(ISD::ADD, DL, LHS.getValueType(), RHS,
LHS.getOperand(1));		LHS.getOperand(1));
return DAG.getSetCC(DL, N->getValueType(0), addV,		return DAG.getSetCC(DL, N->getValueType(0), addV,
DAG.getConstant(0, DL, addV.getValueType()), CC);		DAG.getConstant(0, DL, addV.getValueType()), CC);
}		}
if ((CC == ISD::SETNE || CC == ISD::SETEQ) && RHS.getOpcode() == ISD::SUB)		if ((CC == ISD::SETNE || CC == ISD::SETEQ) && RHS.getOpcode() == ISD::SUB)
if (isNullConstant(RHS.getOperand(0)) && RHS.hasOneUse()) {		if (isNullConstant(RHS.getOperand(0)) && RHS.hasOneUse()) {
▲ Show 20 Lines • Show All 668 Lines • ▼ Show 20 Lines

/// Return true if the target has native support for the specified value type		/// Return true if the target has native support for the specified value type
/// and it is 'desirable' to use the type for the given node type. e.g. On x86		/// and it is 'desirable' to use the type for the given node type. e.g. On x86
/// i16 is legal, but undesirable since i16 instruction encodings are longer and		/// i16 is legal, but undesirable since i16 instruction encodings are longer and
/// some i16 instructions are slow.		/// some i16 instructions are slow.
bool X86TargetLowering::isTypeDesirableForOp(unsigned Opc, EVT VT) const {		bool X86TargetLowering::isTypeDesirableForOp(unsigned Opc, EVT VT) const {
if (!isTypeLegal(VT))		if (!isTypeLegal(VT))
return false;		return false;
		if(Opc == ISD::SRL && VT != MVT::i32 && VT != MVT::i8)
		return false;
		RKSimon: Comment this. Should this be part of a separate patch with its own tests? It's fine to leave it here if it's only exposed by this patch, but it should have a comment either way.
		pgousseau: Yes, it would make sense to have a separate patch; I will try to find an associated test case.
if (VT != MVT::i16)		if (VT != MVT::i16)
return true;		return true;

switch (Opc) {		switch (Opc) {
default:		default:
return true;		return true;
case ISD::LOAD:		case ISD::LOAD:
case ISD::SIGN_EXTEND:		case ISD::SIGN_EXTEND:
▲ Show 20 Lines • Show All 840 Lines • Show Last 20 Lines

lib/Target/X86/X86InstrInfo.td

	Show First 20 Lines • Show All 828 Lines • ▼ Show 20 Lines
	def HasFMA4 : Predicate<"Subtarget->hasFMA4()">;			def HasFMA4 : Predicate<"Subtarget->hasFMA4()">;
	def HasXOP : Predicate<"Subtarget->hasXOP()">;			def HasXOP : Predicate<"Subtarget->hasXOP()">;
	def HasTBM : Predicate<"Subtarget->hasTBM()">;			def HasTBM : Predicate<"Subtarget->hasTBM()">;
	def HasMOVBE : Predicate<"Subtarget->hasMOVBE()">;			def HasMOVBE : Predicate<"Subtarget->hasMOVBE()">;
	def HasRDRAND : Predicate<"Subtarget->hasRDRAND()">;			def HasRDRAND : Predicate<"Subtarget->hasRDRAND()">;
	def HasF16C : Predicate<"Subtarget->hasF16C()">;			def HasF16C : Predicate<"Subtarget->hasF16C()">;
	def HasFSGSBase : Predicate<"Subtarget->hasFSGSBase()">;			def HasFSGSBase : Predicate<"Subtarget->hasFSGSBase()">;
	def HasLZCNT : Predicate<"Subtarget->hasLZCNT()">;			def HasLZCNT : Predicate<"Subtarget->hasLZCNT()">;
	def HasBMI : Predicate<"Subtarget->hasBMI()">;			def HasBMI : Predicate<"Subtarget->hasBMI()">;
				RKSimon: Move this down to the other fast/slow definitions.
	def HasBMI2 : Predicate<"Subtarget->hasBMI2()">;			def HasBMI2 : Predicate<"Subtarget->hasBMI2()">;
	def HasVBMI : Predicate<"Subtarget->hasVBMI()">,			def HasVBMI : Predicate<"Subtarget->hasVBMI()">,
	AssemblerPredicate<"FeatureVBMI", "AVX-512 VBMI ISA">;			AssemblerPredicate<"FeatureVBMI", "AVX-512 VBMI ISA">;
	def HasIFMA : Predicate<"Subtarget->hasIFMA()">,			def HasIFMA : Predicate<"Subtarget->hasIFMA()">,
	AssemblerPredicate<"FeatureIFMA", "AVX-512 IFMA ISA">;			AssemblerPredicate<"FeatureIFMA", "AVX-512 IFMA ISA">;
	def HasRTM : Predicate<"Subtarget->hasRTM()">;			def HasRTM : Predicate<"Subtarget->hasRTM()">;
	def HasHLE : Predicate<"Subtarget->hasHLE()">;			def HasHLE : Predicate<"Subtarget->hasHLE()">;
	def HasTSX : Predicate<"Subtarget->hasRTM() || Subtarget->hasHLE()">;			def HasTSX : Predicate<"Subtarget->hasRTM() || Subtarget->hasHLE()">;
	Show All 38 Lines
	def OptForSize : Predicate<"OptForSize">;			def OptForSize : Predicate<"OptForSize">;
	def OptForMinSize : Predicate<"OptForMinSize">;			def OptForMinSize : Predicate<"OptForMinSize">;
	def OptForSpeed : Predicate<"!OptForSize">;			def OptForSpeed : Predicate<"!OptForSize">;
	def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;			def FastBTMem : Predicate<"!Subtarget->isBTMemSlow()">;
	def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;			def CallImmAddr : Predicate<"Subtarget->isLegalToCallImmediateAddr()">;
	def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;			def FavorMemIndirectCall : Predicate<"!Subtarget->callRegIndirect()">;
	def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;			def NotSlowIncDec : Predicate<"!Subtarget->slowIncDec()">;
	def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;			def HasFastMem32 : Predicate<"!Subtarget->isUnalignedMem32Slow()">;
				def HasFastLZCNT : Predicate<"Subtarget->hasFastLZCNT()">;
	def HasMFence : Predicate<"Subtarget->hasMFence()">;			def HasMFence : Predicate<"Subtarget->hasMFence()">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// X86 Instruction Format Definitions.			// X86 Instruction Format Definitions.
	//			//

	include "X86InstrFormats.td"			include "X86InstrFormats.td"

	▲ Show 20 Lines • Show All 2,208 Lines • Show Last 20 Lines

lib/Target/X86/X86Subtarget.h

Show First 20 Lines • Show All 209 Lines • ▼ Show 20 Lines	protected:
/// True if 8-bit divisions are significantly faster than		/// True if 8-bit divisions are significantly faster than
/// 32-bit divisions and should be used when possible.		/// 32-bit divisions and should be used when possible.
bool HasSlowDivide32;		bool HasSlowDivide32;

/// True if 16-bit divides are significantly faster than		/// True if 16-bit divides are significantly faster than
/// 64-bit divisions and should be used when possible.		/// 64-bit divisions and should be used when possible.
bool HasSlowDivide64;		bool HasSlowDivide64;

		/// True if LZCNT instruction is fast.
		bool HasFastLZCNT;

/// True if the short functions should be padded to prevent		/// True if the short functions should be padded to prevent
/// a stall when returning too early.		/// a stall when returning too early.
bool PadShortFunctions;		bool PadShortFunctions;

/// True if the Calls with memory reference should be converted		/// True if the Calls with memory reference should be converted
/// to a register-based indirect call.		/// to a register-based indirect call.
bool CallRegIndirect;		bool CallRegIndirect;

▲ Show 20 Lines • Show All 191 Lines • ▼ Show 20 Lines	public:
bool hasAnyFMA() const { return hasFMA() || hasFMA4() || hasAVX512(); }		bool hasAnyFMA() const { return hasFMA() || hasFMA4() || hasAVX512(); }
bool hasXOP() const { return HasXOP; }		bool hasXOP() const { return HasXOP; }
bool hasTBM() const { return HasTBM; }		bool hasTBM() const { return HasTBM; }
bool hasMOVBE() const { return HasMOVBE; }		bool hasMOVBE() const { return HasMOVBE; }
bool hasRDRAND() const { return HasRDRAND; }		bool hasRDRAND() const { return HasRDRAND; }
bool hasF16C() const { return HasF16C; }		bool hasF16C() const { return HasF16C; }
bool hasFSGSBase() const { return HasFSGSBase; }		bool hasFSGSBase() const { return HasFSGSBase; }
bool hasLZCNT() const { return HasLZCNT; }		bool hasLZCNT() const { return HasLZCNT; }
bool hasBMI() const { return HasBMI; }		bool hasBMI() const { return HasBMI; }
		RKSimon: Move this down to the other fast/slow functions.
bool hasBMI2() const { return HasBMI2; }		bool hasBMI2() const { return HasBMI2; }
bool hasVBMI() const { return HasVBMI; }		bool hasVBMI() const { return HasVBMI; }
bool hasIFMA() const { return HasIFMA; }		bool hasIFMA() const { return HasIFMA; }
bool hasRTM() const { return HasRTM; }		bool hasRTM() const { return HasRTM; }
bool hasHLE() const { return HasHLE; }		bool hasHLE() const { return HasHLE; }
bool hasADX() const { return HasADX; }		bool hasADX() const { return HasADX; }
bool hasSHA() const { return HasSHA; }		bool hasSHA() const { return HasSHA; }
bool hasPRFCHW() const { return HasPRFCHW; }		bool hasPRFCHW() const { return HasPRFCHW; }
bool hasRDSEED() const { return HasRDSEED; }		bool hasRDSEED() const { return HasRDSEED; }
bool hasLAHFSAHF() const { return HasLAHFSAHF; }		bool hasLAHFSAHF() const { return HasLAHFSAHF; }
bool hasMWAITX() const { return HasMWAITX; }		bool hasMWAITX() const { return HasMWAITX; }
bool isBTMemSlow() const { return IsBTMemSlow; }		bool isBTMemSlow() const { return IsBTMemSlow; }
bool isSHLDSlow() const { return IsSHLDSlow; }		bool isSHLDSlow() const { return IsSHLDSlow; }
bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }		bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }
bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }		bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }
bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }		bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }
bool hasCmpxchg16b() const { return HasCmpxchg16b; }		bool hasCmpxchg16b() const { return HasCmpxchg16b; }
bool useLeaForSP() const { return UseLeaForSP; }		bool useLeaForSP() const { return UseLeaForSP; }
bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }		bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }
bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }		bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }
bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }		bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }
		bool hasFastLZCNT() const { return HasFastLZCNT; }
bool hasSlowDivide32() const { return HasSlowDivide32; }		bool hasSlowDivide32() const { return HasSlowDivide32; }
bool hasSlowDivide64() const { return HasSlowDivide64; }		bool hasSlowDivide64() const { return HasSlowDivide64; }
bool padShortFunctions() const { return PadShortFunctions; }		bool padShortFunctions() const { return PadShortFunctions; }
bool callRegIndirect() const { return CallRegIndirect; }		bool callRegIndirect() const { return CallRegIndirect; }
bool LEAusesAG() const { return LEAUsesAG; }		bool LEAusesAG() const { return LEAUsesAG; }
bool slowLEA() const { return SlowLEA; }		bool slowLEA() const { return SlowLEA; }
bool slowIncDec() const { return SlowIncDec; }		bool slowIncDec() const { return SlowIncDec; }
bool hasCDI() const { return HasCDI; }		bool hasCDI() const { return HasCDI; }
▲ Show 20 Lines • Show All 155 Lines • Show Last 20 Lines

lib/Target/X86/X86Subtarget.cpp

Show First 20 Lines • Show All 248 Lines • ▼ Show 20 Lines	void X86Subtarget::initializeEnvironment() {
HasFMA4 = false;		HasFMA4 = false;
HasXOP = false;		HasXOP = false;
HasTBM = false;		HasTBM = false;
HasMOVBE = false;		HasMOVBE = false;
HasRDRAND = false;		HasRDRAND = false;
HasF16C = false;		HasF16C = false;
HasFSGSBase = false;		HasFSGSBase = false;
HasLZCNT = false;		HasLZCNT = false;
HasBMI = false;		HasBMI = false;
		RKSimon: Move this down to the other fast/slow variables.
HasBMI2 = false;		HasBMI2 = false;
HasVBMI = false;		HasVBMI = false;
HasIFMA = false;		HasIFMA = false;
HasRTM = false;		HasRTM = false;
HasHLE = false;		HasHLE = false;
HasERI = false;		HasERI = false;
HasCDI = false;		HasCDI = false;
HasPFI = false;		HasPFI = false;
Show All 13 Lines	void X86Subtarget::initializeEnvironment() {
IsUAMem16Slow = false;		IsUAMem16Slow = false;
IsUAMem32Slow = false;		IsUAMem32Slow = false;
HasSSEUnalignedMem = false;		HasSSEUnalignedMem = false;
HasCmpxchg16b = false;		HasCmpxchg16b = false;
UseLeaForSP = false;		UseLeaForSP = false;
HasFastPartialYMMWrite = false;		HasFastPartialYMMWrite = false;
HasFastScalarFSQRT = false;		HasFastScalarFSQRT = false;
HasFastVectorFSQRT = false;		HasFastVectorFSQRT = false;
		HasFastLZCNT = false;
HasSlowDivide32 = false;		HasSlowDivide32 = false;
HasSlowDivide64 = false;		HasSlowDivide64 = false;
PadShortFunctions = false;		PadShortFunctions = false;
CallRegIndirect = false;		CallRegIndirect = false;
LEAUsesAG = false;		LEAUsesAG = false;
SlowLEA = false;		SlowLEA = false;
SlowIncDec = false;		SlowIncDec = false;
stackAlignment = 4;		stackAlignment = 4;
▲ Show 20 Lines • Show All 42 Lines • Show Last 20 Lines

test/CodeGen/X86/lzcnt-zext-cmp.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; Test patterns which generates lzcnt instructions.
				; Eg: zext(or(setcc(cmp), setcc(cmp))) -> shr(or(lzcnt, lzcnt))
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mcpu=btver2 | FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-pc-linux -mcpu=btver2 -mattr=-fast-lzcnt | FileCheck --check-prefix=NOFASTLZCNT %s

				spatel: Instead of specifying a different CPU, this RUN would be better if it also used btver2, but explicitly disabled the 'fast-lzcnt' attribute. That way we verify the codegen with and without the attribute while simultaneously verifying that btver2 has this attribute by default.
				; Test two 32-bit inputs, output is 32-bit.
				RKSimon: Test single input version to make sure it isn't being used.
				pgousseau: Makes sense.
				define i32 @bar1(i32 %a, i32 %b) {
				spatel: Please give each test a meaningful name and/or add a comment to explain exactly what each test is checking.
				; CHECK-LABEL: bar1:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntl %edi, %ecx
				; CHECK-NEXT: lzcntl %esi, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar1:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i32
				ret i32 %lor.ext
				}

				; Test two 64-bit inputs, output is 64-bit.
				define i64 @bar2(i64 %a, i64 %b) {
				; CHECK-LABEL: bar2:
				; CHECK: # BB#0:
				; CHECK-NEXT: lzcntq %rdi, %rcx
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar2:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i64 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i64
				ret i64 %lor.ext
				}

				; Test two 16-bit inputs, output is 16-bit.
				; The transform is disabled for the 16-bit case, as we still have to clear the
				; upper 16-bits, adding one more instruction.
				define i16 @bar3(i16 %a, i16 %b) {
				; CHECK-LABEL: bar3:
				; CHECK: # BB#0:
				; CHECK-NEXT: testw %di, %di
				; CHECK-NEXT: sete %al
				; CHECK-NEXT: testw %si, %si
				; CHECK-NEXT: sete %cl
				; CHECK-NEXT: orb %al, %cl
				; CHECK-NEXT: movzbl %cl, %eax
				; CHECK-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar3:
				; NOFASTLZCNT: # BB#0:
				; NOFASTLZCNT-NEXT: testw %di, %di
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testw %si, %si
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
				; NOFASTLZCNT-NEXT: retq
				%cmp = icmp eq i16 %a, 0
				%cmp1 = icmp eq i16 %b, 0
				%or = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %or to i16
				ret i16 %lor.ext
				}

				; Test two 32-bit inputs, output is 64-bit.
				define i64 @bar4(i32 %a, i32 %b) {
				; CHECK-LABEL: bar4:
				spatel: lzcnt has a 16-bit variant in the ISA. Is there some reason not to use it here?
				pgousseau: I disabled the 16-bit case because this seems to lead to bigger code in general. For this test case we would get:
				    lzcntw %di, %ax
				    andl $16, %eax
				    shrl $4, %eax
				    retq
				instead of:
				    xorl %eax, %eax
				    testw %di, %di
				    sete %al
				    retq
				which seems bigger code for the same result.
				; CHECK: # BB#0: # %entry
				spatel: Oh...yuck. So we really should've done 'xor %eax, %eax' ahead of the lzcnt in that case? Please add a note about why we don't do 16-bit in the code comments and also here in the test case.
				; CHECK-NEXT: lzcntl %edi, %ecx
				pgousseau: Sounds good, will do thanks.
				; CHECK-NEXT: lzcntl %esi, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar4:
				spatel: There should be at least one test of the 'HasInterestingUses' logic (if that logic really belongs in this patch).
				pgousseau: Yes I meant to add those, will do thanks.
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%0 = or i1 %cmp, %cmp1
				%conv = zext i1 %0 to i64
				ret i64 %conv
				}

				; Test two 64-bit inputs, output is 32-bit.
				define i32 @bar5(i64 %a, i64 %b) {
				; CHECK-LABEL: bar5:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntq %rdi, %rcx
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar5:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testq %rdi, %rdi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i64 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%0 = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %0 to i32
				ret i32 %lor.ext
				}

				; Test three 32-bit inputs, output is 32-bit.
				define i32 @bar6(i32 %a, i32 %b, i32 %c) {
				; CHECK-LABEL: bar6:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %ecx
				; CHECK-NEXT: orl %eax, %ecx
				; CHECK-NEXT: lzcntl %edx, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar6:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %cl, %al
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp2 = icmp eq i32 %c, 0
				%.cmp2 = or i1 %or.cond, %cmp2
				%lor.ext = zext i1 %.cmp2 to i32
				ret i32 %lor.ext
				}

				; Test three 32-bit inputs, output is 32-bit, but compared to bar6 test,
				; %.cmp2 inputs' order is inverted.
				define i32 @bar7(i32 %a, i32 %b, i32 %c) {
				; CHECK-LABEL: bar7:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %ecx
				; CHECK-NEXT: orl %eax, %ecx
				; CHECK-NEXT: lzcntl %edx, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar7:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %cl, %al
				spatel: Update name: "bar6"
				pgousseau: Thanks for spotting this, will fix before commit.
				; NOFASTLZCNT-NEXT: movzbl %al, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp2 = icmp eq i32 %c, 0
				%.cmp2 = or i1 %cmp2, %or.cond
				%lor.ext = zext i1 %.cmp2 to i32
				ret i32 %lor.ext
				}

				; Test four 32-bit inputs, output is 32-bit.
				define i32 @bar8(i32 %a, i32 %b, i32 %c, i32 %d) {
				; CHECK-LABEL: bar8:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntl %edi, %eax
				; CHECK-NEXT: lzcntl %esi, %esi
				; CHECK-NEXT: lzcntl %edx, %edx
				; CHECK-NEXT: orl %eax, %esi
				; CHECK-NEXT: lzcntl %ecx, %eax
				; CHECK-NEXT: orl %edx, %eax
				; CHECK-NEXT: orl %esi, %eax
				; CHECK-NEXT: shrl $5, %eax
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar8:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %dil
				; NOFASTLZCNT-NEXT: testl %esi, %esi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: orb %dil, %al
				; NOFASTLZCNT-NEXT: testl %edx, %edx
				; NOFASTLZCNT-NEXT: sete %dl
				; NOFASTLZCNT-NEXT: testl %ecx, %ecx
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %dl, %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i32 %b, 0
				%or.cond = or i1 %cmp, %cmp1
				%cmp3 = icmp eq i32 %c, 0
				%or.cond5 = or i1 %or.cond, %cmp3
				%cmp4 = icmp eq i32 %d, 0
				%.cmp4 = or i1 %or.cond5, %cmp4
				%lor.ext = zext i1 %.cmp4 to i32
				ret i32 %lor.ext
				}

				; Test one 32-bit input, one 64-bit input, output is 32-bit.
				define i32 @bar9(i32 %a, i64 %b) {
				; CHECK-LABEL: bar9:
				; CHECK: # BB#0: # %entry
				; CHECK-NEXT: lzcntq %rsi, %rax
				; CHECK-NEXT: lzcntl %edi, %ecx
				; CHECK-NEXT: shrl $5, %ecx
				; CHECK-NEXT: shrl $6, %eax
				; CHECK-NEXT: orl %ecx, %eax
				; CHECK-NEXT: # kill: %EAX<def> %EAX<kill> %RAX<kill>
				; CHECK-NEXT: retq
				;
				; NOFASTLZCNT-LABEL: bar9:
				; NOFASTLZCNT: # BB#0: # %entry
				; NOFASTLZCNT-NEXT: testl %edi, %edi
				; NOFASTLZCNT-NEXT: sete %al
				; NOFASTLZCNT-NEXT: testq %rsi, %rsi
				; NOFASTLZCNT-NEXT: sete %cl
				; NOFASTLZCNT-NEXT: orb %al, %cl
				; NOFASTLZCNT-NEXT: movzbl %cl, %eax
				; NOFASTLZCNT-NEXT: retq
				entry:
				%cmp = icmp eq i32 %a, 0
				%cmp1 = icmp eq i64 %b, 0
				%0 = or i1 %cmp, %cmp1
				%lor.ext = zext i1 %0 to i32
				ret i32 %lor.ext
				}
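The bar9 CHECK lines shift each count separately (`shrl $5` for the 32-bit input, `shrl $6` for the 64-bit one) before the `orl`, unlike the same-width tests where a single shift follows the OR. A rough sketch of why a shared shift would not work when widths differ is given below; the function names are hypothetical and only model the arithmetic, not the actual lowering code.

```python
def lzcnt(x: int, width: int) -> int:
    """Leading-zero count of x as an unsigned `width`-bit value (width if x == 0)."""
    x &= (1 << width) - 1
    return width - x.bit_length()


def either_zero_mixed(a32: int, b64: int) -> int:
    # Shift each count by log2 of its own width, then OR the 0/1 results,
    # mirroring the shrl $5 / shrl $6 / orl order in the bar9 CHECK lines.
    return (lzcnt(a32, 32) >> 5) | (lzcnt(b64, 64) >> 6)


def shared_shift(a32: int, b64: int, shift: int) -> int:
    # Hypothetical alternative: OR the raw counts first, then shift once.
    return (lzcnt(a32, 32) | lzcnt(b64, 64)) >> shift


if __name__ == "__main__":
    assert either_zero_mixed(0, 1) == 1
    assert either_zero_mixed(1, 0) == 1
    assert either_zero_mixed(1, 1) == 0
    # A single shift by 6 loses the 32-bit zero flag (bit 5):
    assert shared_shift(0, 1, 6) == 0
    # A single shift by 5 misreads a non-zero 64-bit count such as 63:
    assert shared_shift(1, 1, 5) == 1
```

The "is zero" flag sits in bit 5 of a 32-bit count and in bit 6 of a 64-bit count, so the two flags only line up after each count has been shifted by its own amount.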