This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
ClosedPublic

Authored by pgousseau on Aug 12 2016, 5:27 AM.

Details

Summary

Following the discussion on D22038, this change enables the setcc to srl(ctlz) transformation on the btver2 architecture.
This optimisation is beneficial on Jaguar architecture only, where the lzcnt has a good reciprocal throughput.
Other architectures such as Intel's Haswell/Broadwell or AMD's Bulldozer/PileDriver do not benefit from it.
For this reason the change also add a "HasFastLZCNT" feature which gets enabled for Jaguar.

This patch requires D23445

Diff Detail

Repository
rL LLVM

Event Timeline

pgousseau updated this revision to Diff 67824.Aug 12 2016, 5:27 AM
pgousseau retitled this revision from to [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures..
pgousseau updated this object.
pgousseau added a subscriber: llvm-commits.
pgousseau updated this revision to Diff 67858.Aug 12 2016, 10:46 AM

Updating patch to reflect changes in D23445

spatel added inline comments.Aug 15 2016, 9:15 AM
lib/Target/X86/X86.td
182–184 ↗(On Diff #67858)

I would move this down with the other 'fake' features (ie, the other fast/slow attributes). Someday, we may come up with a better way to distinguish performance "features" from architectural ones.

It would also be good to explain exactly what we mean by "fast" in this context.

Finally, use a hyphen to make this more readable: "fast-lzcnt".

lib/Target/X86/X86ISelLowering.cpp
30780–30796 ↗(On Diff #67858)

I don't understand the need for this check, so at the least it needs a code comment to explain why it is here.
Related: if we're matching the pattern starting from a zext, doesn't that miss the icmp/icmp/or patterns that you were originally hoping to optimize in D22038?

test/CodeGen/X86/lzcnt-zext-cmp.ll
5 ↗(On Diff #67858)

Instead of specifying a different CPU, this RUN would be better if it also used btver2, but explicitly disabled the 'fast-lzcnt' attribute. That way we verify the codegen with and without the attribute while simultaneously verifying that btver2 has this attribute by default.

7 ↗(On Diff #67858)

Please give each test a meaningful name and/or add a comment to explain exactly what each test is checking.

81–100 ↗(On Diff #67858)

There should be at least one test of the 'HasInterestingUses' logic (if that logic really belongs in this patch).

pgousseau added inline comments.Aug 15 2016, 11:04 AM
lib/Target/X86/X86.td
182–184 ↗(On Diff #67858)

Sounds good will do.

lib/Target/X86/X86ISelLowering.cpp
30780–30796 ↗(On Diff #67858)

Sounds good, I will think of a better comment.
I added this check to be conservative for now as I noticed several worst code gen occurences (around 50% of matches in openssl).
I hope to address this and the icmp/icmp/OR case in another patch.

test/CodeGen/X86/lzcnt-zext-cmp.ll
81–100 ↗(On Diff #67858)

Yes I meant to add those, will do thanks.

pgousseau updated this revision to Diff 68209.Aug 16 2016, 10:21 AM

Rebased changes and following Sanjay's comments:

  • Move down feature declaration, add hyphen, add comment.
  • Remove hasInterestingUses check.
  • Use fast-lzcnt in test and add comments.

Looking again the codegen of openssl without the "hasInterestingUses" constraint the codegen does not seem worse in terms of speed, only the size is not as good as it could be but I think it is ok? Something must have gone wrong during my initial testing I suppose ...

RKSimon edited edge metadata.Aug 17 2016, 3:35 AM

Minor tweak request.

lib/Target/X86/X86InstrInfo.td
837 ↗(On Diff #68209)

Move this down to the other fast/slow definitions.

lib/Target/X86/X86Subtarget.cpp
257 ↗(On Diff #68209)

Move this down to the other fast/slow variables.

lib/Target/X86/X86Subtarget.h
428 ↗(On Diff #68209)

Move this down to the other fast/slow functions.

pgousseau updated this revision to Diff 68336.Aug 17 2016, 4:56 AM
pgousseau edited edge metadata.

Moving down fast-lzcnt feature changes following Simon's comments.

spatel edited edge metadata.Aug 17 2016, 7:44 AM

Looking again the codegen of openssl without the "hasInterestingUses" constraint the codegen does not seem worse in terms of speed, only the size is not as good as it could be but I think it is ok? Something must have gone wrong during my initial testing I suppose ...

I wasn't expecting a size difference. Can you provide more details about the size and perf changes that you see with this change? We may want to gate the transform based on 'optForSize'.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3566–3567 ↗(On Diff #68336)

Remove "llvm::"

lib/Target/X86/X86.td
265–266 ↗(On Diff #68336)

The description still seems too vague. The key point is that lzcnt must have the same latency and throughput as test/set, right?

How about: "If lzcnt has equivalent latency/throughput to most simple integer ops, it can be used to replace test/set sequences."

test/CodeGen/X86/lzcnt-zext-cmp.ll
85–92 ↗(On Diff #68336)

lzcnt has a 16-bit variant in the ISA. Is there some reason not to use it here?

Yes sounds like I should disable this if optForSize is enabled.

For example in openssl, out of 89 srl(lzcnt) optimisations, around 30 cases cause larger code, for example I see 25 of those:

3055:	31 ed                	xor    %ebp,%ebp
3057:	ff c2                	inc    %edx
3059:	40 0f 94 c5          	sete   %bpl
305d:	01 e9                	add    %ebp,%ecx

-> 10 bytes

309d:	ff c2                	inc    %edx
309f:	f3 0f bd ea          	lzcnt  %edx,%ebp
30a3:	c1 ed 05             	shr    $0x5,%ebp
30a6:	01 e9                	add    %ebp,%ecx

-> 12 bytes

But this should not affect performances.

Luckily openssl's libcrypto total size remains smaller as the other remaining matches result in fewer and as fast instructions.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3566–3567 ↗(On Diff #68336)

Sounds good, will do.

lib/Target/X86/X86.td
265–266 ↗(On Diff #68336)

Sounds good thanks, will do.

test/CodeGen/X86/lzcnt-zext-cmp.ll
85–92 ↗(On Diff #68336)

I disabled the 16-bit case because this seems to lead to bigger code in general, for this test case we would get:

lzcntw  %di, %ax
andl    $16, %eax
shrl    $4, %eax
retq

instead of:

xorl    %eax, %eax
testw   %di, %di
sete    %al
retq

which seems bigger code for the same result.

pgousseau updated this revision to Diff 68506.Aug 18 2016, 3:42 AM
pgousseau edited edge metadata.

Following Sanjay's comments:

  • Remove llvm namespace
  • Rewrite feature's comment
  • Disable transform if optForSize is true

Please can you add some tests with optsize enabled?

  • Disable transform if optForSize is true

optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".

Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.

Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

test/CodeGen/X86/lzcnt-zext-cmp.ll
86–93 ↗(On Diff #68506)

Oh...yuck. So we really should've done 'xor %eax, %eax' ahead of the lzcnt in that case? Please add a note about why we don't do 16-bit in the code comments and also here in the test case.

pgousseau updated this revision to Diff 68577.Aug 18 2016, 11:18 AM

Add tests for optForSize following Simon's comments.

  • Disable transform if optForSize is true

optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".

Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.

Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text size from openssl with and without the change.

12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions is saved:

f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
The functions contains a 100 block a inline assembly corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from the presence of less instruction which all have the same latency/throughput 1/0.5.

Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Let me know if more detailed are required.

test/CodeGen/X86/lzcnt-zext-cmp.ll
87–94 ↗(On Diff #68577)

Sounds good, will do thanks.

pgousseau updated this revision to Diff 68669.Aug 19 2016, 3:51 AM

Add comments for the 16-bit case following Sanjay's comment.

With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text size from openssl with and without the change.

12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions is saved:

f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
The functions contains a 100 block a inline assembly corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from the presence of less instruction which all have the same latency/throughput 1/0.5.

Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?

@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?

With this change the total size of openssl is smaller by at most 0.5%.
This is because while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize because it seems the size can go both ways?

Here are the text size from openssl with and without the change.

12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions is saved:

f3 45 0f bd f6       	lzcnt  %r14d,%r14d
41 c1 ee 05          	shr    $0x5,%r14d

31 c0                	xor    %eax,%eax
45 85 f6             	test   %r14d,%r14d
0f 94 c0             	sete   %al
41 89 c6             	mov    %eax,%r14d

This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.

Sounds good, will remove the optForSize.

For speed measurements I am running a microbenchmark using google's libbenchmark.
Let me know if you want me to email you the source.
Here are the results on a jaguar cpu.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations the function is being run.
The functions contains a 100 block a inline assembly corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from the presence of less instruction which all have the same latency/throughput 1/0.5.

Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                  Time           CPU Iterations
--------------------------------------------------------
BM_f1_old/8              784 ns        784 ns     893325
BM_f1_old/512          49911 ns      49911 ns      14025
BM_f1_old/8k          798898 ns     798898 ns        876
BM_f1_new/8              585 ns        585 ns    1196970
BM_f1_new/512          37170 ns      37170 ns      18830
BM_f1_new/8k          595136 ns     595135 ns       1175
BM_f2_old/8/8          13573 ns      13574 ns      51548
BM_f2_old/512/512   55446038 ns   55446001 ns         13
BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
BM_f2_new/8/8           9126 ns       9127 ns      76692
BM_f2_new/512/512   37212798 ns   37212874 ns         19
BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?

Yes this benchmark seems to show those kind of improvements, for example BM_f1_new/8 is ~25% faster than BM_f1_old/8.
Although now I reran the SPEC h264ref benchmark and unfortunately, I am seeing some small (less than 1%) but consistent performance degradation with h264ref that I wasn't seeing with the initial tablegen patch.
Why the micro-benchmark does not show this I am not sure, one hypothesis is that the micro-benchmark is only comparing the 32-bit "register, register" variant of the change, so it might be that I need to restrict the transformation to this pattern.
Or the micro-benchmark is not representative of the performance and in that case this change is probably not worth pursuing. Will come back with the findings.

@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?

Any update on the performance investigations?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3588 ↗(On Diff #68669)

These can be replaced with DAG.getZExtOrTrunc(Scc, dl, ExtTy);

Any update on the performance investigations?

Hi Simon/Sanjay,

Sorry for the delayed follow-up!
I have ran more tests and it seems the regressions in performances I was observing with SPEC's h264 are within the noise now so I cant tell if this patch is improving or degrading perfomances in SPEC's h264 benchmark.
I am more confident the OR case brings better performances because we will be replacing

48 85 FF                test     rdi,rdi
0F 94 C0                sete     al
48 85 F6                test     rsi,rsi
0F 94 C1                sete     cl
08 C1                   or       cl,al
0F B6 C1                movzx    eax,cl
C3                      ret

by this:

F3 0F BD CE             lzcnt    ecx,esi
F3 0F BD C7             lzcnt    eax,edi
09 C8                   or       eax,ecx
C1 E8 05                shr      eax,5
C3                      ret

My plan now is to make the patch to handle the OR case only, what do you guys think?
Would X86ISelLowering still be the best place if only supporting the OR case?

Thanks!

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3588 ↗(On Diff #68669)

Will do thanks!

I am more confident the OR case brings better performances because we will be replacing

48 85 FF                test     rdi,rdi
0F 94 C0                sete     al
48 85 F6                test     rsi,rsi
0F 94 C1                sete     cl
08 C1                   or       cl,al
0F B6 C1                movzx    eax,cl
C3                      ret

by this:

F3 0F BD CE             lzcnt    ecx,esi
F3 0F BD C7             lzcnt    eax,edi
09 C8                   or       eax,ecx
C1 E8 05                shr      eax,5
C3                      ret

My plan now is to make the patch to handle the OR case only, what do you guys think?
Would X86ISelLowering still be the best place if only supporting the OR case?

The OR case certainly looks better in isolation (less instructions, less code size). If you are measuring perf improvements from that alone, I think we can be more confident that the transform to lzcnt is the source of that improvement. It's still not clear to me how the micro-benchmark was improved so much for the simpler case.

To match the OR pattern, I think you would either add some code to visitOr() or add tablegen patterns if it is possible to match the DAG nodes that way.

pgousseau updated this revision to Diff 73497.Oct 4 2016, 9:44 AM

Hi Simon/Sanjay,

Following up with with the previous comments, this patch contains:

  • Use of DAG.getZextOrTrunc as per Simon's comment.
  • Removed support for the simple case.
  • Added handling of OR based patterns.
  • Added a case for SRL nodes in 'isTypeDesirableForOp()' as to favor 32 bits encoding when targetting X86.
  • Added support for multiple OR patterns eg: (a1||a2||a3||a4||a5)

Let me know what you think,

Thanks!

Some quick remarks - I'll do more of review later after others have had a chance to look at it.

lib/Target/X86/X86ISelLowering.cpp
29016 ↗(On Diff #73497)

Should we return early if !Subtarget.isCtlzFast()?

29022 ↗(On Diff #73497)

Are we letting vector types through here?

31797 ↗(On Diff #73497)

Comment this. Should this be part of a separate patch with its own tests? Its fine if its only exposed by this patch to leave it here, but it should have a comment either way.

test/CodeGen/X86/lzcnt-zext-cmp.ll
7 ↗(On Diff #73497)

Test single input version to make sure it isn't being used.

pgousseau added inline comments.Oct 5 2016, 10:05 AM
lib/Target/X86/X86ISelLowering.cpp
29016 ↗(On Diff #73497)

Yes that makes sense.

29022 ↗(On Diff #73497)

Yes it seems that way, will have to restrict to 32-bit and 64-bit integers, thanks for spotting this.

31797 ↗(On Diff #73497)

Yes would make sense to have a separate patch, will try to find an associated test case.

test/CodeGen/X86/lzcnt-zext-cmp.ll
7 ↗(On Diff #73497)

Makes sense.

pgousseau updated this revision to Diff 74136.Oct 10 2016, 9:25 AM

Following Simon's comments:

  • Add an early return in case isCtlzFast is false
  • Add a test checking that no transformations occurs on single input cases.
  • Prevent transformation on 128-bit cases: int foo(int128_t a, int128_t b) { return a || b;} as this pessimized the codegen.
    • This requires constraining the pattern to the X86 form of an equality comparison with 0.
    • This remove the need from the earlier patch to modify isTypeDesirableForOp().

Possibly better test names than barXXXX, other than that I'm happy with this - @spatel ?

spatel added inline comments.Oct 13 2016, 9:01 AM
lib/Target/X86/X86ISelLowering.cpp
29037–29042 ↗(On Diff #74136)

The pattern must always begin with zext. Is there some reason not to start in combineZext() rather than combineOr()?

pgousseau updated this revision to Diff 74661.Oct 14 2016, 5:09 AM

Following Simon and Sanjay comments:

  • Renamed tests to 'test_zext_cmpXX'
  • Start pattern matching from combineZext instead of combineOR.

Also added missing "hasOneUse" checks to OR nodes.
Removed one unneeded check for "isSetCCCandidate"

Let me know what you guys think,

Thanks,

Pierre

spatel accepted this revision.Oct 14 2016, 8:04 AM
spatel edited edge metadata.

LGTM. See inline for nits.

lib/Target/X86/X86ISelLowering.cpp
4191 ↗(On Diff #74661)

Could remove or assert hasLZCNT()?

29105–29107 ↗(On Diff #74661)

No need for braces here.

test/CodeGen/X86/lzcnt-zext-cmp.ll
201 ↗(On Diff #74661)

Update name: "bar6"

This revision is now accepted and ready to land.Oct 14 2016, 8:04 AM
pgousseau added inline comments.Oct 14 2016, 8:59 AM
lib/Target/X86/X86ISelLowering.cpp
4191 ↗(On Diff #74661)

Yes makes sense, will remove before commit.

29105–29107 ↗(On Diff #74661)

Will remove before commit.

test/CodeGen/X86/lzcnt-zext-cmp.ll
201 ↗(On Diff #74661)

Thanks for spotting this, will fix before commit.

This revision was automatically updated to reflect the committed changes.