This is an archive of the discontinued LLVM Phabricator instance.

Differential D46662

[X86] condition branches folding for three-way conditional codes
Closed · Public

Authored by xur on May 9 2018, 2:20 PM.

Details

Reviewers

craig.topper
RKSimon
andreadb
gbedwell

Commits

rG67b1b328f702: [X86] condition branches folding for three-way conditional codes
rL343993: [X86] condition branches folding for three-way conditional codes

Summary

This file defines a pass that optimizes condition branches on x86 by taking advantage of the three-way conditional code generated by compare instructions.

Currently, it tries to hoist an EQ or NE conditional branch to a dominating conditional branch where the same EQ/NE condition code is computed. An example:

```
bb_0:
  cmp %0, 19
  jg bb_1
  jmp bb_2
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_4
bb_4:
  cmp %0, 20
  je bb_5
  jmp bb_6
```

Here we could combine the two compares in bb_0 and bb_4 and have the following code:

```
bb_0:
  cmp %0, 20
  jg bb_1
  jl bb_2
  jmp bb_5
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_6
```

For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control height for bb_6 is also reduced. bb_4 is gone after the optimization.
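The equivalence of the two control-flow graphs above can be checked with a small standalone model. This is only a sketch, not code from the patch: `leafBefore`, `leafAfter`, and `sameLeafOnRange` are hypothetical names, and each leaf function returns the number of the block that control finally reaches (2, 3, 5, or 6) for a given value of %0.

```cpp
#include <cassert>

// Hypothetical model of the original CFG from the summary.
// Returns the number of the block that control finally reaches.
int leafBefore(int x) {
  // bb_0: cmp %0, 19 ; jg bb_1 ; jmp bb_2
  if (x > 19) {
    // bb_1: cmp %0, 40 ; jg bb_3 ; jmp bb_4
    if (x > 40)
      return 3;
    // bb_4: cmp %0, 20 ; je bb_5 ; jmp bb_6
    return x == 20 ? 5 : 6;
  }
  return 2;
}

// Hypothetical model of the CFG after folding.
int leafAfter(int x) {
  // bb_0: cmp %0, 20 ; jg bb_1 ; jl bb_2 ; jmp bb_5
  if (x > 20) {
    // bb_1: cmp %0, 40 ; jg bb_3 ; jmp bb_6
    return x > 40 ? 3 : 6;
  }
  if (x < 20)
    return 2;
  return 5;
}

// True when both models reach the same leaf for every x in [lo, hi].
bool sameLeafOnRange(int lo, int hi) {
  for (int x = lo; x <= hi; ++x)
    if (leafBefore(x) != leafAfter(x))
      return false;
  return true;
}
```

The rewrite relies on `x > 19` and `x >= 20` being the same test for integers, so a single `cmp %0, 20` can serve the greater-than, less-than, and equal outcomes.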

This optimization is motivated by the branch pattern generated by switch lowering: we always have a pivot-1 compare for the inner nodes, and we do a pivot compare again at the leaf (as in the pattern above).

My test on Haswell shows that this optimization improves a tight nested loop consisting of evenly distributed switch cases by 8% to 10%.

Thanks,

-Rong

Diff Detail

Repository: rL LLVM

Event Timeline

xur created this revision. May 9 2018, 2:20 PM

Herald added a subscriber: mgorny. May 9 2018, 2:20 PM

For reference: there are a total of 1301 instances of this optimization in a bootstrap build of LLVM:

grep "Found one" bootstrap_buildlog |sort |uniq -c

1094 Found one path (len=1):
  176 Found one path (len=2):
   30 Found one path (len=3):
    1 Found one path (len=4):

davidxl removed a reviewer: davidxl. May 29 2018, 1:56 PM

davidxl added a subscriber: davidxl.

This has been a while. Kindly ping.

xur added a reviewer: RKSimon. May 29 2018, 2:13 PM

xbolva00 added a subscriber: xbolva00. May 29 2018, 2:23 PM

Please can you give more details on what range of tests you performed and whether you tested on anything other than Haswell? Bunching conditional branches like this can make things very difficult for some branch prediction units.

Hi Simon,
This patch was motivated by the code patterns generated in switch lowering. The tests I used were switch statements. I tested 4 cases and 15 cases, and all the cases are evenly distributed.

My initial test was on ixion-haswell.

Upon getting your review comment, I tested some other platforms:

iota-sandybridge: both 4case and 15case have no performance difference.
iota-ivybridge: 4case has a 5% performance gain. 15case has no performance difference.
ixion-broadwell: 4case has a 1%-2% performance loss (might be noise). 15case has an 8% performance gain.
indus-skylake: 4case has a 7% performance gain. 15case has a 3% performance gain.

Would it be possible for you to put the tests somewhere so we can compare the effect on some AMD and older Intel machines?

I attached my test cases to this email.

They are slightly different from the ones I used: those were using our
internal test infrastructure, and now I made them standalone.

I don't have access to machines other than the ones I mentioned in my
previous message. Simon: Could you test on other machines?

Let me know if you prefer that I make the tests part of the patch.

msg-22811-131.txt (162 B)
4evencases.cc (1 KB)
15evencases.cc (2 KB)

RKSimon added reviewers: andreadb, gbedwell. Jun 7 2018, 2:46 AM

Kindly ping.

xbolva00 added inline comments. Jun 23 2018, 5:09 AM

lib/Target/X86/X86CondBrFolding.cpp
201 ↗	(On Diff #146001)	!TBB etc..

Sync to the latest compiler and integrated xbolva00's comments.

Simon: have you had a chance to measure the performance on other platforms?

Thanks,

-Rong

xbolva00 added inline comments. Sep 25 2018, 1:57 AM

test/CodeGen/X86/condbr_switch.ll
17 ↗	(On Diff #166733)	Maybe remove "local_unnamed_addr", "dso_local", .. ?

Did you bootstrap clang/llvm with this patch? maybe also SPEC or similar benchmark?

In D46662#1244459, @xbolva00 wrote:

Did you bootstrap clang/llvm with this patch? maybe also SPEC or similar benchmark?

Yes. Bootstrapped clang/llvm
And tested SPEC2006 and Google internal benchmarks.

Nice! This looks good, ping @RKSimon for more suggestions, if any :)

craig.topper added inline comments. Sep 25 2018, 12:08 PM

lib/Target/X86/X86CondBrFolding.cpp
465 ↗	(On Diff #166733)	CmpVlaue->CmpValue
522 ↗	(On Diff #166733)	This might just be me, but I kind of feel like analyzeMBB should be part of the X86condBrFolding and the TargetMBBInfo struct should only be created if its needed with the information the analysis collected. Right now we speculatively create a struct we have to delete if we fail the checks.
568 ↗	(On Diff #166733)	Minor, add blank line between the functions here.
test/CodeGen/X86/switch-bt.ll
142 ↗	(On Diff #166733)	Is this comment about 29 stale even in the original code?

In D46662#1245501, @xbolva00 wrote:

Nice! This looks good, ping @RKSimon for more suggestions, if any :)

Hi Rong,

Sorry for the late reply.
I can help with testing your patch on AMD Jaguar. Tomorrow morning I will post my findings (I am not at work now).

-Andrea

craig.topper added inline comments. Sep 25 2018, 12:33 PM

lib/Target/X86/X86CondBrFolding.cpp
275 ↗	(On Diff #166733)	&* isn't needed
349 ↗	(On Diff #166733)	&* isn't needed
412 ↗	(On Diff #166733)	I don't think this "&*" is needed. get() should return a pointer. You shouldn't need to dereference that and take the address of it.
414 ↗	(On Diff #166733)	This assert doesn't really accomplish anything. MBBInfo was already dereferenced on the line before so we crashed before we got to the assert.
419 ↗	(On Diff #166733)	&* isn't needed
430 ↗	(On Diff #166733)	&* isn't needed.
548 ↗	(On Diff #166733)	&* isn't needed
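The two recurring points in these inline comments (the redundant "&*" and the assert placed after a dereference) can be illustrated outside LLVM. The following is a minimal sketch under stated assumptions: `BlockInfo`, `makeInfo`, `sameAddress`, and `readCmpValue` are hypothetical names, not identifiers from the patch.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the per-block analysis record.
struct BlockInfo {
  int CmpValue = 0;
};

// Helper for the illustration: builds a record holding the given value.
std::unique_ptr<BlockInfo> makeInfo(int Value) {
  auto Info = std::make_unique<BlockInfo>();
  Info->CmpValue = Value;
  return Info;
}

// get() already yields a raw pointer, so wrapping it in "&*" adds nothing:
// both expressions name the same address (P must be non-null here).
bool sameAddress(const std::unique_ptr<BlockInfo> &P) {
  BlockInfo *WithRoundTrip = &*P.get();
  BlockInfo *Plain = P.get();
  return WithRoundTrip == Plain;
}

// The null check only helps if it runs before the first dereference.
int readCmpValue(const BlockInfo *Info) {
  assert(Info && "expected a non-null BlockInfo");
  return Info->CmpValue;
}
```

An assert placed after `Info->CmpValue` would never fire for a null pointer, because the dereference on the preceding line would already have failed.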

Thanks for Craig's review. Here is the new patch that integrated most of his comments.
I will post another version that moves analyzeMBB() to X86CondBrFolding, as suggested by Craig.

lib/Target/X86/X86CondBrFolding.cpp
414 ↗	(On Diff #166733)	moved before the dereference.
522 ↗	(On Diff #166733)	You have a good point. Current way is easier for sharing the analysis result but the structure is not the best. I'll restructure this in a later patch.
test/CodeGen/X86/switch-bt.ll
142 ↗	(On Diff #166733)	You are right! I did not pay attention to this. I deleted the stale comments.

New patch that integrated Craig's review comments.

This patch integrated Craig's suggestion to move analyzeMBB() to X86CondBrFolding class.
Also reformatted the code using clang-format.

craig.topper added inline comments. Sep 25 2018, 5:25 PM

lib/Target/X86/X86CondBrFolding.cpp
141 ↗	(On Diff #167030)	Drop the &*
395 ↗	(On Diff #167030)	No need for a condition here. If you return a null std::unique_ptr it can still be moved into the map. Though since you've created an entry in the map for every basic block, you might be able to use a vector indexed by MBB->getNumber(). You can get the total number of indices to initially size the vector from MF->getNumBlockIDs(). Then you can use operator[] on the vector instead of find. This should work as long as no basic blocks are created by this pass.
404 ↗	(On Diff #167030)	No need for the * and & here.
499 ↗	(On Diff #167030)	Can you just return nullptr instead of default constructed std::unique_ptr?
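The vector-indexed-by-block-number idea from the comment on line 395 can be sketched without LLVM headers. Everything here is an assumption made for illustration: `Block` stands in for a machine basic block with a stable number, `BlockInfo` for the analysis result, and `analyze`, `buildInfos`, and `getInfo` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for a machine basic block and its analysis result.
struct Block {
  unsigned Number;      // plays the role of MBB->getNumber()
  bool HasCmpAndBranch; // whether the block is a candidate at all
  int CmpValue;
};
struct BlockInfo {
  int CmpValue;
};

// Returning nullptr is enough; it converts to an empty std::unique_ptr.
std::unique_ptr<BlockInfo> analyze(const Block &B) {
  if (!B.HasCmpAndBranch)
    return nullptr;
  return std::make_unique<BlockInfo>(BlockInfo{B.CmpValue});
}

// One slot per block ID, sized up front; non-candidates keep a null entry.
std::vector<std::unique_ptr<BlockInfo>>
buildInfos(const std::vector<Block> &Blocks, std::size_t NumBlockIDs) {
  std::vector<std::unique_ptr<BlockInfo>> Infos(NumBlockIDs);
  for (const Block &B : Blocks)
    Infos[B.Number] = analyze(B); // no condition needed, null moves fine
  return Infos;
}

// Lookup is a plain index instead of a map find.
const BlockInfo *getInfo(const std::vector<std::unique_ptr<BlockInfo>> &Infos,
                         unsigned Number) {
  assert(Number < Infos.size() && "block number out of range");
  return Infos[Number].get();
}
```

As the comment notes, this indexing scheme only holds while the pass creates no new blocks, since a new block would get a number beyond the size chosen up front.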

Hi Rong,

On Jaguar, this pass may increase branch density to the point where it hurts the performance of the branch predictor.

Branch prediction in Jaguar is affected by branch density.
To give you a bit of background: Jaguar's BTB is logically partitioned in two levels. A first level, which is specialized in sparse branches; a second level which is specialized in dense branches, and it is dynamically allocated (when there are more than 2 branches per cache line).
L2 is a bit slower (dynamically allocated), and tends to have a lower throughput than the L1. So, ideally, L1 should be used as much as possible.

This patch increases branch density to the point where the L2 BTB usage increases, and the efficiency of the branch predictor decreases.

Bench: 4evencases.cc

Without your patch (10 runs):

Each iteration uses 902058 nano seconds
Case counts: 0 261000000 250000000 246000000 243000000
Each iteration uses 887837 nano seconds
Case counts: 0 281000000 253000000 227000000 239000000
Each iteration uses 887856 nano seconds
Case counts: 0 256000000 254000000 236000000 254000000
Each iteration uses 880632 nano seconds
Case counts: 0 279000000 236000000 244000000 241000000
Each iteration uses 1.03057e+06 nano seconds
Case counts: 0 258000000 257000000 243000000 242000000
Each iteration uses 883759 nano seconds
Case counts: 0 248000000 262000000 278000000 212000000
Each iteration uses 910438 nano seconds
Case counts: 0 248000000 254000000 243000000 255000000
Each iteration uses 885671 nano seconds
Case counts: 0 258000000 266000000 231000000 245000000
Each iteration uses 912325 nano seconds
Case counts: 0 225000000 264000000 270000000 241000000
Each iteration uses 904952 nano seconds
Case counts: 0 261000000 240000000 241000000 258000000

With your patch (10 runs):

Each iteration uses 916110 nano seconds
Case counts: 0 223000000 266000000 263000000 248000000
Each iteration uses 918773 nano seconds
Case counts: 0 266000000 230000000 236000000 268000000
Each iteration uses 903100 nano seconds
Case counts: 0 250000000 249000000 231000000 270000000
Each iteration uses 923196 nano seconds
Case counts: 0 241000000 243000000 276000000 240000000
Each iteration uses 911282 nano seconds
Case counts: 0 241000000 239000000 266000000 254000000
Each iteration uses 910201 nano seconds
Case counts: 0 210000000 263000000 260000000 267000000
Each iteration uses 925672 nano seconds
Case counts: 0 245000000 265000000 236000000 254000000
Each iteration uses 932643 nano seconds
Case counts: 0 235000000 259000000 256000000 250000000
Each iteration uses 937735 nano seconds
Case counts: 0 261000000 242000000 259000000 238000000
Each iteration uses 954895 nano seconds
Case counts: 0 254000000 239000000 271000000 236000000

Overall, 4evencases.cc is ~2% slower with this patch.
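For readers who want to reproduce the summary figure, here is a small sketch that computes the mean and median slowdown from the ten 4evencases.cc timings listed above. The helper names (`mean`, `median`, `slowdownPercent`, `baseline4`, `patched4`) are made up for this illustration; the numbers are copied from the runs in this comment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Arithmetic mean of a non-empty sample.
double mean(const std::vector<double> &V) {
  assert(!V.empty() && "mean of an empty sample");
  double Sum = 0.0;
  for (double X : V)
    Sum += X;
  return Sum / static_cast<double>(V.size());
}

// Median of a non-empty sample (takes a copy so the caller's data is not reordered).
double median(std::vector<double> V) {
  assert(!V.empty() && "median of an empty sample");
  std::sort(V.begin(), V.end());
  std::size_t N = V.size();
  return N % 2 ? V[N / 2] : (V[N / 2 - 1] + V[N / 2]) / 2.0;
}

// Relative slowdown of New versus Base, in percent.
double slowdownPercent(double Base, double New) {
  return (New / Base - 1.0) * 100.0;
}

// Nanoseconds per iteration, 4evencases.cc, without the patch.
std::vector<double> baseline4() {
  return {902058.0, 887837.0, 887856.0, 880632.0, 1030570.0,
          883759.0, 910438.0, 885671.0, 912325.0, 904952.0};
}

// Nanoseconds per iteration, 4evencases.cc, with the patch.
std::vector<double> patched4() {
  return {916110.0, 918773.0, 903100.0, 923196.0, 911282.0,
          910201.0, 925672.0, 932643.0, 937735.0, 954895.0};
}
```

On these ten runs the mean slowdown comes out at roughly 1.6% and the median slowdown at roughly 2.9%; the median is less sensitive to the single 1.03057e+06 baseline outlier.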

Bench: 15evencases.cc

Without your patch (10 runs):

Each iteration uses 1.10148e+06 nano seconds
Case counts: 0 56000000 60000000 68000000 61000000 69000000 64000000 80000000 64000000 68000000 66000000 83000000 74000000 50000000 73000000 64000000
Each iteration uses 1.0648e+06 nano seconds
Case counts: 0 71000000 59000000 55000000 64000000 73000000 57000000 55000000 74000000 76000000 67000000 77000000 57000000 82000000 54000000 79000000
Each iteration uses 1.06872e+06 nano seconds
Case counts: 0 55000000 80000000 59000000 45000000 70000000 61000000 68000000 72000000 77000000 67000000 88000000 63000000 61000000 77000000 57000000
Each iteration uses 1.04146e+06 nano seconds
Case counts: 0 68000000 61000000 67000000 50000000 70000000 68000000 73000000 69000000 61000000 78000000 69000000 64000000 67000000 75000000 60000000
Each iteration uses 1.0549e+06 nano seconds
Case counts: 0 66000000 75000000 64000000 64000000 74000000 78000000 63000000 64000000 67000000 57000000 65000000 63000000 74000000 66000000 60000000
Each iteration uses 1.04246e+06 nano seconds
Case counts: 0 66000000 69000000 63000000 76000000 66000000 78000000 44000000 66000000 61000000 75000000 66000000 70000000 67000000 64000000 69000000
Each iteration uses 1.07907e+06 nano seconds
Case counts: 0 63000000 66000000 81000000 68000000 56000000 71000000 71000000 68000000 58000000 65000000 64000000 75000000 63000000 71000000 60000000
Each iteration uses 1.05432e+06 nano seconds
Case counts: 0 66000000 67000000 70000000 65000000 57000000 53000000 62000000 62000000 63000000 74000000 68000000 81000000 70000000 77000000 65000000
Each iteration uses 1.04041e+06 nano seconds
Case counts: 0 71000000 71000000 65000000 69000000 77000000 67000000 52000000 60000000 73000000 80000000 76000000 66000000 55000000 49000000 69000000
Each iteration uses 1.07782e+06 nano seconds
Case counts: 0 68000000 76000000 63000000 79000000 76000000 71000000 65000000 61000000 63000000 63000000 61000000 56000000 67000000 61000000 70000000

With your patch (10 runs):

Each iteration uses 1.11151e+06 nano seconds
Case counts: 0 64000000 64000000 73000000 72000000 69000000 75000000 66000000 70000000 77000000 59000000 50000000 74000000 68000000 58000000 61000000
Each iteration uses 1.28406e+06 nano seconds
Case counts: 0 68000000 63000000 66000000 69000000 68000000 58000000 71000000 60000000 80000000 66000000 80000000 69000000 57000000 62000000 63000000
Each iteration uses 1.18149e+06 nano seconds
Case counts: 0 67000000 68000000 66000000 69000000 71000000 67000000 64000000 69000000 72000000 61000000 73000000 60000000 66000000 71000000 56000000
Each iteration uses 1.30169e+06 nano seconds
Case counts: 0 74000000 66000000 69000000 64000000 70000000 64000000 59000000 61000000 53000000 75000000 74000000 58000000 72000000 68000000 73000000
Each iteration uses 1.15588e+06 nano seconds
Case counts: 0 62000000 66000000 67000000 62000000 79000000 65000000 59000000 54000000 65000000 61000000 62000000 82000000 74000000 68000000 74000000
Each iteration uses 1.16992e+06 nano seconds
Case counts: 0 69000000 64000000 71000000 60000000 60000000 70000000 64000000 77000000 65000000 75000000 61000000 70000000 61000000 77000000 56000000
Each iteration uses 1.2683e+06 nano seconds
Case counts: 0 66000000 69000000 73000000 76000000 72000000 59000000 64000000 61000000 53000000 78000000 66000000 63000000 66000000 57000000 77000000
Each iteration uses 1.17196e+06 nano seconds
Case counts: 0 67000000 69000000 84000000 52000000 56000000 70000000 58000000 64000000 71000000 72000000 67000000 68000000 68000000 73000000 61000000
Each iteration uses 1.28627e+06 nano seconds
Case counts: 0 70000000 70000000 70000000 57000000 73000000 71000000 70000000 57000000 57000000 67000000 69000000 61000000 60000000 76000000 72000000
Each iteration uses 1.28318e+06 nano seconds
Case counts: 0 61000000 72000000 70000000 80000000 68000000 59000000 59000000 65000000 49000000 78000000 65000000 64000000 64000000 77000000 69000000

Here the performance varies a lot depending on whether we are in the dense branch portion, or not. Note also that prediction through the L2 BTB has a lower throughput (as in branches per cycle).

Excluding outliers, the average performance degradation is ~8-10%.

While this analysis has been only conducted on Jaguar, I suspect that similar problems would affect AMD Bobcat too, since branch prediction for that core is similar to the one in Jaguar.

I wouldn't be surprised if instead this patch improves the performance of code on other big AMD cores like Bulldozer/ryzen.

However, at least for now, I suggest to make this pass optional (i.e. make this pass opt-in for subtargets).
Definitely, it should be disabled for Jaguar (BtVer2) and Bobcat.

-Andrea

What is the definition of branch density?

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in subtarget way, it would be good to know the performance there.

Regards,

-Rong

In D46662#1246727, @davidxl wrote:

What is the definition of branch density?

In this context, it is the number of branches per cache line.
(See: "AMD software optimization guide for family 16h processors" - Section 2.7.1.5: "Branch Marker Caching" ).

For Jaguar, L2 BTB entries are consumed when there are more than two branches per cache line.
There is a nice description in Section 2.7.1.2 "Branch Target Buffer".
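To make the definition concrete, the sketch below counts branches per cache line from a list of branch instruction addresses and reports how many lines exceed a threshold. It is an illustration only: the function names are hypothetical, the cache line size is a parameter rather than a claim about any particular core, and the threshold of two comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Maps each cache line index to the number of branches that start in it.
std::map<std::uint64_t, unsigned>
branchesPerLine(const std::vector<std::uint64_t> &BranchAddrs,
                std::uint64_t LineSize) {
  std::map<std::uint64_t, unsigned> Count;
  for (std::uint64_t Addr : BranchAddrs)
    ++Count[Addr / LineSize];
  return Count;
}

// Number of cache lines holding more than Threshold branches.
unsigned denseLines(const std::vector<std::uint64_t> &BranchAddrs,
                    std::uint64_t LineSize, unsigned Threshold) {
  unsigned Dense = 0;
  for (const auto &LineAndCount : branchesPerLine(BranchAddrs, LineSize))
    if (LineAndCount.second > Threshold)
      ++Dense;
  return Dense;
}
```

With a 64-byte line (an assumed value for the example) and a threshold of 2, branches at 0x00, 0x10, and 0x20 make the first line dense, while two branches at 0x80 and 0x90 do not.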

I hope it helps.
-Andrea

In D46662#1246781, @xur wrote:

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in subtarget way, it would be good to know the performance there.

CC'ing @lebedev.ri and @GGanesh.
They should be able to help you with running those tests on Bulldozer/Ryzen. Unfortunately, I don't have access to those machines.

Integrated Craig's new review comments, in particular, using vector instead of densemap for MBBInfos.

I suggest to make this pass optional (i.e. make this pass opt-in for subtargets).

@xur are you gonna work on this? As you already measured some numbers, you can enable it for haswell, skylake, etc..

test/CodeGen/X86/condbr_if.ll
35 ↗	(On Diff #166733)	We don't need "dso_local" here

Using new SubtargetFeature method (suggested by Andrea) to make this pass opt-in for subtargets.
Changed the tests accordingly.
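The opt-in scheme can be pictured with a standalone model. This is a sketch under assumptions, not the code in the patch: the `Subtarget` struct, `subtargetFor`, and `runCondBrFolding` are hypothetical; only the field name `ThreewayBranchProfitable` is taken from the X86.td change in this revision, and the two CPU names are used purely as examples of a subtarget that opts in and one that does not.

```cpp
#include <cassert>
#include <string>

// Hypothetical subtarget record; the flag name mirrors the X86.td change.
struct Subtarget {
  bool ThreewayBranchProfitable = false; // off unless a CPU opts in
};

// Illustrative lookup: only CPUs listed here opt in to the transformation.
Subtarget subtargetFor(const std::string &CPU) {
  Subtarget ST;
  if (CPU == "sandybridge")
    ST.ThreewayBranchProfitable = true;
  return ST;
}

// Models the pass entry point: bail out early when the subtarget has not
// opted in, otherwise report whether any candidate was transformed.
bool runCondBrFolding(const Subtarget &ST, unsigned NumCandidates) {
  if (!ST.ThreewayBranchProfitable)
    return false;
  return NumCandidates > 0;
}
```

Defaulting the flag to false means a subtarget such as btver2 stays unaffected without having to be listed anywhere.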

Looks better and better :)

if you are interested in some more cmp/jmp opportunities, please also see:
https://bugs.llvm.org/show_bug.cgi?id=38002
https://bugs.llvm.org/show_bug.cgi?id=39116

In D46662#1248780, @xur wrote:

Using new SubtargetFeature method (suggested by Andrea) to make this pass opt-in for subtargets.
Changed the tests accordingly.

Thanks Rong.

I have some comments (See below).

Cheers
-Andrea

lib/Target/X86/X86CondBrFolding.cpp
90–101 ↗	(On Diff #167430)	Is field `MBB` ever used? If not, then you can remove it. It is quite a big auxiliary struct. I wonder if it could be smaller...
131–137 ↗	(On Diff #167430)	MBBInfos is initialized with one element (a descriptor) per machine basic block. If we don't create new basic blocks, then this code could just be rewritten as: return MBBInfos[MBB->getNumber()].get(); About the assert: if I understand your algorithm correctly, getMBBInfo is never called on a null MBB. That being said, it is not wrong to have an assert that validates a precondition. If you want to keep it, then please add a meaningful message to it.
149 ↗	(On Diff #167430)	This assert is completely redundant. If MBBInfo was null, then you would have already had a segfault at line 148 ... (Also: please add string messages to asserts..)
398–400 ↗	(On Diff #167430)	Replace this loop with: MBBInfos.resize(MF.getNumBlockIDs()); I am not even sure that you actually need this loop. The loop at line 401 is essentially initializing MBBInfos. So, you could just merge these two loops, and have a single loop that calls `MBBInfos.emplace_back(..);`.
446–447 ↗	(On Diff #167430)	unsigned.
474–475 ↗	(On Diff #167430)	Can this ever happen with those opcodes? I think you can safely convert this check into an assert.
493 ↗	(On Diff #167430)	Is this variable really needed? It seems like you only use CC.

Thanks for Andrea's more recent review and very helpful suggestions. Here is the updated patch.

lib/Target/X86/X86CondBrFolding.cpp
90–101 ↗	(On Diff #167430)	Removed the MBB field. The main reason for so many fields is to reuse the analysis information in the optimization phase. We can reduce the struct size, but then we need to recompute.
131–137 ↗	(On Diff #167430)	You are absolutely right on this. I simplified the code.
149 ↗	(On Diff #167430)	I meant to have the assert right after getMBBInfo() call, as getMBBInfo can return nullptr. Fixed
398–400 ↗	(On Diff #167430)	Indeed, resize() is better here. I don't think we can merge the two loops, because their iteration orders are different.
474–475 ↗	(On Diff #167430)	you are right! replaced with assert.
493 ↗	(On Diff #167430)	Good catch. This is a leftover of the refactoring.

Integrated Andrea's review comments

Thanks.
I don’t have other comments.

I didn’t properly review the part where you fix branch probabilities. So, it would be nice if somebody else reviews that part.

lib/Target/X86/X86CondBrFolding.cpp
473 ↗	(On Diff #167775)	“should” is repeated here.
474 ↗	(On Diff #167775)	s/brand/branch

This revision is now accepted and ready to land. Oct 1 2018, 12:22 PM

Fix typos identified by Andrea.

Closed by commit rL343993: [X86] condition branches folding for three-way conditional codes (authored by xur). Oct 8 2018, 11:54 AM

This revision was automatically updated to reflect the committed changes.

In D46662#1246810, @andreadb wrote:

In D46662#1246781, @xur wrote:

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in subtarget way, it would be good to know the performance there.

CC'ing @lebedev.ri and @GGanesh.
They should be able to help you with running those tests on Bulldozer/Ryzen. Unfortunately, I don't have access to those machines.

I *think* this should be fine on bdver2, as per https://www.agner.org/optimize/microarchitecture.pdf:

19.15 Branches and loops
The branch prediction mechanism is described on page 34. There is no longer any
restriction on the number of branches per 16 bytes of code that can be predicted efficiently.
The misprediction penalty is quite high because of a long pipeline.

In D46662#1246550, @andreadb wrote:

...
Bench: 4evencases.cc
...
Bench: 15evencases.cc
...
I wouldn't be surprised if instead this patch improves the performance of code on other big AMD cores like Bulldozer/ryzen.

Are these benchmarks available from somewhere? Can i run them somehow?

-Andrea

Roman

In D46662#1293043, @lebedev.ri wrote:
Are these benchmarks available from somewhere? Can i run them

Sorry Roman,
I completely missed that comment.

Those two benchmarks were attached by Xur to this code review.
You should be able to see the attachments if you expand the “Show Older Changes” section (there is a link at the top of this review).
One of his posts has got 3 attachments. Two of these files are the benchmarks to run.

I hope it helps.

-Andrea

Herald added a project: Restricted Project. Mar 25 2019, 2:43 PM

Herald added a subscriber: jdoerfert.

4evencases.cc

There is a regression with -O3 -march=native on Haswell. It is slower than standard -O3. GCC 9 with -O3 -march=native produces faster code than standard -O3.

In D46662#1442345, @andreadb wrote:
Sorry Roman,
I completely missed that comment.

Those two benchmarks were attached by Xur to this code review.
You should be able to see the attachments if you expand the “Show Older Changes” section (there is a link at the top of this review).
One of his posts has got 3 attachments. Two of these files are the benchmarks to run.

Aha! Not sure how i did not find those. Thank you!

I hope it helps.

-Andrea

Roman
In D46662#1293043, @lebedev.ri wrote:
I *think* this should be fine on bdver2, as per https://www.agner.org/optimize/microarchitecture.pdf:
19.15 Branches and loops
The branch prediction mechanism is described on page 34. There is no longer any
restriction on the number of branches per 16 bytes of code that can be predicted efficiently.
The misprediction penalty is quite high because of a long pipeline.

Measurements (n=25) say that 15 cases improves (avg: -0.18%, median: -0.35%),
and 4 cases appears to improve (avg: -0.03%, median: +0.07%)
I will submit a patch.

Uhm, is this missing some plumbing from clang side?
The pass isn't being run even with -march=native, unless enabled via -mllvm -x86-condbr-folding=true

lebedev.ri added inline comments. Mar 31 2019, 9:13 AM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
455–456	So wait, shouldn't this respect `FeatureMergeToThreeWayBranch`?

Right, PR39658 / D54593, that explains this.

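Revision Contents

Path                                              Size
llvm/trunk/lib/Target/X86/CMakeLists.txt          1 line
llvm/trunk/lib/Target/X86/X86.h                   3 lines
llvm/trunk/lib/Target/X86/X86.td                  7 lines
llvm/trunk/lib/Target/X86/X86CondBrFolding.cpp    579 lines
llvm/trunk/lib/Target/X86/X86Subtarget.h          4 lines
llvm/trunk/lib/Target/X86/X86TargetMachine.cpp    7 lines
llvm/trunk/test/CodeGen/X86/O3-pipeline.ll        1 line
llvm/trunk/test/CodeGen/X86/condbr_if.ll          178 lines
llvm/trunk/test/CodeGen/X86/condbr_switch.ll      167 lines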
Diff 168695

llvm/trunk/lib/Target/X86/CMakeLists.txt

(21 unchanged lines not shown)
  set(sources
  ShadowCallStack.cpp
  X86AsmPrinter.cpp
  X86CallFrameOptimization.cpp
  X86CallingConv.cpp
  X86CallLowering.cpp
  X86CmovConversion.cpp
+ X86CondBrFolding.cpp
  X86DomainReassignment.cpp
  X86ExpandPseudo.cpp
  X86FastISel.cpp
  X86FixupBWInsts.cpp
  X86FixupLEAs.cpp
  X86AvoidStoreForwardingBlocks.cpp
  X86FixupSetCC.cpp
  X86FlagsCopyLowering.cpp
(40 unchanged lines not shown)

llvm/trunk/lib/Target/X86/X86.h

(69 unchanged lines not shown)
  /// Return a pass that removes redundant LEA instructions and redundant address
  /// recalculations.
  FunctionPass *createX86OptimizeLEAs();

  /// Return a pass that transforms setcc + movzx pairs into xor + setcc.
  FunctionPass *createX86FixupSetCC();

+ /// Return a pass that folds conditional branch jumps.
+ FunctionPass *createX86CondBrFolding();
+
  /// Return a pass that avoids creating store forward block issues in the hardware.
  FunctionPass *createX86AvoidStoreForwardingBlocks();

  /// Return a pass that lowers EFLAGS copy pseudo instructions.
  FunctionPass *createX86FlagsCopyLoweringPass();

  /// Return a pass that expands WinAlloca pseudo-instructions.
  FunctionPass *createX86WinAllocaExpander();
(49 unchanged lines not shown)

llvm/trunk/lib/Target/X86/X86.td

(398 unchanged lines not shown)
  def FeatureMOVDIRI : SubtargetFeature<"movdiri", "HasMOVDIRI", "true",
  "Support movdiri instruction">;
  def FeatureMOVDIR64B : SubtargetFeature<"movdir64b", "HasMOVDIR64B", "true",
  "Support movdir64b instruction">;

  def FeatureFastBEXTR : SubtargetFeature<"fast-bextr", "HasFastBEXTR", "true",
  "Indicates that the BEXTR instruction is implemented as a single uop "
  "with good throughput.">;

+ // Merge branches using three-way conditional code.
+ def FeatureMergeToThreeWayBranch : SubtargetFeature<"merge-to-threeway-branch",
+ "ThreewayBranchProfitable", "true",
+ "Merge branches to a three-way "
+ "conditional branch">;
+
  //===----------------------------------------------------------------------===//
  // Register File Description
  //===----------------------------------------------------------------------===//

  include "X86RegisterInfo.td"
  include "X86RegisterBanks.td"

  //===----------------------------------------------------------------------===//
(312 unchanged lines not shown)
  def SNBFeatures : ProcessorFeatures<[], [
  FeaturePCLMUL,
  FeatureXSAVE,
  FeatureXSAVEOPT,
  FeatureLAHFSAHF,
  FeatureSlow3OpsLEA,
  FeatureFastScalarFSQRT,
  FeatureFastSHLDRotate,
  FeatureSlowIncDec,
+ FeatureMergeToThreeWayBranch,
  FeatureMacroFusion
  ]>;

  class SandyBridgeProc<string Name> : ProcModel<Name, SandyBridgeModel,
  SNBFeatures.Value, [
  FeatureSlowUAMem32,
  FeaturePOPCNTFalseDeps
  ]>;
(514 unchanged lines not shown)

llvm/trunk/lib/Target/X86/X86CondBrFolding.cpp

				//===---- X86CondBrFolding.cpp - optimize conditional branches ------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				// This file defines a pass that optimizes condition branches on x86 by taking
				// advantage of the three-way conditional code generated by compare
				// instructions.
				// Currently, it tries to hoist EQ and NE conditional branches to a dominating
				// conditional branch where the same EQ/NE condition code is
				// computed. An example:
				// bb_0:
				// cmp %0, 19
				// jg bb_1
				// jmp bb_2
				// bb_1:
				// cmp %0, 40
				// jg bb_3
				// jmp bb_4
				// bb_4:
				// cmp %0, 20
				// je bb_5
				// jmp bb_6
				// Here we could combine the two compares in bb_0 and bb_4 and have the
				// following code:
				// bb_0:
				// cmp %0, 20
				// jg bb_1
				// jl bb_2
				// jmp bb_5
				// bb_1:
				// cmp %0, 40
				// jg bb_3
				// jmp bb_6
				// For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control
				// height for bb_6 is also reduced. bb_4 is gone after the optimization.
				//
				// This code pattern is common, especially from switch-case lowering,
				// where we generate a compare of "pivot-1" for the inner nodes in the
				// binary search tree.
				//===----------------------------------------------------------------------===//

				#include "X86.h"
				#include "X86InstrInfo.h"
				#include "X86Subtarget.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/Support/BranchProbability.h"

				using namespace llvm;

				#define DEBUG_TYPE "x86-condbr-folding"

				STATISTIC(NumFixedCondBrs, "Number of x86 condbr folded");

				namespace {
				class X86CondBrFoldingPass : public MachineFunctionPass {
				public:
				X86CondBrFoldingPass() : MachineFunctionPass(ID) {}

				StringRef getPassName() const override { return "X86 CondBr Folding"; }

				bool runOnMachineFunction(MachineFunction &MF) override;

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				MachineFunctionPass::getAnalysisUsage(AU);
				AU.addRequired<MachineBranchProbabilityInfo>();
				}

				private:
				static char ID;
				};

				char X86CondBrFoldingPass::ID = 0;
				} // namespace

				FunctionPass *llvm::createX86CondBrFolding() {
				return new X86CondBrFoldingPass();
				}

				// A struct that stores the auxiliary information for each MBB.
				struct TargetMBBInfo {
				MachineBasicBlock *TBB;
				MachineBasicBlock *FBB;
				MachineInstr *BrInstr;
				MachineInstr *CmpInstr;
				X86::CondCode BranchCode;
				unsigned SrcReg;
				int CmpValue;
				bool Modified;
				bool CmpBrOnly;
				};

				// A class that optimizes conditional branches by hoisting and merging CondCodes.
				class X86CondBrFolding {
				public:
				X86CondBrFolding(const X86InstrInfo *TII,
				const MachineBranchProbabilityInfo *MBPI,
				MachineFunction &MF)
				: TII(TII), MBPI(MBPI), MF(MF) {}
				bool optimize();

				private:
				const X86InstrInfo *TII;
				const MachineBranchProbabilityInfo *MBPI;
				MachineFunction &MF;
				std::vector<std::unique_ptr<TargetMBBInfo>> MBBInfos;
				SmallVector<MachineBasicBlock *, 4> RemoveList;

				void optimizeCondBr(MachineBasicBlock &MBB,
				SmallVectorImpl<MachineBasicBlock *> &BranchPath);
				void fixBranchProb(MachineBasicBlock *NextMBB, MachineBasicBlock *RootMBB,
				SmallVectorImpl<MachineBasicBlock *> &BranchPath);
				void replaceBrDest(MachineBasicBlock *MBB, MachineBasicBlock *OrigDest,
				MachineBasicBlock *NewDest);
				void fixupModifiedCond(MachineBasicBlock *MBB);
				std::unique_ptr<TargetMBBInfo> analyzeMBB(MachineBasicBlock &MBB);
				static bool analyzeCompare(const MachineInstr &MI, unsigned &SrcReg,
				int &CmpValue);
				bool findPath(MachineBasicBlock *MBB,
				SmallVectorImpl<MachineBasicBlock *> &BranchPath);
				TargetMBBInfo *getMBBInfo(MachineBasicBlock *MBB) const {
				return MBBInfos[MBB->getNumber()].get();
				}
				};

				// Find a valid path on which we can reuse the CondCode.
				// The resulting path (if true is returned) is stored in BranchPath.
				// Return value:
				// false: if no valid path is found.
				// true: a valid path is found and the targetBB can be reached.
				bool X86CondBrFolding::findPath(
				MachineBasicBlock *MBB, SmallVectorImpl<MachineBasicBlock *> &BranchPath) {
				TargetMBBInfo *MBBInfo = getMBBInfo(MBB);
				assert(MBBInfo && "Expecting a candidate MBB");
				int CmpValue = MBBInfo->CmpValue;

				MachineBasicBlock *PredMBB = *MBB->pred_begin();
				MachineBasicBlock *SaveMBB = MBB;
				while (PredMBB) {
				TargetMBBInfo *PredMBBInfo = getMBBInfo(PredMBB);
				if (!PredMBBInfo || PredMBBInfo->SrcReg != MBBInfo->SrcReg)
				return false;

				assert(SaveMBB == PredMBBInfo->TBB || SaveMBB == PredMBBInfo->FBB);
				bool IsFalseBranch = (SaveMBB == PredMBBInfo->FBB);

				X86::CondCode CC = PredMBBInfo->BranchCode;
				assert(CC == X86::COND_L || CC == X86::COND_G || CC == X86::COND_E);
				int PredCmpValue = PredMBBInfo->CmpValue;
				bool ValueCmpTrue = ((CmpValue < PredCmpValue && CC == X86::COND_L) ||
				(CmpValue > PredCmpValue && CC == X86::COND_G) ||
				(CmpValue == PredCmpValue && CC == X86::COND_E));
				// Check if both the result of value compare and the branch target match.
				if (!(ValueCmpTrue ^ IsFalseBranch)) {
				LLVM_DEBUG(dbgs() << "Dead BB detected!\n");
				return false;
				}

				BranchPath.push_back(PredMBB);
				// These are the conditions on which we could combine the compares.
				if ((CmpValue == PredCmpValue) ||
				(CmpValue == PredCmpValue - 1 && CC == X86::COND_L) ||
				(CmpValue == PredCmpValue + 1 && CC == X86::COND_G))
				return true;

				// If PredMBB has more than one pred, or is not a pure cmp and br, we bail out.
				if (PredMBB->pred_size() != 1 || !PredMBBInfo->CmpBrOnly)
				return false;

				SaveMBB = PredMBB;
				PredMBB = *PredMBB->pred_begin();
				}
				return false;
				}
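The combine test at the end of each `findPath` iteration can be read in isolation. The following is a minimal standalone sketch of that predicate, not the pass's code: `Cond` and `canCombine` are hypothetical names introduced here for illustration, and the arithmetic is widened to `long long` only so that the sketch avoids signed overflow at the extremes.

```cpp
// Hypothetical stand-in for the three X86::CondCode values findPath accepts.
enum class Cond { L, G, E };

// Sketch of the "can we reuse the predecessor's compare?" test:
// the two immediates are equal, or they are adjacent in the direction
// of the predecessor's branch (pivot-1 with COND_L, pivot+1 with COND_G).
bool canCombine(int cmpValue, int predCmpValue, Cond cc) {
  const long long v = cmpValue;
  const long long p = predCmpValue;
  return v == p ||
         (cc == Cond::L && v == p - 1) ||
         (cc == Cond::G && v == p + 1);
}
```

With the example from the file header, the leaf compare against 20 combines with the `cmp %0, 19; jg` in bb_0 (`canCombine(20, 19, Cond::G)` holds), but not with the `cmp %0, 40; jg` in bb_1, so the walk continues one block further up the path.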

				// Fix up any PHI node in the successor of MBB.
				static void fixPHIsInSucc(MachineBasicBlock *MBB, MachineBasicBlock *OldMBB,
				MachineBasicBlock *NewMBB) {
				if (NewMBB == OldMBB)
				return;
				for (auto MI = MBB->instr_begin(), ME = MBB->instr_end();
				MI != ME && MI->isPHI(); ++MI)
				for (unsigned i = 2, e = MI->getNumOperands() + 1; i != e; i += 2) {
				MachineOperand &MO = MI->getOperand(i);
				if (MO.getMBB() == OldMBB)
				MO.setMBB(NewMBB);
				}
				}

				// Utility function to set branch probability for edge MBB->SuccMBB.
				static inline bool setBranchProb(MachineBasicBlock *MBB,
				MachineBasicBlock *SuccMBB,
				BranchProbability Prob) {
				auto MBBI = std::find(MBB->succ_begin(), MBB->succ_end(), SuccMBB);
				if (MBBI == MBB->succ_end())
				return false;
				MBB->setSuccProbability(MBBI, Prob);
				return true;
				}

				// Utility function to find the unconditional br instruction in MBB.
				static inline MachineBasicBlock::iterator
				findUncondBrI(MachineBasicBlock *MBB) {
				return std::find_if(MBB->begin(), MBB->end(), [](MachineInstr &MI) -> bool {
				return MI.getOpcode() == X86::JMP_1;
				});
				}

				// Replace MBB's original successor, OrigDest, with NewDest.
				// Also update the MBBInfo for MBB.
				void X86CondBrFolding::replaceBrDest(MachineBasicBlock *MBB,
				MachineBasicBlock *OrigDest,
				MachineBasicBlock *NewDest) {
				TargetMBBInfo *MBBInfo = getMBBInfo(MBB);
				MachineInstr *BrMI;
				if (MBBInfo->TBB == OrigDest) {
				BrMI = MBBInfo->BrInstr;
				unsigned JNCC = GetCondBranchFromCond(MBBInfo->BranchCode);
				MachineInstrBuilder MIB =
				BuildMI(*MBB, BrMI, MBB->findDebugLoc(BrMI), TII->get(JNCC))
				.addMBB(NewDest);
				MBBInfo->TBB = NewDest;
				MBBInfo->BrInstr = MIB.getInstr();
				} else { // Should be the unconditional jump stmt.
				MachineBasicBlock::iterator UncondBrI = findUncondBrI(MBB);
				BuildMI(*MBB, UncondBrI, MBB->findDebugLoc(UncondBrI), TII->get(X86::JMP_1))
				.addMBB(NewDest);
				MBBInfo->FBB = NewDest;
				BrMI = &*UncondBrI;
				}
				fixPHIsInSucc(NewDest, OrigDest, MBB);
				BrMI->eraseFromParent();
				MBB->addSuccessor(NewDest);
				setBranchProb(MBB, NewDest, MBPI->getEdgeProbability(MBB, OrigDest));
				MBB->removeSuccessor(OrigDest);
				}

				// Change the CondCode and BrInstr according to MBBInfo.
				void X86CondBrFolding::fixupModifiedCond(MachineBasicBlock *MBB) {
				TargetMBBInfo *MBBInfo = getMBBInfo(MBB);
				if (!MBBInfo->Modified)
				return;

				MachineInstr *BrMI = MBBInfo->BrInstr;
				X86::CondCode CC = MBBInfo->BranchCode;
				MachineInstrBuilder MIB = BuildMI(*MBB, BrMI, MBB->findDebugLoc(BrMI),
				TII->get(GetCondBranchFromCond(CC)))
				.addMBB(MBBInfo->TBB);
				BrMI->eraseFromParent();
				MBBInfo->BrInstr = MIB.getInstr();

				MachineBasicBlock::iterator UncondBrI = findUncondBrI(MBB);
				BuildMI(*MBB, UncondBrI, MBB->findDebugLoc(UncondBrI), TII->get(X86::JMP_1))
				.addMBB(MBBInfo->FBB);
				MBB->erase(UncondBrI);
				MBBInfo->Modified = false;
				}

				//
				// Apply the transformation:
				//   RootMBB -1-> ... PredMBB -3-> MBB -5-> TargetMBB
				//       \-2->           \-4->       \-6-> FalseMBB
				// ==>
				//             RootMBB -1-> ... PredMBB -7-> FalseMBB
				// TargetMBB <-8-/ \-2->           \-4->
				//
				// Note that PredMBB and RootMBB could be the same.
				// And in the case of dead TargetMBB, we will not have TargetMBB and edge 8.
				//
				// There is special handling when RootMBB's condition is COND_E, in which
				// case we directly short-circuit the branch instruction.
				//
				void X86CondBrFolding::optimizeCondBr(
				MachineBasicBlock &MBB, SmallVectorImpl<MachineBasicBlock *> &BranchPath) {

				X86::CondCode CC;
				TargetMBBInfo *MBBInfo = getMBBInfo(&MBB);
				assert(MBBInfo && "Expecting a candidate MBB");
				MachineBasicBlock *TargetMBB = MBBInfo->TBB;
				BranchProbability TargetProb = MBPI->getEdgeProbability(&MBB, MBBInfo->TBB);

				// Forward the jump from MBB's predecessor to MBB's false target.
				MachineBasicBlock *PredMBB = BranchPath.front();
				TargetMBBInfo *PredMBBInfo = getMBBInfo(PredMBB);
				assert(PredMBBInfo && "Expecting a candidate MBB");
				if (PredMBBInfo->Modified)
				fixupModifiedCond(PredMBB);
				CC = PredMBBInfo->BranchCode;
				// Don't do this if the depth of BranchPath is 1 and PredMBB is COND_E.
				// We will short-circuit directly in this case.
				if (!(CC == X86::COND_E && BranchPath.size() == 1))
				replaceBrDest(PredMBB, &MBB, MBBInfo->FBB);

				MachineBasicBlock *RootMBB = BranchPath.back();
				TargetMBBInfo *RootMBBInfo = getMBBInfo(RootMBB);
				assert(RootMBBInfo && "Expecting a candidate MBB");
				if (RootMBBInfo->Modified)
				fixupModifiedCond(RootMBB);
				CC = RootMBBInfo->BranchCode;

				if (CC != X86::COND_E) {
				MachineBasicBlock::iterator UncondBrI = findUncondBrI(RootMBB);
				// RootMBB: Cond jump to the original not-taken MBB.
				X86::CondCode NewCC;
				switch (CC) {
				case X86::COND_L:
				NewCC = X86::COND_G;
				break;
				case X86::COND_G:
				NewCC = X86::COND_L;
				break;
				default:
				llvm_unreachable("unexpected conditional code.");
				}
				BuildMI(*RootMBB, UncondBrI, RootMBB->findDebugLoc(UncondBrI),
				TII->get(GetCondBranchFromCond(NewCC)))
				.addMBB(RootMBBInfo->FBB);

				// RootMBB: Jump to TargetMBB
				BuildMI(*RootMBB, UncondBrI, RootMBB->findDebugLoc(UncondBrI),
				TII->get(X86::JMP_1))
				.addMBB(TargetMBB);
				RootMBB->addSuccessor(TargetMBB);
				fixPHIsInSucc(TargetMBB, &MBB, RootMBB);
				RootMBB->erase(UncondBrI);
				} else {
				replaceBrDest(RootMBB, RootMBBInfo->TBB, TargetMBB);
				}

				// Change RootMBB's CmpValue to MBB's CmpValue. Don't set the immediate
				// directly; move MBB's compare here instead, as the opcode might differ.
				if (RootMBBInfo->CmpValue != MBBInfo->CmpValue) {
				MachineInstr *NewCmp = MBBInfo->CmpInstr;
				NewCmp->removeFromParent();
				RootMBB->insert(RootMBBInfo->CmpInstr, NewCmp);
				RootMBBInfo->CmpInstr->eraseFromParent();
				}

				// Invalidate MBBInfo just in case.
				MBBInfos[MBB.getNumber()] = nullptr;
				MBBInfos[RootMBB->getNumber()] = nullptr;

				// Fix branch Probabilities.
				auto fixBranchProb = [&](MachineBasicBlock *NextMBB) {
				BranchProbability Prob;
				for (auto &I : BranchPath) {
				MachineBasicBlock *ThisMBB = I;
				if (!ThisMBB->hasSuccessorProbabilities() ||
				!ThisMBB->isSuccessor(NextMBB))
				break;
				Prob = MBPI->getEdgeProbability(ThisMBB, NextMBB);
				if (Prob.isUnknown())
				break;
				TargetProb = Prob * TargetProb;
				Prob = Prob - TargetProb;
				setBranchProb(ThisMBB, NextMBB, Prob);
				if (ThisMBB == RootMBB) {
				setBranchProb(ThisMBB, TargetMBB, TargetProb);
				}
				ThisMBB->normalizeSuccProbs();
				if (ThisMBB == RootMBB)
				break;
				NextMBB = ThisMBB;
				}
				return true;
				};
				if (CC != X86::COND_E && !TargetProb.isUnknown())
				fixBranchProb(MBBInfo->FBB);

				if (CC != X86::COND_E)
				RemoveList.push_back(&MBB);

				LLVM_DEBUG(dbgs() << "After optimization:\nRootMBB is: " << *RootMBB << "\n");
				if (BranchPath.size() > 1)
				LLVM_DEBUG(dbgs() << "PredMBB is: " << *(BranchPath[0]) << "\n");
				}
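The probability fix-up inside `optimizeCondBr` is easier to follow with numbers. Below is a small sketch of one iteration of the `fixBranchProb` lambda for the simplest case, a path of length 1 where PredMBB and RootMBB are the same block. It uses plain `double` values instead of `BranchProbability`, and `RootEdges` and `hoistTargetProb` are hypothetical names; the real code also calls `normalizeSuccProbs()` afterwards, which this sketch omits.

```cpp
// Resulting probabilities of the two root edges touched by the fix-up.
struct RootEdges {
  double toFalse;   // RootMBB -> MBB's former false successor
  double toTarget;  // RootMBB -> TargetMBB (the new edge)
};

// edgeProb:   probability of the redirected edge (formerly RootMBB -> MBB).
// targetProb: probability of the removed edge MBB -> TargetMBB.
RootEdges hoistTargetProb(double edgeProb, double targetProb) {
  const double hoisted = edgeProb * targetProb;  // TargetProb = Prob * TargetProb
  return {edgeProb - hoisted, hoisted};          // Prob = Prob - TargetProb
}
```

For instance, if RootMBB reached MBB with probability 0.5 and MBB branched to TargetMBB with probability 0.25, the new direct edge to TargetMBB gets 0.125 and the redirected edge keeps 0.375, so the total mass leaving RootMBB along this path is unchanged.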

				// Driver function for optimization: find the valid candidate and apply
				// the transformation.
				bool X86CondBrFolding::optimize() {
				bool Changed = false;
				LLVM_DEBUG(dbgs() << "***** X86CondBr Folding on Function: " << MF.getName()
				<< " *****\n");
				// Setup data structures.
				MBBInfos.resize(MF.getNumBlockIDs());
				for (auto &MBB : MF)
				MBBInfos[MBB.getNumber()] = analyzeMBB(MBB);

				for (auto &MBB : MF) {
				TargetMBBInfo *MBBInfo = getMBBInfo(&MBB);
				if (!MBBInfo || !MBBInfo->CmpBrOnly)
				continue;
				if (MBB.pred_size() != 1)
				continue;
				LLVM_DEBUG(dbgs() << "Work on MBB." << MBB.getNumber()
				<< " CmpValue: " << MBBInfo->CmpValue << "\n");
				SmallVector<MachineBasicBlock *, 4> BranchPath;
				if (!findPath(&MBB, BranchPath))
				continue;

				#ifndef NDEBUG
				LLVM_DEBUG(dbgs() << "Found one path (len=" << BranchPath.size() << "):\n");
				int Index = 1;
				LLVM_DEBUG(dbgs() << "Target MBB is: " << MBB << "\n");
				for (auto I = BranchPath.rbegin(); I != BranchPath.rend(); ++I, ++Index) {
				MachineBasicBlock *PMBB = *I;
				TargetMBBInfo *PMBBInfo = getMBBInfo(PMBB);
				LLVM_DEBUG(dbgs() << "Path MBB (" << Index << " of " << BranchPath.size()
				<< ") is " << *PMBB);
				LLVM_DEBUG(dbgs() << "CC=" << PMBBInfo->BranchCode
				<< " Val=" << PMBBInfo->CmpValue
				<< " CmpBrOnly=" << PMBBInfo->CmpBrOnly << "\n\n");
				}
				#endif
				optimizeCondBr(MBB, BranchPath);
				Changed = true;
				}
				NumFixedCondBrs += RemoveList.size();
				for (auto MBBI : RemoveList) {
				for (auto *Succ : MBBI->successors())
				MBBI->removeSuccessor(Succ);
				MBBI->eraseFromParent();
				}

				return Changed;
				}

				// Analyze instructions that generate CondCode and extract information.
				bool X86CondBrFolding::analyzeCompare(const MachineInstr &MI, unsigned &SrcReg,
				int &CmpValue) {
				unsigned SrcRegIndex = 0;
				unsigned ValueIndex = 0;
				switch (MI.getOpcode()) {
				// TODO: handle test instructions.
				default:
				return false;
				case X86::CMP64ri32:
				case X86::CMP64ri8:
				case X86::CMP32ri:
				case X86::CMP32ri8:
				case X86::CMP16ri:
				case X86::CMP16ri8:
				case X86::CMP8ri:
				SrcRegIndex = 0;
				ValueIndex = 1;
				break;
				case X86::SUB64ri32:
				case X86::SUB64ri8:
				case X86::SUB32ri:
				case X86::SUB32ri8:
				case X86::SUB16ri:
				case X86::SUB16ri8:
				case X86::SUB8ri:
				SrcRegIndex = 1;
				ValueIndex = 2;
				break;
				}
				SrcReg = MI.getOperand(SrcRegIndex).getReg();
				assert(MI.getOperand(ValueIndex).isImm() && "Expecting Imm operand");
				CmpValue = MI.getOperand(ValueIndex).getImm();
				return true;
				}

				// Analyze a candidate MBB and extract all the information needed.
				// The valid candidate will have two successors.
				// It also should have a sequence of
				// Branch_instr,
				// CondBr,
				// UnCondBr.
				// Return TargetMBBInfo if MBB is a valid candidate and nullptr otherwise.
				std::unique_ptr<TargetMBBInfo>
				X86CondBrFolding::analyzeMBB(MachineBasicBlock &MBB) {
				MachineBasicBlock *TBB;
				MachineBasicBlock *FBB;
				MachineInstr *BrInstr;
				MachineInstr *CmpInstr;
				X86::CondCode CC;
				unsigned SrcReg;
				int CmpValue;
				bool Modified;
				bool CmpBrOnly;

				if (MBB.succ_size() != 2)
				return nullptr;

				CmpBrOnly = true;
				FBB = TBB = nullptr;
				CmpInstr = nullptr;
				MachineBasicBlock::iterator I = MBB.end();
				while (I != MBB.begin()) {
				--I;
				if (I->isDebugValue())
				continue;
				if (I->getOpcode() == X86::JMP_1) {
				if (FBB)
				return nullptr;
				FBB = I->getOperand(0).getMBB();
				continue;
				}
				if (I->isBranch()) {
				if (TBB)
				return nullptr;
				CC = X86::getCondFromBranchOpc(I->getOpcode());
				switch (CC) {
				default:
				return nullptr;
				case X86::COND_E:
				case X86::COND_L:
				case X86::COND_G:
				case X86::COND_NE:
				case X86::COND_LE:
				case X86::COND_GE:
				break;
				}
				TBB = I->getOperand(0).getMBB();
				BrInstr = &*I;
				continue;
				}
				if (analyzeCompare(*I, SrcReg, CmpValue)) {
				if (CmpInstr)
				return nullptr;
				CmpInstr = &*I;
				continue;
				}
				CmpBrOnly = false;
				break;
				}

				if (!TBB || !FBB || !CmpInstr)
				return nullptr;

				// Simplify CondCode. Note this is only to simplify the findPath logic
				// and will not change the instruction here.
				switch (CC) {
				case X86::COND_NE:
				CC = X86::COND_E;
				std::swap(TBB, FBB);
				Modified = true;
				break;
				case X86::COND_LE:
				if (CmpValue == INT_MAX)
				return nullptr;
				CC = X86::COND_L;
				CmpValue += 1;
				Modified = true;
				break;
				case X86::COND_GE:
				if (CmpValue == INT_MIN)
				return nullptr;
				CC = X86::COND_G;
				CmpValue -= 1;
				Modified = true;
				break;
				default:
				Modified = false;
				break;
				}
				return llvm::make_unique<TargetMBBInfo>(TargetMBBInfo{
				TBB, FBB, BrInstr, CmpInstr, CC, SrcReg, CmpValue, Modified, CmpBrOnly});
				}
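The condition-code simplification at the end of `analyzeMBB` can be sketched on its own. This is an illustrative model under stated assumptions, not the pass's code: `Cond`, `NormalizedBranch`, and `normalize` are hypothetical names, and `std::optional` (C++17) stands in for the `nullptr` return used to reject a block.

```cpp
#include <climits>
#include <optional>

// Hypothetical stand-in for the six X86::CondCode values analyzeMBB accepts.
enum class Cond { E, NE, L, LE, G, GE };

struct NormalizedBranch {
  Cond cc;           // always E, L, or G after normalization
  int cmpValue;      // possibly adjusted immediate
  bool swapTargets;  // true when TBB and FBB must be exchanged (NE -> E)
  bool modified;     // mirrors TargetMBBInfo::Modified
};

// Rewrite NE/LE/GE into E/L/G so later logic only handles three codes.
// Returns std::nullopt where the adjusted immediate would overflow.
std::optional<NormalizedBranch> normalize(Cond cc, int cmpValue) {
  switch (cc) {
  case Cond::NE:
    return NormalizedBranch{Cond::E, cmpValue, true, true};
  case Cond::LE:  // x <= v  is  x < v + 1
    if (cmpValue == INT_MAX)
      return std::nullopt;
    return NormalizedBranch{Cond::L, cmpValue + 1, false, true};
  case Cond::GE:  // x >= v  is  x > v - 1
    if (cmpValue == INT_MIN)
      return std::nullopt;
    return NormalizedBranch{Cond::G, cmpValue - 1, false, true};
  default:        // E, L, G are already in normal form
    return NormalizedBranch{cc, cmpValue, false, false};
  }
}
```

The `modified` flag matters later: the instruction itself is left untouched at analysis time, and `fixupModifiedCond` rewrites the branch only if the block ends up on a path that is actually transformed.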

				bool X86CondBrFoldingPass::runOnMachineFunction(MachineFunction &MF) {
				const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
				if (!ST.threewayBranchProfitable())
				return false;
				const X86InstrInfo *TII = ST.getInstrInfo();
				const MachineBranchProbabilityInfo *MBPI =
				&getAnalysis<MachineBranchProbabilityInfo>();

				X86CondBrFolding CondBr(TII, MBPI, MF);
				return CondBr.optimize();
				}

llvm/trunk/lib/Target/X86/X86Subtarget.h

protected:
/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
///
// FIXME: this is a known good value for Yonah. How about others?
unsigned MaxInlineSizeThreshold = 128;

/// Indicates target prefers 256 bit instructions.
bool Prefer256Bit = false;

/// Threeway branch is profitable in this subtarget.
bool ThreewayBranchProfitable = false;

/// What processor and OS we're targeting.
Triple TargetTriple;

/// GlobalISel related APIs.
std::unique_ptr<CallLowering> CallLoweringInfo;
std::unique_ptr<LegalizerInfo> Legalizer;
std::unique_ptr<RegisterBankInfo> RegBankInfo;
std::unique_ptr<InstructionSelector> InstSelector;
...
public:
bool hasSHSTK() const { return HasSHSTK; }
bool hasCLFLUSHOPT() const { return HasCLFLUSHOPT; }
bool hasCLWB() const { return HasCLWB; }
bool hasWBNOINVD() const { return HasWBNOINVD; }
bool hasRDPID() const { return HasRDPID; }
bool hasWAITPKG() const { return HasWAITPKG; }
bool hasPCONFIG() const { return HasPCONFIG; }
bool hasSGX() const { return HasSGX; }
bool threewayBranchProfitable() const { return ThreewayBranchProfitable; }
bool hasINVPCID() const { return HasINVPCID; }
bool useRetpolineIndirectCalls() const { return UseRetpolineIndirectCalls; }
bool useRetpolineIndirectBranches() const {
return UseRetpolineIndirectBranches;
}
bool useRetpolineExternalThunk() const { return UseRetpolineExternalThunk; }

unsigned getPreferVectorWidth() const { return PreferVectorWidth; }

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp

#include <string>

using namespace llvm;

static cl::opt<bool> EnableMachineCombinerPass("x86-machine-combiner",
cl::desc("Enable the machine combiner pass"),
cl::init(true), cl::Hidden);

				static cl::opt<bool> EnableCondBrFoldingPass("x86-condbr-folding",
				cl::desc("Enable the conditional branch "
				"folding pass"),
				cl::init(true), cl::Hidden);

namespace llvm {

void initializeWinEHStatePassPass(PassRegistry &);
void initializeFixupLEAPassPass(PassRegistry &);
void initializeShadowCallStackPass(PassRegistry &);
void initializeX86CallFrameOptimizationPass(PassRegistry &);
void initializeX86CmovConverterPassPass(PassRegistry &);
void initializeX86ExecutionDomainFixPass(PassRegistry &);
...
}

bool X86PassConfig::addGlobalInstructionSelect() {
addPass(new InstructionSelect());
return false;
}

bool X86PassConfig::addILPOpts() {
				if (EnableCondBrFoldingPass)
				addPass(createX86CondBrFolding());
				lebedev.ri: So wait, shouldn't this respect `FeatureMergeToThreeWayBranch`?
addPass(&EarlyIfConverterID);
if (EnableMachineCombinerPass)
addPass(&MachineCombinerID);
addPass(createX86CmovConverterPass());
return true;
}

bool X86PassConfig::addPreISel() {

llvm/trunk/test/CodeGen/X86/O3-pipeline.ll

; CHECK-NEXT: Expand ISel Pseudo-instructions
; CHECK-NEXT: X86 Domain Reassignment Pass
; CHECK-NEXT: Early Tail Duplication
; CHECK-NEXT: Optimize machine instruction PHIs
; CHECK-NEXT: Slot index numbering
; CHECK-NEXT: Merge disjoint stack slots
; CHECK-NEXT: Local Stack Slot Allocation
; CHECK-NEXT: Remove dead machine instructions
; CHECK-NEXT: X86 CondBr Folding
; CHECK-NEXT: MachineDominator Tree Construction
; CHECK-NEXT: Machine Natural Loop Construction
; CHECK-NEXT: Machine Trace Metrics
; CHECK-NEXT: Early If-Conversion
; CHECK-NEXT: Machine InstCombiner
; CHECK-NEXT: X86 cmov Conversion
; CHECK-NEXT: MachineDominator Tree Construction
; CHECK-NEXT: Machine Natural Loop Construction

llvm/trunk/test/CodeGen/X86/condbr_if.ll

				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=sandybridge %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=ivybridge %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=haswell %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=broadwell %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=skylake %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=skx %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=NOTMERGE

				define i32 @length2_1(i32) {
				%2 = icmp slt i32 %0, 3
				br i1 %2, label %3, label %5

				; <label>:3:
				%4 = tail call i32 (...) @f1()
				br label %13

				; <label>:5:
				%6 = icmp slt i32 %0, 40
				br i1 %6, label %7, label %13

				; <label>:7:
				%8 = icmp eq i32 %0, 3
				br i1 %8, label %9, label %11

				; <label>:9:
				%10 = tail call i32 (...) @f2()
				br label %11

				; <label>:11:
				%12 = tail call i32 (...) @f3() #2
				br label %13

				; <label>:13:
				ret i32 0
				}
				; MERGE-LABEL: length2_1
				; MERGE: cmpl $3
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; NOTMERGE-LABEL: length2_1
				; NOTMERGE: cmpl $2
				; NOTMERGE-NEXT: jg

				define i32 @length2_2(i32) {
				%2 = icmp sle i32 %0, 2
				br i1 %2, label %3, label %5

				; <label>:3:
				%4 = tail call i32 (...) @f1()
				br label %13

				; <label>:5:
				%6 = icmp slt i32 %0, 40
				br i1 %6, label %7, label %13

				; <label>:7:
				%8 = icmp eq i32 %0, 3
				br i1 %8, label %9, label %11

				; <label>:9:
				%10 = tail call i32 (...) @f2()
				br label %11

				; <label>:11:
				%12 = tail call i32 (...) @f3() #2
				br label %13

				; <label>:13:
				ret i32 0
				}
				; MERGE-LABEL: length2_2
				; MERGE: cmpl $3
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; NOTMERGE-LABEL: length2_2
				; NOTMERGE: cmpl $2
				; NOTMERGE-NEXT: jg

				define i32 @length2_3(i32) {
				%2 = icmp sgt i32 %0, 3
				br i1 %2, label %3, label %5

				; <label>:3:
				%4 = tail call i32 (...) @f1()
				br label %13

				; <label>:5:
				%6 = icmp sgt i32 %0, -40
				br i1 %6, label %7, label %13

				; <label>:7:
				%8 = icmp eq i32 %0, 3
				br i1 %8, label %9, label %11

				; <label>:9:
				%10 = tail call i32 (...) @f2()
				br label %11

				; <label>:11:
				%12 = tail call i32 (...) @f3() #2
				br label %13

				; <label>:13:
				ret i32 0
				}
				; MERGE-LABEL: length2_3
				; MERGE: cmpl $3
				; MERGE-NEXT: jl
				; MERGE-NEXT: jle
				; NOTMERGE-LABEL: length2_3
				; NOTMERGE: cmpl $4
				; NOTMERGE-NEXT: jl

				define i32 @length2_4(i32) {
				%2 = icmp sge i32 %0, 4
				br i1 %2, label %3, label %5

				; <label>:3:
				%4 = tail call i32 (...) @f1()
				br label %13

				; <label>:5:
				%6 = icmp sgt i32 %0, -40
				br i1 %6, label %7, label %13

				; <label>:7:
				%8 = icmp eq i32 %0, 3
				br i1 %8, label %9, label %11

				; <label>:9:
				%10 = tail call i32 (...) @f2()
				br label %11

				; <label>:11:
				%12 = tail call i32 (...) @f3() #2
				br label %13

				; <label>:13:
				ret i32 0
				}
				; MERGE-LABEL: length2_4
				; MERGE: cmpl $3
				; MERGE-NEXT: jl
				; MERGE-NEXT: jle
				; NOTMERGE-LABEL: length2_4
				; NOTMERGE: cmpl $4
				; NOTMERGE-NEXT: jl

				declare i32 @f1(...)
				declare i32 @f2(...)
				declare i32 @f3(...)

				define i32 @length1_1(i32) {
				%2 = icmp sgt i32 %0, 5
				br i1 %2, label %3, label %5

				; <label>:3:
				%4 = tail call i32 (...) @f1()
				br label %9

				; <label>:5:
				%6 = icmp eq i32 %0, 5
				br i1 %6, label %7, label %9

				; <label>:7:
				%8 = tail call i32 (...) @f2()
				br label %9

				; <label>:9:
				ret i32 0
				}
				; MERGE-LABEL: length1_1
				; MERGE: cmpl $5
				; MERGE-NEXT: jl
				; MERGE-NEXT: jle
				; NOTMERGE-LABEL: length1_1
				; NOTMERGE: cmpl $6
				; NOTMERGE-NEXT: jl
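What the MERGE checks for `length1_1` rely on can be modelled outside of LLVM. The sketch below is a plain C++ model of the control flow, assuming the unmerged code performs two compares (against 6, then against 5) and the merged code a single compare against 5 with three outcomes. `Callee`, `twoCompares`, and `threeWay` are hypothetical names used only for this illustration.

```cpp
// Which function the IR in length1_1 ends up calling for a given input.
enum class Callee { None, F1, F2 };

// Unmerged shape: two separate compares.
Callee twoCompares(int x) {
  if (x >= 6)               // first compare, against 6
    return Callee::F1;
  if (x == 5)               // second compare, against 5
    return Callee::F2;
  return Callee::None;
}

// Merged shape: one compare against 5, branching three ways on the flags.
Callee threeWay(int x) {
  if (x < 5)                // jl: below the pivot
    return Callee::None;
  if (x <= 5)               // jle: only x == 5 can reach this test
    return Callee::F2;
  return Callee::F1;        // fall through: above the pivot
}
```

Both functions pick the same callee for every input, which is the property the transformation has to preserve while dropping one compare from the `x == 5` path.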

llvm/trunk/test/CodeGen/X86/condbr_switch.ll

				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=sandybridge %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=ivybridge %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=haswell %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=broadwell %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=skylake %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu -mcpu=skx %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=MERGE
				; RUN: llc -mtriple=x86_64-linux-gnu %s -o - -verify-machineinstrs | FileCheck %s --check-prefix=NOTMERGE

				@v1 = common dso_local local_unnamed_addr global i32 0, align 4
				@v2 = common dso_local local_unnamed_addr global i32 0, align 4
				@v3 = common dso_local local_unnamed_addr global i32 0, align 4
				@v4 = common dso_local local_unnamed_addr global i32 0, align 4
				@v5 = common dso_local local_unnamed_addr global i32 0, align 4
				@v6 = common dso_local local_unnamed_addr global i32 0, align 4
				@v7 = common dso_local local_unnamed_addr global i32 0, align 4
				@v8 = common dso_local local_unnamed_addr global i32 0, align 4
				@v9 = common dso_local local_unnamed_addr global i32 0, align 4
				@v10 = common dso_local local_unnamed_addr global i32 0, align 4
				@v11 = common dso_local local_unnamed_addr global i32 0, align 4
				@v12 = common dso_local local_unnamed_addr global i32 0, align 4
				@v13 = common dso_local local_unnamed_addr global i32 0, align 4
				@v14 = common dso_local local_unnamed_addr global i32 0, align 4
				@v15 = common dso_local local_unnamed_addr global i32 0, align 4

				define dso_local i32 @fourcases(i32 %n) {
				entry:
				switch i32 %n, label %return [
				i32 111, label %sw.bb
				i32 222, label %sw.bb1
				i32 3665, label %sw.bb2
				i32 4444, label %sw.bb4
				]

				sw.bb:
				%0 = load i32, i32* @v1, align 4
				br label %return

				sw.bb1:
				%1 = load i32, i32* @v2, align 4
				%add = add nsw i32 %1, 12
				br label %return

				sw.bb2:
				%2 = load i32, i32* @v3, align 4
				%add3 = add nsw i32 %2, 13
				br label %return

				sw.bb4:
				%3 = load i32, i32* @v1, align 4
				%4 = load i32, i32* @v2, align 4
				%add5 = add nsw i32 %4, %3
				br label %return

				return:
				%retval.0 = phi i32 [ %add5, %sw.bb4 ], [ %add3, %sw.bb2 ], [ %add, %sw.bb1 ], [ %0, %sw.bb ], [ 0, %entry ]
				ret i32 %retval.0
				}
				; MERGE-LABEL: fourcases
				; MERGE: cmpl $3665
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; NOTMERGE: cmpl $3664
				; NOTMERGE-NEXT: jg

				define dso_local i32 @fifteencases(i32) {
				switch i32 %0, label %32 [
				i32 -111, label %2
				i32 -13, label %4
				i32 25, label %6
				i32 37, label %8
				i32 89, label %10
				i32 111, label %12
				i32 213, label %14
				i32 271, label %16
				i32 283, label %18
				i32 325, label %20
				i32 327, label %22
				i32 429, label %24
				i32 500, label %26
				i32 603, label %28
				i32 605, label %30
				]

				; <label>:2
				%3 = load i32, i32* @v1, align 4
				br label %32

				; <label>:4
				%5 = load i32, i32* @v2, align 4
				br label %32

				; <label>:6
				%7 = load i32, i32* @v3, align 4
				br label %32

				; <label>:8
				%9 = load i32, i32* @v4, align 4
				br label %32

				; <label>:10
				%11 = load i32, i32* @v5, align 4
				br label %32

				; <label>:12
				%13 = load i32, i32* @v6, align 4
				br label %32

				; <label>:14
				%15 = load i32, i32* @v7, align 4
				br label %32

				; <label>:16
				%17 = load i32, i32* @v8, align 4
				br label %32

				; <label>:18
				%19 = load i32, i32* @v9, align 4
				br label %32

				; <label>:20
				%21 = load i32, i32* @v10, align 4
				br label %32

				; <label>:22
				%23 = load i32, i32* @v11, align 4
				br label %32

				; <label>:24
				%25 = load i32, i32* @v12, align 4
				br label %32

				; <label>:26
				%27 = load i32, i32* @v13, align 4
				br label %32

				; <label>:28:
				%29 = load i32, i32* @v14, align 4
				br label %32

				; <label>:30:
				%31 = load i32, i32* @v15, align 4
				br label %32

				; <label>:32:
				%33 = phi i32 [ %31, %30 ], [ %29, %28 ], [ %27, %26 ], [ %25, %24 ], [ %23, %22 ], [ %21, %20 ], [ %19, %18 ], [ %17, %16 ], [ %15, %14 ], [ %13, %12 ], [ %11, %10 ], [ %9, %8 ], [ %7, %6 ], [ %5, %4 ], [ %3, %2 ], [ 0, %1 ]
				ret i32 %33
				}
				; MERGE-LABEL: fifteencases
				; MERGE: cmpl $271
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; MERGE: cmpl $37
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; MERGE: cmpl $429
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; MERGE: cmpl $325
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; MERGE: cmpl $603
				; MERGE-NEXT: jg
				; MERGE-NEXT: jge
				; NOTMERGE-LABEL: fifteencases
				; NOTMERGE: cmpl $270
				; NOTMERGE-NEXT: jle

Bench: 4evencases.cc

Bench: 15evencases.cc

Revision Contents

Diff 168695
