This is an archive of the discontinued LLVM Phabricator instance.

[X86] condition branches folding for three-way conditional codes
ClosedPublic

Authored by xur on May 9 2018, 2:20 PM.

Details

Summary

This file defines a pass that optimizes conditional branches on x86 by taking advantage of the three-way conditional code generated by compare instructions.

Currently, it tries to hoist an EQ/NE conditional branch to a dominating conditional branch where the same EQ/NE condition code is computed. An example:

bb_0:
  cmp %0, 19
  jg bb_1
  jmp bb_2
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_4
bb_4:
  cmp %0, 20
  je bb_5
  jmp bb_6

Here we could combine the two compares in bb_0 and bb_4 and have the
following code:

bb_0:
  cmp %0, 20
  jg bb_1
  jl bb_2
  jmp bb_5
bb_1:
  cmp %0, 40
  jg bb_3
  jmp bb_6

For the case of %0 == 20 (bb_5), we eliminate two jumps, and the control height for bb_6 is also reduced. bb_4 is gone after the optimization.

This optimization is motivated by the branch pattern generated by switch lowering: we always have a pivot-1 compare for the inner nodes, and we compare against the pivot again at the leaf (as in the pattern above).
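For illustration, here is a sketch (hypothetical code, not part of the patch) of the kind of sparse switch that is lowered to such a compare tree rather than a jump table; the exact pivots chosen depend on the lowering heuristics:

int classify(int x) {
  // Sparse, evenly distributed cases: switch lowering builds a balanced
  // compare tree, testing pivot-1 (jg) at inner nodes and the pivot
  // itself (je) at the leaves -- the bb_0/bb_4 pair combined above.
  switch (x) {
  case 10: return 1;
  case 20: return 2;
  case 30: return 3;
  case 40: return 4;
  default: return 0;
  }
}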

My tests on Haswell show that this optimization improves a tight nested loop consisting of evenly distributed switch cases by 8% to 10%.

Thanks,

-Rong

Diff Detail

Event Timeline

xur created this revision.May 9 2018, 2:20 PM
xur added a comment.May 9 2018, 2:28 PM

For reference: there are a total of 1301 instances of this optimization in a bootstrap build of llvm:

grep "Found one" bootstrap_buildlog |sort |uniq -c

1094 Found one path (len=1):
  176 Found one path (len=2):
   30 Found one path (len=3):
    1 Found one path (len=4):
davidxl added a subscriber: davidxl.
xur added a comment.May 29 2018, 2:12 PM

It has been a while. Kindly ping.

xur added a reviewer: RKSimon.May 29 2018, 2:13 PM

Please can you give more details on what range of tests you performed and whether you tested on anything other than Haswell - bunching conditional branches like this can make things very difficult for some branch prediction units.

xur added a comment.May 31 2018, 10:43 AM

Hi Simon,
This patch was motivated by the code patterns generated in switch lowering. The tests I used were switch statements: one with 4 cases and one with 15 cases, all evenly distributed.

My initial test was on ixion-haswell.

Upon getting your review comment, I tested some other platforms:

iota-sandybridge:

both the 4-case and the 15-case tests show no performance difference

iota-ivybridge:

the 4-case test has a 5% performance gain; the 15-case test has no performance difference

ixion-broadwell:

the 4-case test has a 1%-2% performance loss (might be noise); the 15-case test has an 8% performance gain

indus-skylake:

the 4-case test has a 7% performance gain; the 15-case test has a 3% performance gain.

Would it be possible for you to put the tests somewhere so we can compare the effect on some AMD and older Intel machines?

I attached my test cases to this email.

They are slightly different from the ones I used -- they were using our
internal test infrastructure, and I have now made them standalone.

I don't have access to machines other than the ones I mentioned in the
previous message. Simon: could you test on other machines?

Let me know if you would prefer that I make the tests part of the patch.

xur added a comment.Jun 13 2018, 9:54 AM

Kindly ping.

xbolva00 added inline comments.Jun 23 2018, 5:09 AM
lib/Target/X86/X86CondBrFolding.cpp
202

!TBB etc..

xur updated this revision to Diff 166733.Sep 24 2018, 12:25 PM
xur marked an inline comment as done.

Synced to the latest compiler and integrated xbolva00's comments.

Simon: have you had a chance to measure the performance on other platforms?

Thanks,

-Rong

xbolva00 added inline comments.Sep 25 2018, 1:57 AM
test/CodeGen/X86/condbr_switch.ll
18

Maybe remove

"local_unnamed_addr", "dso_local", .. ?

Did you bootstrap clang/llvm with this patch? Maybe also run SPEC or a similar benchmark?

xur added a comment.Sep 25 2018, 11:41 AM

Did you bootstrap clang/llvm with this patch? Maybe also run SPEC or a similar benchmark?

Yes, I bootstrapped clang/llvm and tested SPEC2006 and Google internal benchmarks.

Nice! This looks good, ping @RKSimon for more suggestions, if any :)

craig.topper added inline comments.Sep 25 2018, 12:08 PM
lib/Target/X86/X86CondBrFolding.cpp
466

CmpVlaue->CmpValue

523

This might just be me, but I kind of feel like analyzeMBB should be part of X86CondBrFolding and the TargetMBBInfo struct should only be created if it's needed, with the information the analysis collected. Right now we speculatively create a struct we have to delete if we fail the checks.
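A minimal sketch of that suggested shape, assuming the analysis is based on TargetInstrInfo::analyzeBranch (the method and struct names follow the patch; the body is an assumption, not the patch's actual code):

std::unique_ptr<TargetMBBInfo>
X86CondBrFolding::analyzeMBB(MachineBasicBlock &MBB) {
  MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
  SmallVector<MachineOperand, 4> Cond;
  // Run the candidate checks first and bail out before allocating anything.
  // TII is the pass's cached X86InstrInfo; analyzeBranch returns true on failure.
  if (TII->analyzeBranch(MBB, TBB, FBB, Cond))
    return nullptr;
  // Only materialize the struct once the block is known to be a candidate.
  return llvm::make_unique<TargetMBBInfo>(/* fields collected above */);
}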

569

Minor, add blank line between the functions here.

test/CodeGen/X86/switch-bt.ll
140–141

Is this comment about 29 stale even in the original code?


Hi Rong,

Sorry for the late reply.
I can help with testing your patch on AMD Jaguar. Tomorrow morning I will post my findings (I am not at work now).

-Andrea

craig.topper added inline comments.Sep 25 2018, 12:33 PM
lib/Target/X86/X86CondBrFolding.cpp
276

&* isn't needed

350

&* isn't needed

413

I don't think this "&*" is neededed. get() should return a pointer. you shouldn't need to dereference that and take the address of it.

415

This assert doesn't really accomplish anything: MBBInfo was already dereferenced on the line before, so we would have crashed before reaching the assert.

420

&* isn't needed

431

&* isn't needed.

549

&* isn't needed

xur marked 10 inline comments as done.Sep 25 2018, 3:40 PM

Thanks for Craig's review. Here is the new patch that integrates most of his comments.
I will post another version that moves analyzeMBB() to X86CondBrFolding, as suggested by Craig.

lib/Target/X86/X86CondBrFolding.cpp
415

Moved before the dereference.

523

You have a good point. The current way is easier for sharing the analysis result, but the structure is not the best.
I'll restructure this in a later patch.

test/CodeGen/X86/switch-bt.ll
140–141

You are right! I did not pay attention to this.
I deleted the stale comments.

xur updated this revision to Diff 167019.Sep 25 2018, 3:41 PM
xur marked 2 inline comments as done.

New patch that integrates Craig's review comments.

xur updated this revision to Diff 167030.Sep 25 2018, 5:07 PM

This patch integrates Craig's suggestion to move analyzeMBB() into the X86CondBrFolding class.
Also reformatted the code using clang-format.

craig.topper added inline comments.Sep 25 2018, 5:25 PM
lib/Target/X86/X86CondBrFolding.cpp
141

Drop the &*

395

No need for a condition here. If you return a null std::unique_ptr it can still be moved into the map. Though since you've created an entry in the map for every basic block, you might be able to use a vector indexed by MBB->getNumber(). You can get the total number of indices to initially size the vector from MF->getNumBlockIDs(). Then you can use operator[] on the vector instead of find. This should work as long as no basic blocks are created by this pass.
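A rough sketch of that suggestion (member and function names are hypothetical):

// Sized once up front; a null entry means "no info for this block".
// Valid as long as the pass creates no new basic blocks.
std::vector<std::unique_ptr<TargetMBBInfo>> MBBInfos;

void X86CondBrFolding::buildMBBInfos(MachineFunction &MF) {
  MBBInfos.resize(MF.getNumBlockIDs());
  for (MachineBasicBlock &MBB : MF)
    MBBInfos[MBB.getNumber()] = analyzeMBB(MBB);
}

TargetMBBInfo *X86CondBrFolding::getMBBInfo(MachineBasicBlock *MBB) {
  // operator[] replaces the DenseMap find().
  return MBBInfos[MBB->getNumber()].get();
}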

404

No need for the * and & here.

499

Can you just return nullptr instead of a default-constructed std::unique_ptr?

Hi Rong,

On Jaguar, this pass may increase branch density to the point where it hurts the performance of the branch predictor.

Branch prediction in Jaguar is affected by branch density.
To give you a bit of background: Jaguar's BTB is logically partitioned into two levels. The first level is specialized for sparse branches; the second level is specialized for dense branches and is dynamically allocated (when there are more than 2 branches per cache line).
The L2 is a bit slower (being dynamically allocated) and tends to have a lower throughput than the L1. So, ideally, the L1 should be used as much as possible.

This patch increases branch density to the point where the L2 BTB usage increases, and the efficiency of the branch predictor decreases.

Bench: 4evencases.cc

Without your patch (10 runs):

Each iteration uses 902058 nano seconds
Case counts: 0 261000000 250000000 246000000 243000000
Each iteration uses 887837 nano seconds
Case counts: 0 281000000 253000000 227000000 239000000
Each iteration uses 887856 nano seconds
Case counts: 0 256000000 254000000 236000000 254000000
Each iteration uses 880632 nano seconds
Case counts: 0 279000000 236000000 244000000 241000000
Each iteration uses 1.03057e+06 nano seconds
Case counts: 0 258000000 257000000 243000000 242000000
Each iteration uses 883759 nano seconds
Case counts: 0 248000000 262000000 278000000 212000000
Each iteration uses 910438 nano seconds
Case counts: 0 248000000 254000000 243000000 255000000
Each iteration uses 885671 nano seconds
Case counts: 0 258000000 266000000 231000000 245000000
Each iteration uses 912325 nano seconds
Case counts: 0 225000000 264000000 270000000 241000000
Each iteration uses 904952 nano seconds
Case counts: 0 261000000 240000000 241000000 258000000

With your patch (10 runs):

Each iteration uses 916110 nano seconds
Case counts: 0 223000000 266000000 263000000 248000000
Each iteration uses 918773 nano seconds
Case counts: 0 266000000 230000000 236000000 268000000
Each iteration uses 903100 nano seconds
Case counts: 0 250000000 249000000 231000000 270000000
Each iteration uses 923196 nano seconds
Case counts: 0 241000000 243000000 276000000 240000000
Each iteration uses 911282 nano seconds
Case counts: 0 241000000 239000000 266000000 254000000
Each iteration uses 910201 nano seconds
Case counts: 0 210000000 263000000 260000000 267000000
Each iteration uses 925672 nano seconds
Case counts: 0 245000000 265000000 236000000 254000000
Each iteration uses 932643 nano seconds
Case counts: 0 235000000 259000000 256000000 250000000
Each iteration uses 937735 nano seconds
Case counts: 0 261000000 242000000 259000000 238000000
Each iteration uses 954895 nano seconds
Case counts: 0 254000000 239000000 271000000 236000000

Overall, 4evencases.cc is ~2% slower with this patch.

Bench: 15evencases.cc

Without your patch (10 runs):

Each iteration uses 1.10148e+06 nano seconds
Case counts: 0 56000000 60000000 68000000 61000000 69000000 64000000 80000000 64000000 68000000 66000000 83000000 74000000 50000000 73000000 64000000
Each iteration uses 1.0648e+06 nano seconds
Case counts: 0 71000000 59000000 55000000 64000000 73000000 57000000 55000000 74000000 76000000 67000000 77000000 57000000 82000000 54000000 79000000
Each iteration uses 1.06872e+06 nano seconds
Case counts: 0 55000000 80000000 59000000 45000000 70000000 61000000 68000000 72000000 77000000 67000000 88000000 63000000 61000000 77000000 57000000
Each iteration uses 1.04146e+06 nano seconds
Case counts: 0 68000000 61000000 67000000 50000000 70000000 68000000 73000000 69000000 61000000 78000000 69000000 64000000 67000000 75000000 60000000
Each iteration uses 1.0549e+06 nano seconds
Case counts: 0 66000000 75000000 64000000 64000000 74000000 78000000 63000000 64000000 67000000 57000000 65000000 63000000 74000000 66000000 60000000
Each iteration uses 1.04246e+06 nano seconds
Case counts: 0 66000000 69000000 63000000 76000000 66000000 78000000 44000000 66000000 61000000 75000000 66000000 70000000 67000000 64000000 69000000
Each iteration uses 1.07907e+06 nano seconds
Case counts: 0 63000000 66000000 81000000 68000000 56000000 71000000 71000000 68000000 58000000 65000000 64000000 75000000 63000000 71000000 60000000
Each iteration uses 1.05432e+06 nano seconds
Case counts: 0 66000000 67000000 70000000 65000000 57000000 53000000 62000000 62000000 63000000 74000000 68000000 81000000 70000000 77000000 65000000
Each iteration uses 1.04041e+06 nano seconds
Case counts: 0 71000000 71000000 65000000 69000000 77000000 67000000 52000000 60000000 73000000 80000000 76000000 66000000 55000000 49000000 69000000
Each iteration uses 1.07782e+06 nano seconds
Case counts: 0 68000000 76000000 63000000 79000000 76000000 71000000 65000000 61000000 63000000 63000000 61000000 56000000 67000000 61000000 70000000

With your patch (10 runs):

Each iteration uses 1.11151e+06 nano seconds
Case counts: 0 64000000 64000000 73000000 72000000 69000000 75000000 66000000 70000000 77000000 59000000 50000000 74000000 68000000 58000000 61000000
Each iteration uses 1.28406e+06 nano seconds
Case counts: 0 68000000 63000000 66000000 69000000 68000000 58000000 71000000 60000000 80000000 66000000 80000000 69000000 57000000 62000000 63000000
Each iteration uses 1.18149e+06 nano seconds
Case counts: 0 67000000 68000000 66000000 69000000 71000000 67000000 64000000 69000000 72000000 61000000 73000000 60000000 66000000 71000000 56000000
Each iteration uses 1.30169e+06 nano seconds
Case counts: 0 74000000 66000000 69000000 64000000 70000000 64000000 59000000 61000000 53000000 75000000 74000000 58000000 72000000 68000000 73000000
Each iteration uses 1.15588e+06 nano seconds
Case counts: 0 62000000 66000000 67000000 62000000 79000000 65000000 59000000 54000000 65000000 61000000 62000000 82000000 74000000 68000000 74000000
Each iteration uses 1.16992e+06 nano seconds
Case counts: 0 69000000 64000000 71000000 60000000 60000000 70000000 64000000 77000000 65000000 75000000 61000000 70000000 61000000 77000000 56000000
Each iteration uses 1.2683e+06 nano seconds
Case counts: 0 66000000 69000000 73000000 76000000 72000000 59000000 64000000 61000000 53000000 78000000 66000000 63000000 66000000 57000000 77000000
Each iteration uses 1.17196e+06 nano seconds
Case counts: 0 67000000 69000000 84000000 52000000 56000000 70000000 58000000 64000000 71000000 72000000 67000000 68000000 68000000 73000000 61000000
Each iteration uses 1.28627e+06 nano seconds
Case counts: 0 70000000 70000000 70000000 57000000 73000000 71000000 70000000 57000000 57000000 67000000 69000000 61000000 60000000 76000000 72000000
Each iteration uses 1.28318e+06 nano seconds
Case counts: 0 61000000 72000000 70000000 80000000 68000000 59000000 59000000 65000000 49000000 78000000 65000000 64000000 64000000 77000000 69000000

Here the performance varies a lot depending on whether we are in the dense branch portion, or not. Note also that prediction through the L2 BTB has a lower throughput (as in branches per cycle).

Excluding outliers, the average performance degradation is ~8-10%.

While this analysis has been only conducted on Jaguar, I suspect that similar problems would affect AMD Bobcat too, since branch prediction for that core is similar to the one in Jaguar.

I wouldn't be surprised if instead this patch improves the performance of code on other big AMD cores like Bulldozer/ryzen.

However, at least for now, I suggest making this pass optional (i.e. making it opt-in for subtargets).
Definitely, it should be disabled for Jaguar (BtVer2) and Bobcat.

-Andrea

What is the definition of branch density?

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in a subtarget-specific way, it would be good to know the performance there.

Regards,

-Rong

What is the definition of branch density?

In this context, it is the number of branches per cache line.
(See: "AMD software optimization guide for family 16h processors" - Section 2.7.1.5: "Branch Marker Caching" ).

For Jaguar, L2 BTB entries are consumed when there are more than two branches per cache line.
There is a nice description in Section 2.7.1.2 "Branch Target Buffer".

I hope it helps.
-Andrea

In D46662#1246781, @xur wrote:

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in a subtarget-specific way, it would be good to know the performance there.

CC'ing @lebedev.ri and @GGanesh.
They should be able to help you with running those tests on Bulldozer/Ryzen. Unfortunately, I don't have access to those machines.

xur updated this revision to Diff 167216.Sep 26 2018, 5:08 PM
xur marked 3 inline comments as done.

Integrated Craig's new review comments; in particular, using a vector instead of a DenseMap for MBBInfos.

I suggest making this pass optional (i.e. making it opt-in for subtargets).

@xur are you gonna work on this? As you already measured some numbers, you can enable it for haswell, skylake, etc.

test/CodeGen/X86/condbr_if.ll
36

We don't need "dso_local" here

xur updated this revision to Diff 167430.Sep 27 2018, 10:25 PM

Using the new SubtargetFeature method (suggested by Andrea) to make this pass opt-in for subtargets.
Changed the tests accordingly.
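A sketch of what the opt-in gating looks like (the feature ended up being named FeatureMergeToThreeWayBranch, per the inline comments further below; the accessor and driver-call names here are assumptions):

bool X86CondBrFoldingPass::runOnMachineFunction(MachineFunction &MF) {
  const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
  // Skip subtargets that did not opt in, e.g. Jaguar (btver2), where the
  // added branch density hurts the branch predictor.
  if (!ST.threewayBranchProfitable())
    return false;
  X86CondBrFolding CondBr(ST.getInstrInfo(), MF);
  return CondBr.optimize(); // hypothetical driver call
}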

xbolva00 added a comment.EditedSep 28 2018, 11:13 AM

Looks better and better :)

if you are interested in some more cmp/jmp opportunities, please also see:
https://bugs.llvm.org/show_bug.cgi?id=38002
https://bugs.llvm.org/show_bug.cgi?id=39116

In D46662#1248780, @xur wrote:

Using the new SubtargetFeature method (suggested by Andrea) to make this pass opt-in for subtargets.
Changed the tests accordingly.

Thanks Rong.

I have some comments (See below).

Cheers
-Andrea

lib/Target/X86/X86CondBrFolding.cpp
91–102

Is field MBB ever used?

If not, then you can remove it.
It is quite a big auxiliary struct. I wonder if it could be smaller...

132–138

MBBInfos is initialized with one element (a descriptor) per machine basic block.
If we don't create new basic blocks, then this code could just be rewritten as:

return MBBInfos[MBB->getNumber()].get();

About the assert: if I understand your algorithm correctly, getMBBInfo is never called on a null MBB.
That being said, it is not wrong to have an assert that validates a precondition. If you want to keep it, then please add a meaningful message to it.

150

This assert is completely redundant.
If MBBInfo was null, then you would have already had a segfault at line 148 ...

(Also: please add string messages to asserts.)

399–401

Replace this loop with:

MBBInfos.resize(MF.getNumBlockIDs());

I am not even sure that you actually need this loop.
The loop at line 401 is essentially initializing MBBInfos, so you could just merge the two loops and have a single loop that calls MBBInfos.emplace_back(...).

447–448

unsigned.

475–476

Can this ever happen with those opcodes?
I think you can safely convert this check into an assert.

494

Is this variable really needed? It seems like you only use CC.

xur marked 6 inline comments as done.Oct 1 2018, 10:48 AM

Thanks for Andrea's recent review and very helpful suggestions. Here is the updated patch.

lib/Target/X86/X86CondBrFolding.cpp
91–102

Removed the MBB field.

The main reason for so many fields is to reuse the analysis information in the optimization phase.
We could reduce the struct size, but we would need to recompute.

132–138

You are absolutely right on this. I simplified the code.

150

I meant to have the assert right after the getMBBInfo() call, as getMBBInfo can return nullptr. Fixed.

399–401

Indeed, resize() is better here.

I don't think we can merge the two loops, because their iteration orders are different.

475–476

You are right! Replaced with an assert.

494

Good catch. This is a leftover from the refactoring.

xur updated this revision to Diff 167775.Oct 1 2018, 10:49 AM
xur marked 5 inline comments as done.

Integrated Andrea's review comments.

andreadb accepted this revision.Oct 1 2018, 12:22 PM

Thanks.
I don’t have other comments.

I didn’t properly review the part where you fix branch probabilities. So, it would be nice if somebody else reviews that part.

lib/Target/X86/X86CondBrFolding.cpp
474

“should” is repeated here.

475

s/brand/branch

This revision is now accepted and ready to land.Oct 1 2018, 12:22 PM
xur updated this revision to Diff 167841.Oct 1 2018, 3:07 PM
xur marked 2 inline comments as done.

Fixed typos identified by Andrea.

This revision was automatically updated to reflect the committed changes.
In D46662#1246781, @xur wrote:

Hi Andrea,

Thanks for running this test, and the explanation. Can you run the tests
on Bulldozer/Ryzen? I don't have access to these platforms. If I need to do
this in a subtarget-specific way, it would be good to know the performance there.

CC'ing @lebedev.ri and @GGanesh.
They should be able to help you with running those tests on Bulldozer/Ryzen. Unfortunately, I don't have access to those machines.

I *think* this should be fine on bdver2, as per https://www.agner.org/optimize/microarchitecture.pdf:

19.15 Branches and loops
The branch prediction mechanism is described on page 34. There is no longer any
restriction on the number of branches per 16 bytes of code that can be predicted efficiently.
The misprediction penalty is quite high because of a long pipeline.

...
Bench: 4evencases.cc
...
Bench: 15evencases.cc
...
I wouldn't be surprised if instead this patch improves the performance of code on other big AMD cores like Bulldozer/ryzen.

Are these benchmarks available from somewhere? Can I run them somehow?


Roman

@lebedev.ri wrote:

Are these benchmarks available from somewhere? Can I run them somehow?

Sorry Roman,
I completely missed that comment.

Those two benchmarks were attached by Xur to this code review.
You should be able to see the attachments if you expand the “Show Older Changes” section (there is a link at the top of this review).
One of his posts has got 3 attachments. Two of these files are the benchmarks to run.

I hope it helps.

-Andrea


Herald added a project: Restricted Project. · View Herald TranscriptMar 25 2019, 2:43 PM
Herald added a subscriber: jdoerfert. · View Herald Transcript

4evencases.cc

There is a regression with -O3 -march=native on Haswell. It is slower than standard -O3. GCC 9 with -O3 -march=native produces faster code than standard -O3.

@andreadb wrote:

Those two benchmarks were attached by Xur to this code review.
You should be able to see the attachments if you expand the “Show Older Changes” section (there is a link at the top of this review).
One of his posts has got 3 attachments. Two of these files are the benchmarks to run.

Aha! Not sure how I did not find those. Thank you!


Roman

I *think* this should be fine on bdver2, as per https://www.agner.org/optimize/microarchitecture.pdf:

Measurements (n=25) say that 15 cases improves (avg: -0.18%, median: -0.35%),
and 4 cases appears to improve (avg: -0.03%, median: +0.07%).
I will submit a patch.

Roman

lebedev.ri added a comment.EditedMar 31 2019, 8:42 AM

Uhm, is this missing some plumbing from the clang side?
The pass isn't being run even with -march=native, unless enabled via -mllvm -x86-condbr-folding=true.
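For reference, an invocation that forces the pass on would look like this (using the flag mentioned above; the source file name is a placeholder):

clang -O3 -march=native -mllvm -x86-condbr-folding=true 4evencases.cc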

lebedev.ri added inline comments.Mar 31 2019, 9:13 AM
llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
455–456 ↗(On Diff #168695)

So wait, shouldn't this respect FeatureMergeToThreeWayBranch?

Right, PR39658 / D54593, that explains this.