This is an archive of the discontinued LLVM Phabricator instance.

Expand SimplifyCFG to convert certain simple switches to selects
AbandonedPublic

Authored by kariddi on Jun 19 2014, 11:06 AM.

Details

Reviewers
None
Summary

This proposed patch adds some code to SimplifySwitch in SimplifyCFG in order to convert some switch constructs to selects when that is possible.

The idea here is to avoid unnecessary control-flow when the switch is simple enough to be converted into a simple one or two-way select with only a comparison. In order for that to be feasible the switch must be a the kind of "producing a value" like for example

switch (a) {

case 5:
  return 10;
case 9:
  return 11;
default:
  return 12;

}

The method uses three methods to try and transform the switch into a select.

First it checks if the number of cases is just 2 and in that case it just generates a simple comparison to check if the value is the one from the first case or from the second.
If this approach fails it then tries to see if case label values are dividable in two ordered groups and tries to emit a select that checks on a pivot value to see from which group we should select the value from.
If this is not possible it tries a third approach which is trying to identify some bits that are always one in one of the groups to select from and that are always zero in the other and use a bit mask to determine in which category the value falls into.

Reducing control flow in this way for these cases should be profitable for all hardware.
Please, comment on what you think about this.

Diff Detail

Event Timeline

kariddi updated this revision to Diff 10645.Jun 19 2014, 11:06 AM
kariddi retitled this revision from to Expand SimplifyCFG to convert certain simple switches to selects.
kariddi updated this object.
kariddi edited the test plan for this revision. (Show Details)
kariddi added a reviewer: deleted.
kariddi set the repository for this revision to rL LLVM.
kariddi added a subscriber: Unknown Object (MLST).

Drive by miscellaneous style comment...

lib/Transforms/Utils/SimplifyCFG.cpp
3762

... this function is very very long. Any way to break it up a bit?

kariddi added inline comments.Jun 19 2014, 6:01 PM
lib/Transforms/Utils/SimplifyCFG.cpp
3762

Yeah, maybe it is possible to break up the various stages (case 1, case 2 and case 3) in helper functions to shorten it a little bit.

kariddi updated this revision to Diff 10673.Jun 19 2014, 6:50 PM

Modified to move some work into helper functions to improve readability

Looks pretty awesome, thanks for splitting it up.

Any performance numbers?

kariddi updated this revision to Diff 10674.Jun 19 2014, 7:15 PM

Updated to apply without hunks after recent commits

I made some micro benchmarks and the results are promising if the backend does a good job with select vs lookup table or branches + phi nodes.

For example on ARM this function:

int foo3_without_def(int a) {

a &= 0x6;
switch (a) {
  case 0:
    return 1;
  case 2:
    return 2;
  case 4:
    return 1;
  case 6:
    return 2;
  default:
    return 10;
}

}

was translated as:

_foo3_without_def_norm:
movw r1, :lower16:(l_switch.table1-(LPC1_0+8))
and r0, r0, #6
movt r1, :upper16:(l_switch.table1-(LPC1_0+8))
LPC1_0:
add r1, pc, r1
ldr r0, [r1, r0, lsl #2]
bx lr

while after the optimization is:

_foo3_without_def:
ubfx r0, r0, #1, #1
add r0, r0, #1
bx lr

Which is much faster.

On X86 I got mixed results with the select version being basically the same performance. After looking at the resulting code I noticed some strange code from the X86 backend.
The same function above without the optimization is translated as:
_foo3_without_def_norm:
andl $6, %edi
leaq l_switch.table1(%rip), %rax
movl (%rax,%rdi,4), %eax
retq

with the optimization:

_foo3_without_def:
pushq %rbp
movq %rsp, %rbp
shrl %edi
andl $1, %edi
leal 1(%rdi), %eax
popq %rbp
retq

As you can see the backend emits some (I think) unnecessary pushing/moving and popping of RBP, which takes away the advantage of the select with respect to the lookup table.
I don't really know why the X86 backend emits those to be honest, but I'm pretty sure they aren't needed and that everything could be done with 3 instruction. In this way this code could be made as fast as the ARM version with respect to the lookup table code.

kariddi added a comment.EditedJun 20 2014, 8:39 AM

Whoops, I didn't notice that on X86 I had frame-pointers enabled by default (default without -fomit-frame-pointer) and that is the cause of those pushing and popping.

After omitting frame pointers I get this output:

_foo3_without_def:
shrl %edi
andl $1, %edi
leal 1(%rdi), %eax
retq

with the one without optimization being:

_foo3_without_def:
andl $6, %edi
leaq l_switch.table1(%rip), %rax
movl (%rax,%rdi,4), %eax
retq

Which is short in instruction count, thanks to X86 assembly being quite expressive, but accesses memory.

In my micro benchmark this is the result of the optimized version:

marcellos-mbp:test_files Kariddi$ time ./test2

real 0m3.880s
user 0m3.878s
sys 0m0.002s

vs unoptimized:

marcellos-mbp:test_files Kariddi$ time ./test2_norm

real 0m4.675s
user 0m4.670s
sys 0m0.003s

for this switch:

int foo3_without_def(int a) {

a &= 0x6;
switch (a) {
  case 0:
    return 1;
  case 2:
    return 2;
  case 4:
    return 1;
  case 6:
    return 2;
  default:
    return 10;
}

}

reames added a subscriber: reames.Jun 20 2014, 3:53 PM

I really like the overall direction of this patch, but it could use a bit of cleanup for readability.

You might also be able to generalize some of the transform:

  • single unique value + default
  • default probably dead without transform
lib/Transforms/Utils/SimplifyCFG.cpp
4000

range loop?

4014

split the two early exit cases? might improve readability, particularly if the CaseSuccessor != CommonDest check is above BI.

4027

You might restructure this to put ResultFound closer to it's use and separate the two if/else clauses.

4039

It looks like you might end up with PHI unset if you had a switch with only a default case. Could such a switch reach here?

4080

It looks like this code is actually handling two cases + default. Update the comment?

Also, is it always profitable to replace one switch with two selects? I would guess not, but will defer to actual measurement.

4128

Could you add examples to your function comments? It would make them a bit more obvious.

4216

Since you're handling defaults, wouldn't it also make sense to handle one unique result and a default? i.e.
switch(i) {
case 4,5,6: return 5;
default: return 0;
}

4269

It seems like proving the switch default is dead is a useful optimization even if we can make no other changes. Should this be separated and generalized?

4278

The general convention used is an early return when a change is made, not a signal value (nullptr here) when a change isn't made. In this particular example, I believe using that pattern would make the code much easier to follow.

4288

Everything from here on could easily be a helper function.

Hi Philip, thanks for the comments!

I have a couple of questions about some of them to being better able to address them.

What you mean exactly with : "default probably dead without transform" ?

The other question is inlined.

kariddi added inline comments.Jun 20 2014, 8:36 PM
lib/Transforms/Utils/SimplifyCFG.cpp
4080

On X86 from the measurements I did with a very simple micro benchmark it seems profitable.

This is the code of the benchmark:

int foo1_with_def(int a) {

switch(a) {
  case 10:
    return 10;
  case 20:
    return 2;
}
return 4;

}

int main() {

int a = 0;
int ret = 0;
for (; a < 2000000000; ++a) {
  ret |= foo1_with_def(a);
}

return ret;

}

And the generated code for the functions containing the switch is (+ performance results):

LLVM without patch:
_foo1_with_def:
movl $10, %eax
cmpl $10, %edi
je LBB0_4
cmpl $20, %edi
jne LBB0_3
movl $2, %eax
retq
LBB0_3:
movl $4, %eax
LBB0_4:
retq

real 0m5.522s
user 0m5.519s
sys 0m0.002s

LLVM with patch:

_foo1_with_def:
cmpl $10, %edi
movl $4, %ecx
cmovel %edi, %ecx
cmpl $20, %edi
movl $2, %eax
cmovnel %ecx, %eax
retq

real 0m3.233s
user 0m3.230s
sys 0m0.002s

I don't have any real world application though to back this up (it looks like a good result though).

In addition to that X86 is quite a control flow "friendly" architecture with respect to many others and if the results are positive on X86 I would expect that on some more custom architectures where control flow is more expensive this optimization might actually be even more effective.

4269

Do you mean making it a generic independent static function in SimplifyCFG.cpp ?

kariddi added inline comments.Jun 21 2014, 3:40 AM
lib/Transforms/Utils/SimplifyCFG.cpp
4000

I checked here. I don't think it is possible with SwitchInst to iterate over the cases of the switch with range loop, because there are no begin()/end() methods and range loops only work over classes that expose those.

kariddi added inline comments.Jun 21 2014, 4:09 AM
lib/Transforms/Utils/SimplifyCFG.cpp
4014

The conditions for the exit cases here are &&ed and not ||ed . So the exit case here is actually one (when both the conditions are verified). What exactly do you mean here?

4128

Very good point, I'll do that definitely in the next revision

4216

This is a good idea. I need to check if that is profitable in performance though (needs two comparisons to check if the value falls in the range). If it seems profitable I'll add it.

4278

I see, I will make the change, thanks.

4288

This is quite needed in order to make the change you requested in the comment above

Sorry for having added the comments in different moments, but the answers came out at different moments in times :P

lib/Transforms/Utils/SimplifyCFG.cpp
4039

I tried to catch a case here where that happens, but I couldn't reproduce this case by trying to compile something like this:

define i32 @foo(i32 %a) #0 {
entry:

switch i32 %a, label %sw.default [
]

sw.default: ; preds = %entry

ret i32 10

}

with opt -simplifycfg

Anyway, if that is the case the test:

if (UniqueResults.size() != 2)
  return false;

in SwitchToSelect() should fail, because a PHI must be set if at least one Result is found, which would prevent the assertion on the next line to trigger.

I think bailing out in a case like that would be ok, because I don't think this is the right place where to handle a case like this one.

kariddi added inline comments.Jun 21 2014, 7:21 AM
lib/Transforms/Utils/SimplifyCFG.cpp
4216

I actually checked and seems like this specific case is already handled by TurnSwitchRangeIntoICmp()

kariddi updated this revision to Diff 10751.EditedJun 23 2014, 6:37 AM

New revision.

In this revision I addressed most of the comments from Philip that I could, in particular:

  • I restructured ResultFound to be closer to it's use.
  • Updated the comment about the two cases case.
  • Added examples on the various cases helper functions
  • Handled directly the case "single range + default" i.e. switch (a) { case 4,5,6: return 20; default: return 10; } before this case was handled by TurnSwitchRangeIntoICmp + SimplifyCondBranch. Now it is handled directly. I updated the patterns of some tests for this , because the name of the generated virtual registers was different, but the generated code is the same.
  • Switched to an early return approach in the function
  • Moved the Switch removal + PHI fixup into an helper function

In addition to that I:

  • Changed the order of execution of TurnSwitchRangeIntoICmp to run after SwitchToSelect , so that this function can handle directly some cases handled by TurnSwitchRangeIntoICmp + SimplifyCondBranch. Changing this order shouldn't have any other side effect.
  • Modified the function to make optimization that have only "one value + default value" (like the "single range + default" above) possible to implement, so in the future if someone has an idea to add here that matches this pattern it is possible to add.

Answering inline questions. Will glance at the revised version in separate review.

lib/Transforms/Utils/SimplifyCFG.cpp
4000

Maybe something we should add. Not asking that of you now.

4014

I had misread the code when scanning quickly. Please ignore previous comment.

4039

Your answer makes sense. I worry a bit that we're developing a series of small optimization tests where a single more general one would be better, but that's a concern for later.

4080

Thanks for the data. I'm not entirely sure about how this will play out on other architectures (i.e. ones without a conditional move). I am not requesting you hold the patch on this mind you. Just be prepared in case someone objects based on non-x86 perf data. :)

4128

Thanks for doing this. It helped a lot.

4216

Thanks for following up. This fits in with my "series of small optimizations" comment above, but not something which needs to be addressed now.

4269

For this change, no action needed.

I was essentially wondering whether the optimization "prove default case is dead based AnalyzeSwitchCases" would apply to switches with more than two groups or for which we can't completely fold to selects. This might be worth looking at as a follow up change.

I'm running out of useful comments. :) I don't have commit rights, so I can't give you an official LGTM, but what I see looks reasonable.

Side note: this patch is getting really long. To make review easier, a set of smaller changes would be a good idea. Particularly since you have several independent changes which build on each other here.

Also, since I'm not sure it was clear before, my wondering about other possible optimizations are _not_ things you need to fix before submission. They might be TODOs to add or suggestions for future work, but unless they lead you to what you consider as a bug, you do not need to purse them. I worry I may have led you slightly astray on that.

lib/Transforms/Utils/SimplifyCFG.cpp
184

This looks like an entirely new optimization rather than moving code around? If so, do you mind separating this as a separate patch? Most of the rest seems fairly solid, this needs a bit more work.

I had trouble tracing through the logic and assumptions here. Example? (Feel free to post a different review.)

236

To make your example clearer, you might vertically separate the two ands. I had missed the &= usage on the left in first read through.

Hi, sorry for the late answer.

Thanks for all your precious feedback Philip.

I think you are right, the patch grew quite a bit and it became quite difficult to audit.

I think I'll split it up into smaller pieces and propose them independently.

kariddi abandoned this revision.Jul 1 2014, 7:27 AM