This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Do not perform tail sinking if there are extra moves introduced
Needs Review · Public

Authored by xur on Aug 15 2017, 9:54 AM.

Details

Reviewers
davidxl
jmolloy
Summary

SinkThenElseCodeToEnd() tries to sink the common instructions into a new BB (sink.split).
But it does not take into account the potential moves introduced in the original BBs.

As shown in the newly added test case (test19 in sink-common-code.ll), we get
extra moves in the final assembly (which also prevent constant folding). The
optimization hurts performance and does not reduce code size either.

This patch checks whether the phi operands are constants. If they are, an extra
move is likely needed for each one. If the total number of moves inserted is greater
than the number of instructions saved by tail sinking, the sinking is not performed.
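
In rough terms, the check described above amounts to something like the following C++ sketch (the function name and interface are hypothetical, not the patch's actual code):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical helper: count PHI operands that are constants. Each such
// operand is likely to need a separate register materialization (a mov) in
// its predecessor once the common instructions are sunk, so compare that
// count against the number of instructions the sink would remove.
static bool sinkingIsProfitable(ArrayRef<PHINode *> SinkPHIs,
                                unsigned NumInstructionsSunk) {
  unsigned LikelyExtraMoves = 0;
  for (PHINode *PN : SinkPHIs)
    for (Value *Incoming : PN->incoming_values())
      if (isa<Constant>(Incoming))
        ++LikelyExtraMoves;
  // Bail out when materializing the constants costs more than the sink saves.
  return LikelyExtraMoves <= NumInstructionsSunk;
}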

It only applies to the following pattern (an example is sketched below):

if (a)
  f1;
else if (b)
  f2;
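
For illustration, a hypothetical source-level example of this pattern (not the actual code of test19; the constants and types are made up):

// Each branch stores a different constant.
void foo(int a, int b, int *p) {
  if (a)
    *p = 10; // can lower to a single store with an immediate operand
  else if (b)
    *p = 20; // likewise
}
// If the two stores are sunk into a common block, the stored value becomes a
// PHI of 10 and 20, so each predecessor must first materialize its constant
// in a register: one extra mov per path, and the immediate no longer folds
// into the store instruction.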

For the pattern

if (a)
  f1;
else
  f2;

if the condition is simple and the constants in f1 and f2 are in a close range,
tail sinking can remove the branch entirely, so this patch does not touch that case.
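
For contrast, a hypothetical sketch of that untouched case (the exact conditions under which SimplifyCFG forms a select depend on its cost checks):

// A plain if/else assigning two constants.
void bar(bool a, int *p) {
  if (a)
    *p = 10;
  else
    *p = 20;
}
// Sinking the store and selecting between the constants, roughly
// *p = a ? 10 : 20, can eliminate the branch altogether, which is usually a
// win even if a constant has to be materialized in a register.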

Diff Detail

Event Timeline

xur created this revision. Aug 15 2017, 9:54 AM
davidxl added inline comments. Aug 18 2017, 4:25 PM
lib/Transforms/Utils/SimplifyCFG.cpp
1683

Fix variable naming.

1725

Is the first check needed?

1734

reuced --> reduced.

1735

Should it be InstructionsToSink.size() -1?

1738

Explain a little more on the 'Select' part.

1741

For O3, should we disable the sinking whenever there is runtime overhead introduced (unless there is runtime savings compensating it)?

test/Transforms/SimplifyCFG/sink-common-code.ll
870

Here the store can still be sunk, while other instructions remain in their original location. This can slightly increase register pressure.

I don't feel like this is really the right approach. Maybe it is the only approach that will work, but I'd like to at least try to solve this differently.

Specifically, reasoning about 'mov's at the IR level really doesn't make sense. The IR is much more abstract than that, and in fact is SSA form! =/ I feel like we'll end up with a heuristic that is pretty brittle and also have to deal with canonicalization regressions due to slicing up basic blocks differently...

It is also probably wrong for some architectures. Think about a very RISC architecture, or even x86 when the constants are too large to fold into an immediate operand of an instruction. In those cases, commoning the actual logic and just setting up the constants to flow into it seems like a real win.

Given how much target awareness you end up needing to make this decision even for x86 (a relatively CISC-y architecture), I think I'd suggest an MI-level pass that works something like the following...

  1. Build up a table of x86 arithmetic instructions that we can fold immediates into, and the size of immediate that can be folded. These will be similar to the tables in X86InstrInfo.cpp. Eventually, we should encode this in the .td files and extract it, but I'm not suggesting crossing that bridge today. (A rough sketch of such a table follows after this list.)
  2. Use this to try to hoist instructions into their predecessors if doing so allows folding an immediate operand, thereby reducing # uops and potentially register pressure. If the predecessors aren't reachable from each other, you can do this even if only a single predecessor allows the fold, without increasing the dynamically executed instruction count. But we don't have to try to be that fancy at first. On x86 at least, replacing mov <imm>, %reg; <op> %reg, %reg with <op> <imm>, %reg seems like a solid win.
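
A minimal sketch of what the table in step 1 might look like, assuming it lives inside the X86 backend; the opcode names exist in LLVM's X86 target, but the struct, the table, and its entries are purely illustrative:

// Assumes this sketch sits in lib/Target/X86; in-tree, this header pulls in
// the X86::* opcode enum.
#include "MCTargetDesc/X86MCTargetDesc.h"

// Map the register-register form of an arithmetic instruction to its
// register-immediate form and the widest immediate that form can encode.
struct FoldableImm {
  unsigned RegRegOpc;  // e.g. X86::ADD64rr
  unsigned RegImmOpc;  // e.g. X86::ADD64ri32
  unsigned MaxImmBits; // immediate width the ri form accepts
};

static const FoldableImm FoldableImmTable[] = {
    {llvm::X86::ADD64rr, llvm::X86::ADD64ri32, 32},
    {llvm::X86::SUB64rr, llvm::X86::SUB64ri32, 32},
    {llvm::X86::AND64rr, llvm::X86::AND64ri32, 32},
    // ... to be extended, and eventually generated from the .td files.
};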

Does this make sense? Are there problems with this approach?

A specific place where this approach seems like it would help would be when the immediates require >32 bits and thus will always have a mov regardless of CFG (movabsq).

(if this does make sense, I'm happy to take a stab at implementing it)