Remove redundant copy in recurrences
Needs ReviewPublic

Authored by twoh on Apr 7 2017, 11:19 AM.

Details

Summary

If there is a chain of instructions formulating a recurrence, commuting operands can help removing a redundant copy. In the following example code,

BB#1: ; Loop Header
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6: ; Loop Latch
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def,tied1> = ADD32rr %vreg1<kill,tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1,%vreg0
  %vreg3<def,tied1> = ADD32rr %vreg2<kill,tied0>, %vreg10<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2,%vreg10
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>

Existing two-address generation pass generates following code:

BB#1:
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6:
    Predecessors according to CFG: BB#5 BB#4
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def> = COPY %vreg1<kill>; GR32:%vreg10,%vreg1
  %vreg10<def,tied1> = ADD32rr %vreg10<tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0
  %vreg3<def> = COPY %vreg10<kill>; GR32:%vreg3,%vreg10
  %vreg3<def,tied1> = ADD32rr %vreg3<tied0>, %vreg2<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>
  JMP_1 <BB#7>

This is suboptimal because the assembly code generated has a redundant copy at the end of #BB6 to feed %vreg13 to BB#1:

.LBB0_6:
  addl  %esi, %edi
  addl  %ebx, %edi
  cmpl  $10, %edi
  movl  %edi, %esi
  jl  .LBB0_1

This redundant copy can be elimiated by making instructions in the recurrence chain to compute the value "into" the register that actually holds the feedback value. In this example, this can be achieved by commuting %vreg0 and %vreg1 to compute %vreg10. With that change, code after two-address generation becomes

BB#1:
  %vreg0<def> = COPY %vreg13<kill>; GR32:%vreg0,%vreg13
  ...

BB#6: derived from LLVM BB %bb7
    Predecessors according to CFG: BB#5 BB#4
  %vreg2<def> = COPY %vreg15<kill>; GR32:%vreg2,%vreg15
  %vreg10<def> = COPY %vreg0<kill>; GR32:%vreg10,%vreg0
  %vreg10<def,tied1> = ADD32rr %vreg10<tied0>, %vreg1<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1
  %vreg3<def> = COPY %vreg10<kill>; GR32:%vreg3,%vreg10
  %vreg3<def,tied1> = ADD32rr %vreg3<tied0>, %vreg2<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg3,%vreg2
  CMP32ri8 %vreg3, 10, %EFLAGS<imp-def>; GR32:%vreg3
  %vreg13<def> = COPY %vreg3<kill>; GR32:%vreg13,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use,kill>
  JMP_1 <BB#7>

and the final assembly does not have redundant copy:

.LBB0_6:
  addl  %edi, %eax
  addl  %ebx, %eax
  cmpl  $10, %eax
  jl  .LBB0_1
twoh created this revision.Apr 7 2017, 11:19 AM
twoh edited the summary of this revision. (Show Details)Apr 7 2017, 11:19 AM
twoh updated this revision to Diff 94549.EditedApr 7 2017, 11:22 AM

small changes to the test.

echristo resigned from this revision.Apr 11 2017, 1:48 PM

Probably best to have Quentin or Matthias do it.

Friendly ping. Thanks!

twoh added a comment.Apr 21 2017, 11:16 AM

Ping. Thanks!

This looks pretty complicated for the task at hand. Wouldn't this be a simpler transformation to perform at a place where we are still in Machine SSA and have use-def information for free and also have phis around to indicate loops so we can do without a MachineLoopInfo instance.

twoh added a comment.Apr 21 2017, 11:39 AM

@MatzeB I was considering other passes as well for the sake of simplicity, but I concluded two-address instruction is the right place to do. The problem doesn't appear while we're in SSA, but becomes an issue when we make an decision about which operand register should be the destination register. I think this is why isProfitableToCommute function is in two-address instruction pass.

twoh added a comment.Fri, Apr 28, 9:40 AM

Ping. If you're too busy to review this patch, could you please recommend someone else? Thanks!

twoh added a comment.Fri, May 12, 2:36 PM

Ping. Can someone please let me know what would be the best way to have this patch reviewed? Thanks!

MatzeB added a reviewer: wmi.Mon, May 15, 10:59 AM

I'd like to hear @qcolombet's opinion about this, also +wmi who wrote the similar isRevCopyChain().

This patch

  • Nitpicks below
  • Do you have an idea of the compiletime impact of this patch?
  • I am slightly worried about the use of MachineLoopInfo: Is it really necessary here? Maybe the same can be done with some simpler dominance check? Can you tell when the MachineLoopInfo is/isn't available (as that will impact whether this rule is applied or not).

General Observations

I am not happy with the general approach of throwing more and more patterns at TwoAddressInstructions:

  • Yes, LLVMs handling of TwoAddressInstruction as a pre-RA pass is a bad idea IMO (and I don't know why it was done this way): We are unnecessarily constraining the allocation problem and don't necessarily do a good job upfront without knowing how the allocation will work out.
  • This pass adds another pattern at TwoAddressInstruction: It is similar to isRevCopyChain() but slightly more complicated so we end up with 160 extra lines of analysis for yet another pattern. If this trend continues we will have an even harder time maintaining this pass.
  • On the other hand this improves code quality today without rearchitecting the code.
lib/CodeGen/TwoAddressInstructionPass.cpp
187

This makes no sense! You are starting to use the machine loop info, all you do here is mark it preserved a 2nd time; There was already addPreservedID(MachineLoopInfoID) which had the same effect.

576

Use references for things that cannot be nullptr. Similar for a few other places here.

578–579

Don't use auto when the type isn't immediately obvious just by looking at the line. (auto is unfriendly towards the readers of your code). Similar in a few more places.

584

we don't tend to have a space before the ';' here. Did you try clang-format on your patch?

620

This only gives you the declared/minimum number of operands, there may be more.

713–714

Isn't this check superfluous? I would expect this to be true anyway if the instruction is inside MBB which is checked later.

test/CodeGen/X86/twoaddr-recurrence.ll
2–3

Did you try to create a .mir test so the pass can be tested in isolation? If yes and it didn't work, could you tell us why so we can improve the .mir testing.

twoh updated this revision to Diff 99207.Tue, May 16, 2:42 PM

Addressing comments from @MatzeB.

twoh added a comment.Tue, May 16, 2:42 PM

@MatzeB Thank you for your comments. I'm collection compile time numbers now with spec2006. For our internal benchmark, which takes about ~40min to compile, the compile time difference was negligible.

I don't think this patch can be implemented without loop info, because this is formulated around the loop recurrence pattern. It might be possible but will eventually require more code to produce the information provided by MachineLoopInfo. I'm curious what is your biggest concern about using MachineLoopInfo in this pass, as it is used in other passes as well.

I agree on you that maintainability vs supporting more patterns is a hard trade-off. IMHO, if we want to generate better performing code, adding more code to support more patterns would be unavoidable, unless we re-architecturing register allocation related passes. I think this patch adds more lines of code because this handles the pattern based on the loop structure for the first time.

I tried .mir test, but the output is still not two-address format. I tried

./bin/llc -mtriple=x86_64-- -run-pass=twoaddressinstruction -o - ./input.mir

(input.mir is from -stop-before=twoaddressinstruction), and the output still has instructions such as

%vreg10<def,tied1> = ADD32rr %vreg1<kill,tied0>, %vreg0<kill>, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg1,%vreg0
.

lib/CodeGen/TwoAddressInstructionPass.cpp
713–714

This check is here to filter the definitions that are outside of Loop. The check below is to see if the definition inside the Loop belongs to MBB.

The problem doesn't appear while we're in SSA [...]

Why is that?
The marking of tie operand is already there, following phis shouldn't be an issue and would eliminate the whole data flow processing.

What am I missing?

Side question; if several recurrence chains share some operand, how do you pick the most profitable one?

twoh added a comment.Wed, May 17, 1:25 PM

@qcolombet Actually you're right. Having tied operands, this can be done within SSA. Do you have a suggestion for a better place to implement this? I thought what isRevCopyChain does is most close to what this patch does.

I'm afraid that I couldn't quite understand your side question. This patch commutes operands of an instruction if the instruction is inside the recurrence that meets certain conditions. In line 846-850, it checks if the recurrence pattern is formulated around regC, and returns true to commute operands. If not, it checks same for regB, and returns false to not to commute if it observes the pattern. If the pattern is observed for both regB and regC, it is possible that not to commute the operands generates better result, but as the logic checks regC first, it'll commute the operands.

@MatzeB Below is a compile time difference in percentage for spec2006

400.perlbench 1.80
401.bzip2 0.66
403.gcc 0.77
429.mcf 2.61
433.milc -1.70
444.namd 0.31
445.gobmk -0.05
447.dealII -0.22
450.soplex 0.48
453.povray 1.34
456.hmmer 0.49
458.sjeng 1.30
462.libquantum 1.90
464.h264ref 0.52
470.lbm -3.01
471.omnetpp 2.50
473.astar 0.27
482.sphinx3 1.01
483.xalncbmk -0.47

wmi added a comment.Fri, May 19, 12:04 PM

Hi Taewook,

Thank you for working on it. The testcase you gave is interesting. My previous work in isRevCopyChain is quite limited and only handle recurrence within a BB, so I like to see a more general fix about the redundent copy problem caused by recurrence.

A problem of the patch I can see is that it uses dataflow to track the recurrence, which I think may not be enough. Think about the case below:

r3 = r4;
...
r1 = r2 + r3;
...
r5 = load(r1)
r4 = r5;

r1-->r5-->r4-->r3-->r1, it is a recurrence loop according to the algorithm, however, we don't have problem to coalesce r3 with r4 and coalesce r4 and r5, because r5 and r1 don't have to be allocated to the same register.

So the def-src constructing an recurrence chain interesting for us should only contains def and src which are in a tied operand group, or which are in the same copy. In other word, the recurrence chain can only cause redundent copy problem when all the operands on the chain have to be allocated to the same register. That kind of recurrence loop is what we are interested in.

In addition, I don't think the recurrence chain has to be a strict cycle. Just like your motivational testcase:
vreg0 = vreg13
...
vreg2 = vreg15
vreg10 = add vreg1, vreg0
vreg3 = add vreg2, vreg10
vreg13 = vreg3

Here the recurrence chain starting from vreg0 is: vreg0-->vreg13-->vreg3-->vreg2. This is not a strict cycle, however, vreg2 and vreg0 already have interference with each other. It means it is impossible to allocate all the vregs in the recurrence chain to the same physical register, and some copy has to be kept.

I think those are the two issues that have to be addressed for a more general solution to the recurrence problem.

Thanks,
Wei.

twoh added a comment.Fri, May 19, 1:14 PM

@wmi Thank you for your reply. I agree on you that we should consider tied operand group, and as @qcolombet mentioned in the previous comment, if tie operand information is already available before this pass, I'd like to discuss where would be the best place to implement this.

I'm afraid I couldn't understand your second point. In the example there is a recurrence cycle of vreg0-->vreg13-->vreg3-->vreg10-->vreg0 as well, and what this patch does is commuting (vreg0, vreg1) and (vreg2, vreg10) to remove the copy. Did I miss something?

Hi Taewook,

@qcolombet Actually you're right. Having tied operands, this can be done within SSA. Do you have a suggestion for a better place to implement this? I thought what isRevCopyChain does is most close to what this patch does.

I would investigate this as part of the PeepholeOptimizer. It already massages copies.

Regarding my second question, answered it, with that part:

If the pattern is observed for both regB and regC, it is possible that not to commute the operands generates better result, but as the logic checks regC first, it'll commute the operands.

Cheers,
-Quentin